What is Happening with My Research Project Right Now?

Tools for Building and Sharing Dynamic Reports from Web-Based Data Collection

Chuck Cleland

September 26, 2018

About Me

Senior Research Scientist and Biostatistician at NYU Rory Meyers College of Nursing
Director of the Methods Core of the Center for Drug Use and HIV Research
Research and Methods Interests
- HIV prevention and care
- health disparities
- intervention development
- meta-analysis
- longitudinal data analysis
Contact
- email: cmc13@nyu.edu
- CV: https://clelandcm.github.io/Cleland-CV/

I think of myself as a collaborative research scientist

I have written many successful analysis plans and applied a variety of advanced methods

I have not been a developer of new methods

Goals for Today

Think about the problem
See what is possible
Don’t get hung up on the technical details
I am happy to help outside of this session with the technical details of anything I show

Toward Better Documentation and Communication of Analysis

Reproducible research
Put as much as possible in one place
Tailor communication for different audiences easily
- Show/hide elements which may help/detract depending on the audience
Easily produce documents in more than one format
- HTML
- MS-Word
- PDF
- Slides

Opportunities

Web-based data collection allows access to data as it is collected
Summarizing project data more efficiently makes it easier to detect and solve problems before it’s too late
Sharing dynamic reports widely across the project team can make project management smoother and less error prone for staff, and it gets more eyes and minds on the data

Benefits

Reproducible research - steps from raw data through data management, analysis, visualization, results and discussion clearly documented in one place
For investigators, once an element is built into a report, no need to ask the data team to update it multiple times
- Investigators know where to go for current useful project summaries
For the data team, once an element is built, can be useful over the whole project
- Don’t need to respond to the same request many times

Challenges

Changes over time in project data elements
- New variables, changed variables, removed variables
Changes over time in tools used to pull data from its cloud and take it through steps of data management, analysis, visualization, report writing, and sharing
Need to learn some coding
Don’t get stuck thinking only about the elements in the report - always be thinking about what’s not in it

What is REDCap?

Secure web application for building and managing surveys and databases
Used by thousands of institutions in > 100 countries
Can be used to collect almost any kind of data, including in environments that require compliance with:
- 21 CFR Part 11
- Federal Information Security Management Act (FISMA)
- HIPAA

REDCap API

API is ‘Application Programming Interface’
REDCap API is an interface that allows external applications to connect to REDCap remotely
Used for programmatically retrieving or modifying data or settings within REDCap
Automated data imports/exports from a specified REDCap project
In order to use the REDCap API for a given REDCap project, you must first be given a token by the administrator that is specific to your username for that particular project
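
As a rough illustration of what a programmatic request involves, the sketch below assembles the form fields for a record export. The field names (`token`, `content`, `format`, `type`) are the ones the REDCap API documentation lists for exporting records; the helper names, the token value, and the URL argument are hypothetical, and `export_records_raw()` is defined but never called here because it needs network access.

```r
# Assemble the form fields for a REDCap "export records" request.
# The token value used below is a placeholder, not a real token.
build_export_body <- function(token, format = "csv") {
  list(token   = token,     # project- and user-specific API token
       content = "record",  # ask for project records
       format  = format,    # "csv", "json", or "xml"
       type    = "flat")    # flat layout rather than entity-attribute-value
}

# Hypothetical helper: send the request with the httr package.
# Defined for illustration only; it is not called in this sketch.
export_records_raw <- function(api_url, token) {
  resp <- httr::POST(api_url, body = build_export_body(token), encode = "form")
  httr::content(resp, as = "text")
}

body <- build_export_body("PLACEHOLDER_TOKEN")
str(body)
```

In practice a package such as redcapAPI hides these details, so this is only meant to show what "connecting remotely" amounts to.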

Wanted for Dynamic Reports

Automated to increase efficiency and reduce errors
Can be generated and shared in a modest amount of time
Can include multiple elements such as headers, tables, figures, bulleted text
Done from one place - no going back and forth between different software
Can create reports in multiple formats
Allows for interaction with the data by users?

Tools

REDCap
- Web-based data collection
- Accessible database with all project data
R
- Use via RStudio
- Interact with REDCap
- Create automated summaries (dynamic reports)
GitHub
- Sharing reports with colleagues as web pages

REDCap

Data dictionary provides labelled data essentially on-demand
Once the dictionary is constructed, and as it is revised, variable and value labels are part of the REDCap project and will be exported
Everything collected in one place
Data management needed, but not lots of merging
Other data collection tools (e.g., Qualtrics) may be able to fill the same role as REDCap

R Steps

Pull data from REDCap project using API
Generate dynamic reports, including headers, tables, figures, and comments
Knit R Markdown into HTML (or other formats)
RStudio is an integrated development environment (IDE) for R and makes using R easier in many ways

GitHub Steps

Create a repository for the project (one time)
Push regularly updated html files to the repository
Repositories are public (private repository options exist for academic institutions and for those willing to pay)
There are other ways reports could be shared or hosted
- Email, Box, Dropbox
- A blog
- I use MS-Excel and a private folder if showing unique participants
GitHub is much more than a way to host web pages

What To Include in Reports

Date
CONSORT flow diagram
Background on the scoring of measures
Reliability coefficients, frequency tables, crosstabulations, descriptive statistics by group
Time between key study activities
Completion/non-completion of assessment and intervention activities (might include identification of individuals, so not for public)

Useful R Packages

Pull data from its cloud
- redcapAPI
Data management
- tidyverse, data.table
Present results
- knitr, pander, ggplot2, igraph, Rgraphviz
Build interactive reports
- Shiny

Bringing REDCap Data into R

API access to REDCap requires a “token” for security
I obtained a token for my REDCap project from the NYU School of Medicine, which administers REDCap
REDCap itself shows ways to export and import project data using API for multiple languages (R, Python, others)
I prefer to use an R package called redcapAPI for this purpose

Getting the Data with redcapAPI

My token is inside an MS-Excel file that only I can access (not public)
I establish a connection with the NYUSOM REDCap server using my token
I export project records using a function in the redcapAPI package
I need to do an advanced login to establish a VPN connection with NYUSOM for this step to work

Successful VPN Connection

R Code to Export Records from REDCap

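library(redcapAPI)
library(openxlsx)
library(data.table)

# Read the API token from a private spreadsheet (not shared with the report)
tkns <- read.xlsx("c:/chuck/NYU/NoART/REDCap/tkns.xlsx")

# Open a connection to the REDCap server using the token
rcon <- redcapConnection(url = 'https://openredcap.nyumc.org/apps/redcap/api/',
                         token = tkns[4,3])

# Export project records in batches of 700 and store them as a data.table
hth <- data.table(exportRecords(rcon, batch.size = 700))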
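
A possible alternative to keeping the token in a spreadsheet is an environment variable, for example one set in a user-level `.Renviron` file. The sketch below assumes a variable named `REDCAP_TOKEN`, which is a made-up name, and `get_redcap_token()` is a hypothetical helper. It stops with a clear message when the token is missing, so a report does not fail later with a less obvious error.

```r
# Hypothetical helper: read an API token from an environment variable.
# "REDCAP_TOKEN" is an assumed name, not something REDCap or redcapAPI requires.
get_redcap_token <- function(var = "REDCAP_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  token
}

# Demonstration with a placeholder value (a real token would be set outside the script)
Sys.setenv(REDCAP_TOKEN = "PLACEHOLDER_TOKEN")
token <- get_redcap_token()
```

The returned value could then be passed as the `token` argument of `redcapConnection()`.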

Data in R, Now What?

Summarize, which will involve various kinds of data management
Organize into sections
- Screening
- Baseline Interview
- Follow-ups
Address key questions and metrics, but often less is better
- How many people have gone through each step of screening?
- How many eligible people enroll?
- What is the current rate of follow-up?
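
A minimal sketch of how the three questions above might be answered, using base R and a small made-up data frame. The column names (`eligible`, `enrolled`, `fu_due`, `fu_complete`) are assumptions for illustration and would differ in a real project.

```r
# Hypothetical participant-level data: one row per person screened
scr <- data.frame(
  pid         = 1:8,
  eligible    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  enrolled    = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
  fu_due      = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE),
  fu_complete = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# How many people have gone through each step?
n_screened <- nrow(scr)
n_eligible <- sum(scr$eligible)
n_enrolled <- sum(scr$enrolled)

# How many eligible people enroll?
enroll_rate <- n_enrolled / n_eligible

# Current follow-up rate: completed follow-ups among those whose follow-up is due
fu_rate <- sum(scr$fu_complete & scr$fu_due) / sum(scr$fu_due)

round(c(Screened = n_screened, Eligible = n_eligible, Enrolled = n_enrolled,
        Enroll_Rate = enroll_rate, FU_Rate = fu_rate), 2)
```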

R Markdown

R Markdown is a unified authoring framework for data science
Combines code, results, and commentary
R Markdown files are the source code for rich, reproducible documents
Contain text written in markdown, a set of conventions for formatting plain text:
- bold and italic text
- lists
- headers (e.g., section titles)
- hyperlinks
- and much more

Minimal R Markdown Example

---
title: "My Minimal Example"
author: "Chuck Cleland"
date: "March 18, 2018"
output: html_document
---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

```

# This is a top-level header

This is an R Markdown document. Markdown is a simple formatting syntax  
for authoring HTML, PDF, and MS Word documents. For more details on  
using R Markdown see <http://rmarkdown.rstudio.com>.

## This is a second-level header

```{r cars}

summary(mtcars$mpg)

```

Some comments about these data:

- These are data on 32 cars
- Variables include miles per gallon, cylinders, and weight
- Heavier cars go fewer miles per gallon of fuel

Rendering R Markdown

To transform your markdown file into an HTML, PDF, or Word document, click the “Knit” icon that appears above your file in the scripts editor.
A drop down menu will let you select the type of output that you want.
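
Rendering can also be scripted rather than done by clicking Knit, which may help when a report is regenerated on a schedule. The sketch below relies on `rmarkdown::render()`; the file names and both helper functions are hypothetical, and `render_report()` is defined but not called because it needs an existing .Rmd file.

```r
# Build a dated output file name so each run of the report is kept
dated_report_name <- function(stem, date = Sys.Date()) {
  paste0(stem, "_", format(date, "%Y-%m-%d"), ".html")
}

# Hypothetical wrapper around rmarkdown::render(); not called in this sketch
render_report <- function(rmd_file, stem = "weekly_report") {
  rmarkdown::render(input = rmd_file,
                    output_format = "html_document",
                    output_file = dated_report_name(stem))
}

dated_report_name("weekly_report", as.Date("2018-03-23"))
# "weekly_report_2018-03-23.html"
```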

Minimal Example Result

YAML Header

This first part of an R Markdown file is called the YAML header
YAML (rhymes with “camel”) is a human-friendly, cross language, Unicode based data serialization language
Key value pairs that control aspects of how the document is rendered
- Output format
- Output styling
- Table of contents
- Title, Author, Date

---
title: "Weekly Project Report"
author: "Chuck Cleland"
date: "March 23, 2018"
output:
  html_document:
    highlight: tango
    theme: cerulean
    toc: yes
    toc_depth: 5
---

R Code Chunks

In R Markdown files, a line of three backticks followed by {r} signals that a “chunk” of R code is starting, and a closing line of three backticks signals that it is ending
At the start of the chunk, chunk-specific options can be set to control things like figure size and whether R code and/or results are shown in the final document or not
R code chunk example:

```{r}

summary(mtcars)

Model_one <- lm(mpg ~ wt, data = mtcars)

summary(Model_one)

```

Tip

A shortcut to create a new empty code chunk: Ctrl + Alt + I

Inline R Expressions

Outside of R code chunks, R expressions can be used to fill in values that change over time in headers or other text in the report

Examples:
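
A small sketch of how inline expressions might be used. An inline expression is written as a backtick, the letter r, a space, the expression, and a closing backtick. The data frame and variable names below are made up for illustration.

```r
# Values that a code chunk near the top of the report might compute.
# `screening` is made-up data standing in for records pulled from REDCap.
screening <- data.frame(pid = 1:5,
                        eligible = c(TRUE, FALSE, TRUE, TRUE, FALSE))
n_screened   <- nrow(screening)
n_eligible   <- sum(screening$eligible)
pct_eligible <- round(100 * n_eligible / n_screened)

# In the R Markdown text, inline expressions would then look like this:
#   Of `r n_screened` people screened, `r n_eligible` (`r pct_eligible`%) were eligible.

# The same sentence assembled in plain R, to show what knitting would produce
sentence <- sprintf("Of %d people screened, %d (%d%%) were eligible.",
                    n_screened, n_eligible, pct_eligible)
sentence
# "Of 5 people screened, 3 (60%) were eligible."
```

Because the numbers are computed each time the report is knit, the sentence stays current as new data come in.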

Figures

Convey a large amount of information concisely
Visually interesting and engage the reader in a different way than text
Variety of figures useful in project reports
- Flow diagrams
- Distributions of key variables: barcharts, histograms, boxplots, density plots
- Peer recruitment diagrams
- Eligibility over time
- Time to complete study activities: interviews, biological testing
I use R tools to prepare data for figures (tidyverse, data.table) and to produce the figures (ggplot2, igraph)

Eligibility Rate Over Time

library(tidyverse)

Day <- seq(as.Date("2014/5/1"), as.Date("2015/4/20"), "days")

Scr_Data <- data.frame(Date = sample(Day, 950, replace = TRUE),
                       Scr_ELG = sample(c('Eligible','Ineligible'), 
                                        950, replace = TRUE)) %>%
  arrange(Date) %>%
  mutate(Day = cut(Date, breaks = "days"),
         Month = cut(Date, breaks = "months"),
         Total = row_number(),
         Eligible = cumsum(Scr_ELG == "Eligible"),
         Ineligible = cumsum(Scr_ELG == "Ineligible"),
         Cumulative = Eligible / Total)

head(Scr_Data) %>% 
  knitr::kable()

Date	Scr_ELG	Day	Month	Total	Eligible	Ineligible	Cumulative
2014-05-01	Ineligible	2014-05-01	2014-05-01	1	0	1	0.0000000
2014-05-02	Eligible	2014-05-02	2014-05-01	2	1	1	0.5000000
2014-05-02	Eligible	2014-05-02	2014-05-01	3	2	1	0.6666667
2014-05-02	Ineligible	2014-05-02	2014-05-01	4	2	2	0.5000000
2014-05-02	Ineligible	2014-05-02	2014-05-01	5	2	3	0.4000000
2014-05-02	Eligible	2014-05-02	2014-05-01	6	3	3	0.5000000

p1 <- Scr_Data %>%
  ggplot(aes(x = Date, y = Cumulative)) + 
  geom_point(pch=19, color = "blue", size = 3, alpha = .33) +
  labs(x = "", y = "Proportion Eligible") +
  scale_y_continuous(limits=c(0,1)) +
  theme_minimal()
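
A cumulative proportion smooths over month-to-month changes. As a complementary view, the sketch below tabulates the number screened and the proportion eligible within each calendar month using base R. The data are simulated in the same spirit as the slide's example, with an arbitrary seed and a smaller sample; all object names are made up.

```r
set.seed(123)  # arbitrary seed so the simulated data are reproducible
days <- seq(as.Date("2014-05-01"), as.Date("2015-04-20"), by = "day")
sim <- data.frame(
  Date    = sample(days, 200, replace = TRUE),
  Scr_ELG = sample(c("Eligible", "Ineligible"), 200, replace = TRUE)
)

# Label each screening date with its calendar month, e.g. "2014-05"
sim$Month <- format(sim$Date, "%Y-%m")

# Number screened and proportion eligible within each month
n_month    <- tapply(sim$Scr_ELG, sim$Month, length)
prop_month <- tapply(sim$Scr_ELG == "Eligible", sim$Month, mean)

monthly <- data.frame(Month = names(n_month),
                      N = as.vector(n_month),
                      Prop_Eligible = round(as.vector(prop_month), 2))
monthly
```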

How Long for Blood Results?

Day1 <- sample(seq(as.Date("2014/5/1"), as.Date("2015/4/20"), "days"), 
               950, replace = TRUE)
Day2 <- Day1 + round(rchisq(950, df = 9))

Scr_Data <- data.frame(Day1, Day2)

head(Scr_Data)

        Day1       Day2
1 2014-08-15 2014-08-25
2 2015-01-15 2015-01-21
3 2015-01-21 2015-01-31
4 2015-04-03 2015-04-12
5 2014-12-25 2015-01-01
6 2015-01-25 2015-02-12

Scr_Data %>%
  mutate(Days_Between = as.numeric(difftime(Day2, Day1, units = "days"))) %>%
  ggplot(aes(x = Days_Between)) +
  geom_density(alpha = .3, fill="#4169E1") +
  labs(x = "Days Between Start of Screening and Blood Results") +
  theme_minimal()

Time to Complete Interview

head(DF)

                Start                 End
1 2016-01-03 13:49:00 2016-01-03 14:50:33
2 2016-11-02 13:19:00 2016-11-02 14:27:14
3 2017-07-03 13:21:00 2017-07-03 14:21:04
4 2016-08-14 10:13:00 2016-08-14 11:32:07
5 2016-12-03 15:47:00 2016-12-03 16:38:59
6 2017-08-10 13:56:00 2017-08-10 15:13:19

DF %>% 
  mutate(Minutes = as.numeric(difftime(End, Start, units = "mins"))) %>%
  ggplot(aes(x = Minutes)) + 
  geom_density(fill="#4169E1", alpha = .33) +
  theme_minimal()

Tables

Frequency distribution for key variables
Crosstabulations of key variables
Descriptive statistics for continuous variables
Descriptive statistics by a grouping variable
R packages can improve the look of tables in HTML files
- knitr
- pander
- xtable

Frequency Table

library(tidyverse)
library(pander)
library(descr)

panderOptions('table.split.table', Inf)

with(iris, 
     freq(Species, plot = FALSE)) %>%
  pander(digits = 1)

	Frequency	Percent
setosa	50	33
versicolor	50	33
virginica	50	33
Total	150	100

Crosstabulation

library(tidyverse)
library(pander)
library(descr)

panderOptions('table.split.table', Inf)

with(iris, 
     CrossTable(Species, Sepal.Width > 3,
                prop.chisq = FALSE,
                prop.t = FALSE,
                prop.c = FALSE)) %>%
  pander(digits = 1)

Species	Sepal.Width > 3: FALSE	Sepal.Width > 3: TRUE	Total
setosa, N (Row %)	8 (16.0%)	42 (84.0%)	50 (33.3%)
versicolor, N (Row %)	42 (84.0%)	8 (16.0%)	50 (33.3%)
virginica, N (Row %)	33 (66.0%)	17 (34.0%)	50 (33.3%)
Total	83	67	150

Descriptives for Multiple Variables

library(tidyverse)

iris %>%
  gather(Variable, Value, Sepal.Length:Petal.Width) %>%
  group_by(Variable) %>%
  summarize(n = n(), 
            Mean = round(mean(Value), 1),
            SD = round(sd(Value), 1),
            Median = median(Value),
            IQR = IQR(Value),
            Min = min(Value),
            Max = max(Value)) %>%
  knitr::kable()

Variable	n	Mean	SD	Median	IQR	Min	Max
Petal.Length	150	3.8	1.8	4.35	3.5	1.0	6.9
Petal.Width	150	1.2	0.8	1.30	1.5	0.1	2.5
Sepal.Length	150	5.8	0.8	5.80	1.3	4.3	7.9
Sepal.Width	150	3.1	0.4	3.00	0.5	2.0	4.4

Descriptives By Grouping Variable

iris %>%
  gather(Variable, Value, 
         Sepal.Length:Petal.Width) %>%
  group_by(Variable, Species) %>%
  summarize(n = n(), 
            Mean = round(mean(Value), 1),
            SD = round(sd(Value), 1),
            Median = median(Value),
            IQR = IQR(Value),
            Min = min(Value),
            Max = max(Value)) %>%
  knitr::kable()

Variable	Species	n	Mean	SD	Median	IQR	Min	Max
Petal.Length	setosa	50	1.5	0.2	1.50	0.175	1.0	1.9
Petal.Length	versicolor	50	4.3	0.5	4.35	0.600	3.0	5.1
Petal.Length	virginica	50	5.6	0.6	5.55	0.775	4.5	6.9
Petal.Width	setosa	50	0.2	0.1	0.20	0.100	0.1	0.6
Petal.Width	versicolor	50	1.3	0.2	1.30	0.300	1.0	1.8
Petal.Width	virginica	50	2.0	0.3	2.00	0.500	1.4	2.5
Sepal.Length	setosa	50	5.0	0.4	5.00	0.400	4.3	5.8
Sepal.Length	versicolor	50	5.9	0.5	5.90	0.700	4.9	7.0
Sepal.Length	virginica	50	6.6	0.6	6.50	0.675	4.9	7.9
Sepal.Width	setosa	50	3.4	0.4	3.40	0.475	2.3	4.4
Sepal.Width	versicolor	50	2.8	0.3	2.80	0.475	2.0	3.4
Sepal.Width	virginica	50	3.0	0.3	3.00	0.375	2.2	3.8

Follow-Up Rates

To summarize during the study, need to define when each participant is:
- not yet eligible for FU
- in the window period for FU
- past the window period (time for FU is up)
The rate of greatest interest is the follow-up rate once the window period is over
Participants move through the table from top to bottom
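
One way these three states might be coded, assuming a follow-up window that opens 60 days after enrollment and closes after day 120. Both numbers, the function name `fu_status()`, and the example data are placeholders for illustration.

```r
# Classify follow-up (FU) status as of a given report date.
# open_day and close_day are placeholder values for illustration only.
fu_status <- function(enroll_date, as_of, open_day = 60, close_day = 120) {
  days_in <- as.numeric(as_of - enroll_date)  # days since enrollment
  cut(days_in,
      breaks = c(-Inf, open_day, close_day + 1, Inf),
      labels = c("Not yet eligible", "In window", "Window closed"),
      right = FALSE)  # intervals are closed on the left: [open_day, close_day + 1)
}

# Made-up participants
fu <- data.frame(
  pid = 1:5,
  enroll_date = as.Date(c("2018-01-02", "2018-02-15", "2018-05-01",
                          "2018-07-20", "2018-09-01")),
  fu_done = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

as_of <- as.Date("2018-09-26")
fu$status <- fu_status(fu$enroll_date, as_of)

# Follow-up rate among participants whose window period is over
closed <- fu$status == "Window closed"
fu_rate_closed <- mean(fu$fu_done[closed])

table(fu$status)
fu_rate_closed
```

With these made-up dates, three participants are past the window and two of them completed follow-up, so the rate is 2/3.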

Further Customizing Tables

library(knitr)
library(tidyverse)
library(kableExtra)

Table 1

dt <- mtcars[1:5, 1:6]

kable(dt, "html") %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(5:7, bold = T) %>%
  row_spec(3:5, bold = T, color = "white", background = "#D7261E")

	mpg	cyl	disp	hp	drat	wt
Mazda RX4	21.0	6	160	110	3.90	2.620
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875
Datsun 710	22.8	4	108	93	3.85	2.320
Hornet 4 Drive	21.4	6	258	110	3.08	3.215
Hornet Sportabout	18.7	8	360	175	3.15	3.440

Table 2

iris[1:10, ] %>%
  mutate_if(is.numeric, function(x) {
    cell_spec(x, "html", bold = T, 
              color = spec_color(x, end = 0.9),
              font_size = spec_font_size(x))}) %>%
  mutate(Species = cell_spec(
    Species, "html", color = "white", bold = T,
    background = spec_color(1:10, end = 0.9, 
                            option = "A", direction = -1))) %>%
  kable("html", escape = F, align = "c") %>%
  kable_styling("striped", full_width = F)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa

CONSORT Flow Diagram

library(igraph)

DF <- data.frame(from = c('Screening\nInitiated','Screening\nInitiated',
                          'Eligible','Eligible','Enrolled','Enrolled'),
                 to = c('Eligible','Not Eligible','Enrolled','Declined',
                        'Intervention','Control'))

mynodes <- data.frame(Fixed_Name = c('Screening\nInitiated',
                                     'Eligible','Not Eligible',
                                     'Enrolled','Declined',
                                     'Intervention','Control'),
                      Count = c(500, 300, 194, 268, 32, 130, 138))

mynodes$Label = paste(mynodes$Fixed_Name, 
                      "\n(n=", mynodes$Count, ")", 
                      sep = "")

my_ig <- graph_from_data_frame(DF, vertices = mynodes)

plot(my_ig, layout=layout_as_tree, vertex.shape = "circle", 
     vertex.label = V(my_ig)$Label, 
     vertex.color = "#EAE5EB",
     vertex.size=60, asp = 1.2, margin = c(0.2,0.2,0.2,0.2))
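
In a dynamic report the node counts would normally come from the data rather than being typed in. The sketch below shows one way that might look in base R, assuming participant-level columns named `eligible`, `enrolled`, and `arm`; those names and the eight example rows are made up.

```r
# Hypothetical participant-level status variables (one row per person screened)
p <- data.frame(
  eligible = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE),
  enrolled = c(TRUE, TRUE, FALSE, NA, TRUE, NA, TRUE, FALSE),
  arm      = c("Intervention", "Control", NA, NA, "Intervention", NA, "Control", NA)
)

# Count people at each node of the flow diagram
node_counts <- c(
  "Screening\nInitiated" = nrow(p),
  "Eligible"     = sum(p$eligible),
  "Not Eligible" = sum(!p$eligible),
  "Enrolled"     = sum(p$enrolled, na.rm = TRUE),
  "Declined"     = sum(!p$enrolled, na.rm = TRUE),
  "Intervention" = sum(p$arm == "Intervention", na.rm = TRUE),
  "Control"      = sum(p$arm == "Control", na.rm = TRUE)
)

# Same structure as the vertex data frame used for the diagram
node_table <- data.frame(Fixed_Name = names(node_counts),
                         Count = as.vector(node_counts))
node_table$Label <- paste0(node_table$Fixed_Name, "\n(n=", node_table$Count, ")")
node_table[, c("Fixed_Name", "Count")]
```

A data frame built this way could be supplied as the `vertices` argument when the graph is created, so the diagram updates each time the report is knit.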

CONSORT Flow Diagram Result

Scoring Instruments

Common task is scoring an instrument composed of multiple Likert-type items
Data are often imported with each item in a separate column and labels rather than numbers
Below are a few lines of hypothetical item data:

head(DF)

  V1 V2 V3 V4 V5 V6 V7 V8 V9 PID
1  A  N  D  A  D  N  D  N  N   1
2  N SA  D  N SD  A SD  A  N   2
3  N  N  N  N  N  A  D  A  N   3
4 SA SA  D SA SD SA SD SA SA   4
5  A  A  D  N  D  A  D  A  A   5
6  A  N SD  A  D SA  D  A  A   6

Wrangling Instrument Items

DF %>%
  gather(Item, Response, -PID) %>%
  mutate(Response = ordered(Response, 
                            levels = c('SD','D','N','A','SA')),
         Response = as.numeric(Response) - 1,
         Response = ifelse(Item %in% c('V3','V5','V7'),
                           4 - Response, Response)) %>%
  spread(Item, Response) %>%
  head()

  PID V1 V2 V3 V4 V5 V6 V7 V8 V9
1   1  3  2  3  3  3  2  3  2  2
2   2  2  4  3  2  4  3  4  3  2
3   3  2  2  2  2  2  3  3  3  2
4   4  4  4  3  4  4  4  4  4  4
5   5  3  3  3  2  3  3  3  3  3
6   6  3  2  4  3  3  4  3  3  3
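
Once the items are numeric, a scale score is usually a sum or mean across items. The sketch below repeats the recoding in base R for the six rows shown above and adds a total and a mean item score. `DF6`, `Total`, and `Mean_Item` are names made up for this illustration, and whether a sum or a mean is appropriate depends on the instrument.

```r
# The six rows of hypothetical item data shown above
DF6 <- data.frame(
  V1 = c("A", "N", "N", "SA", "A", "A"),
  V2 = c("N", "SA", "N", "SA", "A", "N"),
  V3 = c("D", "D", "N", "D", "D", "SD"),
  V4 = c("A", "N", "N", "SA", "N", "A"),
  V5 = c("D", "SD", "N", "SD", "D", "D"),
  V6 = c("N", "A", "A", "SA", "A", "SA"),
  V7 = c("D", "SD", "D", "SD", "D", "D"),
  V8 = c("N", "A", "A", "SA", "A", "A"),
  V9 = c("N", "N", "N", "SA", "A", "A"),
  PID = 1:6
)

items    <- paste0("V", 1:9)
reversed <- c("V3", "V5", "V7")            # items that are reverse-scored
lv       <- c("SD", "D", "N", "A", "SA")   # response labels in order, scored 0-4

# Convert labels to 0-4, then reverse-score the selected items
scored <- sapply(DF6[items], function(x) match(x, lv) - 1)
scored[, reversed] <- 4 - scored[, reversed]

DF6$Total     <- rowSums(scored)
DF6$Mean_Item <- round(rowMeans(scored), 2)
DF6[c("PID", "Total", "Mean_Item")]
```

For the first participant the recoded items match the table above (3 2 3 3 3 2 3 2 2), giving a total of 23.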

Calculate Reliability

myalpha <- DF %>%
  gather(Item, Response, -PID) %>%
  mutate(Response = ordered(Response, 
                            levels = c('SD','D','N','A','SA')),
         Response = as.numeric(Response) - 1,
         Response = ifelse(Item %in% c('V3','V5','V7'), 
                           4 - Response, Response)) %>%
  spread(Item, Response) %>%
  select(-PID) %>%
  psych::alpha()

Show Overall Reliability and Item Statistics

myalpha[1:2] %>% pander()

total:

raw_alpha	std.alpha	G6(smc)	average_r	S/N	ase	mean	sd	median_r
0.7937	0.7939	0.778	0.2997	3.851	0.01511	2.572	0.4559	0.3051

alpha.drop:

	raw_alpha	std.alpha	G6(smc)	average_r	S/N	alpha se	var.r	med.r
V1	0.7751	0.7754	0.7548	0.3014	3.452	0.01659	0.001268	0.3083
V2	0.7724	0.7726	0.7517	0.298	3.397	0.0168	0.001154	0.3051
V3	0.771	0.7714	0.7503	0.2967	3.375	0.0169	0.001196	0.3051
V4	0.773	0.7732	0.7527	0.2988	3.408	0.01675	0.001253	0.3022
V5	0.7678	0.7679	0.7477	0.2926	3.308	0.01714	0.001282	0.2903
V6	0.7746	0.7749	0.7545	0.3008	3.442	0.01664	0.001188	0.3051
V7	0.7756	0.7758	0.7556	0.302	3.461	0.01656	0.001299	0.305
V8	0.7749	0.775	0.7548	0.301	3.445	0.01661	0.001284	0.3083
V9	0.7788	0.779	0.7599	0.3058	3.525	0.01633	0.001397	0.3114
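
For readers curious about what raw alpha measures, the sketch below computes it directly from the usual formula: alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score), where k is the number of items. `psych::alpha()` reports considerably more than this; the function name `cronbach_alpha()` and the small example matrix are made up.

```r
# Cronbach's alpha from a numeric item matrix (rows = people, columns = items)
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                       # number of items
  item_vars <- apply(items, 2, var)      # variance of each item
  total_var <- var(rowSums(items))       # variance of the total score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Small made-up example: four people, three items scored 0-4
x <- cbind(i1 = c(0, 1, 3, 4),
           i2 = c(1, 1, 3, 4),
           i3 = c(0, 2, 2, 4))
cronbach_alpha(x)
# 0.96
```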

Repositories

One Repository

Upload Files and Commit Changes

Interactive Reports

R has ways of building web pages that allow visitors to do more than simply scroll
- Leaflet for dynamic maps
- flexdashboard to publish groups of related data visualizations
- DataTables displays R matrices or data frames as interactive HTML tables
- Plotly to easily translate your ggplot2 graphics to an interactive web-based version
- Shiny makes it easy to build interactive web apps straight from R
For other tools see https://www.htmlwidgets.org/

Plotly

library(ggplot2)
library(plotly)
p <- ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
            geom_bar(position = "dodge")
ggplotly(p)

DataTable

library(DT)

datatable(iris, options = list(pageLength = 5))

Can I Do This with Other Tools?

Can you easily get data from its cloud?
Can you write code to wrangle the data into useful summaries which can be run as often as needed?
Can you blend data analysis code with other report elements?
Do you have a good way of sharing the result across the research project team?

Resources

I used reveal.js to create this presentation

Questions

Feel free to email me with any questions (cmc13@nyu.edu)