What is Happening with My Research Project Right Now?

Tools for Building and Sharing Dynamic Reports from Web-Based Data Collection

Chuck Cleland

September 26, 2018

About Me


I think of myself as a collaborative research scientist
I have written many successful analysis plans and applied a variety of advanced methods
I have not been a developer of new methods

Goals for Today

  • Think about the problem
  • See what is possible
  • Don’t get hung up on the technical details
  • I am happy to help outside of this session with the technical details of anything I show

Toward Better Documentation and Communication of Analysis

  • Reproducible research
  • Put as much as possible in one place
  • Tailor communication for different audiences easily
    • Show/hide elements which may help/detract depending on the audience
  • Easily produce documents in more than one format
    • HTML
    • MS-Word
    • PDF
    • Slides

Opportunities

  • Web-based data collection allows access to data as it is collected
  • Summarizing project data more efficiently makes it easier to detect and solve problems before it’s too late
  • Sharing dynamic reports widely across the project team gets more eyes and minds on the data and can make project management smoother and less error-prone for staff

Benefits

  • Reproducible research - steps from raw data through data management, analysis, visualization, results and discussion clearly documented in one place
  • For investigators, once an element is built into a report, no need to ask the data team to update it multiple times
    • Investigators know where to go for current useful project summaries
  • For the data team, once an element is built, can be useful over the whole project
    • Don’t need to respond to the same request many times

Challenges

  • Changes over time in project data elements
    • New variables, changed variables, removed variables
  • Changes over time in tools used to pull data from its cloud and take it through steps of data management, analysis, visualization, report writing, and sharing
  • Need to learn some coding
  • Don’t get stuck thinking only about the elements in the report - always be thinking about what’s not in it

What is REDCap?

  • Secure web application for building and managing surveys and databases
  • Used by thousands of institutions in > 100 countries
  • Can be used to collect almost any kind of data, including in environments that must comply with:
    • 21 CFR Part 11
    • Federal Information Security Management Act (FISMA)
    • HIPAA

REDCap API

  • API is ‘Application Programming Interface’
  • REDCap API is an interface that allows external applications to connect to REDCap remotely
  • Used for programmatically retrieving or modifying data or settings within REDCap
  • Automated data imports/exports from a specified REDCap project
  • In order to use the REDCap API for a given REDCap project, you must first be given a token by the administrator that is specific to your username for that particular project

Wanted for Dynamic Reports

  • Automated to increase efficiency and reduce errors
  • Can be generated and shared in a modest amount of time
  • Can include multiple elements such as headers, tables, figures, bulleted text
  • Done from one place - no going back and forth between different software
  • Can create reports in multiple formats
  • Allows for interaction with the data by users?

Tools

  • REDCap
    • Web-based data collection
    • Accessible database with all project data
  • R
    • Use via RStudio
    • Interact with REDCap
    • Create automated summaries (dynamic reports)
  • GitHub
    • Sharing reports with colleagues as web pages

REDCap

  • Data dictionary provides labelled data essentially on-demand
  • Once the dictionary is constructed, and as it is revised, variable and value labels are part of the REDCap project and will be exported
  • Everything collected in one place
  • Data management needed, but not lots of merging
  • Other data collection tools (e.g., Qualtrics) may be able to fill the same role as REDCap

R Steps

  • Pull data from REDCap project using API
  • Generate dynamic reports, including headers, tables, figures, and comments
  • Knit R Markdown into HTML (or other formats)
  • RStudio is an integrated development environment (IDE) for R and makes using R easier in many ways

GitHub Steps

  • Create a repository for the project (one time)
  • Push regularly updated HTML files to the repository
  • Repositories are public (but see private repository options for academic institutions and if willing to pay)
  • There are other ways reports could be shared or hosted
    • Email, Box, Dropbox
    • A blog
    • I use MS-Excel and a private folder if showing unique participants
  • GitHub is much more than a way to host web pages

What To Include in Reports

  • Date
  • CONSORT flow diagram
  • Background on the scoring of measures
  • Reliability coefficients, frequency tables, crosstabulations, descriptive statistics by group
  • Time between key study activities
  • Completion/non-completion of assessment and intervention activities (might include identification of individuals, so not for public)

Useful R Packages

Bringing REDCap Data into R

  • API access to REDCap requires a “token” for security
  • I obtained a token for my REDCap project from the NYU School of Medicine, which administers REDCap
  • REDCap itself shows ways to export and import project data using API for multiple languages (R, Python, others)
  • I prefer to use an R package called redcapAPI for this purpose

Getting the Data with redcapAPI

  • My token is inside an MS-Excel file that only I can access (not public)
  • I establish a connection with the NYUSOM REDCap server using my token
  • I export project records using a function in the redcapAPI package
  • I need to do an advanced login to establish a VPN connection with NYUSOM for this step to work

Langone Remote Login

Successful VPN Connection

R Code to Export Records from REDCap
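
The code shown on this slide is not reproduced in this text. The following is a minimal sketch of the export step, assuming the redcapAPI functions redcapConnection() and exportRecords(). The function name pull_redcap_records, the URL, and the way the token is passed in are placeholders for illustration, not the actual NYUSOM settings.

```r
# Sketch only: url and token are placeholders supplied by the caller.
pull_redcap_records <- function(url, token) {
  if (!requireNamespace("redcapAPI", quietly = TRUE)) {
    stop("Install the redcapAPI package first.")
  }
  # Establish a connection object for the REDCap project
  rcon <- redcapAPI::redcapConnection(url = url, token = token)
  # Export the project records as a data frame
  redcapAPI::exportRecords(rcon)
}

# Hypothetical usage (needs a VPN connection and a valid token):
# records <- pull_redcap_records("https://redcap.example.edu/api/", my_token)
```

Keeping the token out of the script itself, as described above, means the script can be shared without exposing access to the project.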

Data in R, Now What?

  • Summarize, which will involve various kinds of data management
  • Organize into sections
    • Screening
    • Baseline Interview
    • Follow-ups
  • Address key questions and metrics, but often less is better
    • How many people have gone through each step of screening?
    • How many eligible people enroll?
    • What is the current rate of follow-up?

R Markdown

  • R Markdown is a unified authoring framework for data science
  • Combines code, results, and commentary
  • R Markdown files are the source code for rich, reproducible documents
  • Contain text written in markdown, a set of conventions for formatting plain text:
    • bold and italic text
    • lists
    • headers (e.g., section titles)
    • hyperlinks
    • and much more

Minimal R Markdown Example

---
title: "My Minimal Example"
author: "Chuck Cleland"
date: "March 18, 2018"
output: html_document
---
```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

```
# This is a top-level header

This is an R Markdown document. Markdown is a simple formatting syntax  
for authoring HTML, PDF, and MS Word documents. For more details on  
using R Markdown see <http://rmarkdown.rstudio.com>.

## This is a second-level header
```{r cars}

summary(mtcars$mpg)

```
Some comments about these data:

- These are data on 32 cars
- Variables include miles per gallon, cylinders, and weight
- Heavier cars go fewer miles per gallon of fuel

Rendering R Markdown

  • To transform your markdown file into an HTML, PDF, or Word document, click the “Knit” icon that appears above your file in the scripts editor.
  • A drop down menu will let you select the type of output that you want.
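
Rendering can also be done from code, which helps when a report is regenerated on a schedule. This is a minimal sketch assuming the rmarkdown package; the wrapper name render_report and the file name "weekly_report.Rmd" are hypothetical.

```r
# Sketch only: "weekly_report.Rmd" is a hypothetical file name.
render_report <- function(input = "weekly_report.Rmd",
                          output_format = "html_document") {
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("Install the rmarkdown package first.")
  }
  # Runs the R code in the file and writes the finished document
  rmarkdown::render(input, output_format = output_format)
}

# Hypothetical usage:
# render_report("weekly_report.Rmd", "word_document")
```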

Minimal Example Result

YAML Header

  • This first part of an R Markdown file is called the YAML header
  • YAML (rhymes with “camel”) is a human-friendly, cross language, Unicode based data serialization language
  • Key value pairs that control aspects of how the document is rendered
    • Output format
    • Output styling
    • Table of contents
    • Title, Author, Date
---
title: "Weekly Project Report"
author: "Chuck Cleland"
date: "March 23, 2018"
output:
  html_document:
    highlight: tango
    theme: cerulean
    toc: yes
    toc_depth: 5
---

R Code Chunks

  • In R Markdown files, sequences of characters (three backticks, followed by {r} at the start) signify that a “chunk” of R code is starting and ending
  • At the start of the chunk, chunk-specific options can be set to control things like figure size and whether R code and/or results are shown in the final document or not
  • R code chunk example:
```{r}

summary(mtcars)

Model_one <- lm(mpg ~ wt, data = mtcars)

summary(Model_one)

```

Tip

  • A shortcut to create a new empty code chunk: Ctrl + Alt + I

Inline R Expressions

  • Outside of R code chunks, R expressions can be used to fill in values that change over time in headers or other text in the report

Examples:
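
The slide's own examples are not reproduced in this text. As a hedged illustration, an inline expression is written as a backtick, the letter r, a space, an R expression, and a closing backtick. The sentences below are made up and use the built-in mtcars data:

```markdown
Report generated on `r format(Sys.Date(), "%B %d, %Y")`.

The example data contain `r nrow(mtcars)` cars with a mean of
`r round(mean(mtcars$mpg), 1)` miles per gallon.
```

When the document is knit, each expression is replaced by its current value, so the text stays in step with the data.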

Figures

  • Convey a large amount of information concisely
  • Visually interesting and engage the reader in a different way than text
  • Variety of figures useful in project reports
    • Flow diagrams
    • Distributions of key variables: barcharts, histograms, boxplots, density plots
    • Peer recruitment diagrams
    • Eligibility over time
    • Time to complete study activities: interviews, biological testing
  • I use R tools to prepare data for figures (tidyverse, data.table) and to produce the figures (ggplot2, igraph)

Eligibility Rate Over Time

Date Scr_ELG Day Month Total Eligible Ineligible Cumulative
2014-05-01 Ineligible 2014-05-01 2014-05-01 1 0 1 0.0000000
2014-05-02 Eligible 2014-05-02 2014-05-01 2 1 1 0.5000000
2014-05-02 Eligible 2014-05-02 2014-05-01 3 2 1 0.6666667
2014-05-02 Ineligible 2014-05-02 2014-05-01 4 2 2 0.5000000
2014-05-02 Ineligible 2014-05-02 2014-05-01 5 2 3 0.4000000
2014-05-02 Eligible 2014-05-02 2014-05-01 6 3 3 0.5000000
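
The running totals and cumulative eligibility rate in the table above can be built with cumulative sums. A minimal base R sketch using the six rows shown; the data frame name screen is an assumption:

```r
# Screening outcomes in the order they occurred (the six rows shown above)
screen <- data.frame(
  Date = as.Date(c("2014-05-01", "2014-05-02", "2014-05-02",
                   "2014-05-02", "2014-05-02", "2014-05-02")),
  Scr_ELG = c("Ineligible", "Eligible", "Eligible",
              "Ineligible", "Ineligible", "Eligible")
)

# Running counts of people screened, eligible, and ineligible
screen$Total      <- seq_len(nrow(screen))
screen$Eligible   <- cumsum(screen$Scr_ELG == "Eligible")
screen$Ineligible <- cumsum(screen$Scr_ELG == "Ineligible")

# Cumulative eligibility rate after each screening
screen$Cumulative <- screen$Eligible / screen$Total
screen
```

Plotting Cumulative against Date (for example with ggplot2) gives the eligibility-over-time figure.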

How Long for Blood Results?

        Day1       Day2
1 2014-08-15 2014-08-25
2 2015-01-15 2015-01-21
3 2015-01-21 2015-01-31
4 2015-04-03 2015-04-12
5 2014-12-25 2015-01-01
6 2015-01-25 2015-02-12
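
Subtracting two date columns gives the waiting time in days. A minimal sketch with the six rows shown above, assuming Day1 is the date blood was drawn and Day2 the date results came back (the slide does not label them):

```r
blood <- data.frame(
  Day1 = as.Date(c("2014-08-15", "2015-01-15", "2015-01-21",
                   "2015-04-03", "2014-12-25", "2015-01-25")),
  Day2 = as.Date(c("2014-08-25", "2015-01-21", "2015-01-31",
                   "2015-04-12", "2015-01-01", "2015-02-12"))
)

# Date subtraction returns a difftime in days; convert to plain numbers
blood$Days <- as.numeric(blood$Day2 - blood$Day1)
summary(blood$Days)
```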

Time to Complete Interview

                Start                 End
1 2016-01-03 13:49:00 2016-01-03 14:50:33
2 2016-11-02 13:19:00 2016-11-02 14:27:14
3 2017-07-03 13:21:00 2017-07-03 14:21:04
4 2016-08-14 10:13:00 2016-08-14 11:32:07
5 2016-12-03 15:47:00 2016-12-03 16:38:59
6 2017-08-10 13:56:00 2017-08-10 15:13:19
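
Interview length can be computed from the start and end timestamps with difftime(). A minimal sketch with the six rows shown above; treating the timestamps as UTC is an assumption made only to keep the arithmetic free of daylight-saving shifts:

```r
interviews <- data.frame(
  Start = as.POSIXct(c("2016-01-03 13:49:00", "2016-11-02 13:19:00",
                       "2017-07-03 13:21:00", "2016-08-14 10:13:00",
                       "2016-12-03 15:47:00", "2017-08-10 13:56:00"),
                     tz = "UTC"),
  End   = as.POSIXct(c("2016-01-03 14:50:33", "2016-11-02 14:27:14",
                       "2017-07-03 14:21:04", "2016-08-14 11:32:07",
                       "2016-12-03 16:38:59", "2017-08-10 15:13:19"),
                     tz = "UTC")
)

# Elapsed time in minutes for each interview
interviews$Minutes <- as.numeric(
  difftime(interviews$End, interviews$Start, units = "mins")
)
round(interviews$Minutes, 1)
```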

Tables

  • Frequency distribution for key variables
  • Crosstabulations of key variables
  • Descriptive statistics for continuous variables
  • Descriptive statistics by a grouping variable
  • R packages can improve the look of tables in HTML files
    • knitr
    • pander
    • xtable
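
As one hedged illustration of these packages, knitr::kable() turns a data frame into a formatted table when the report is knit. The helper name make_report_table is hypothetical:

```r
# Sketch only: wraps knitr::kable() with a fixed number of digits.
make_report_table <- function(df, caption = NULL) {
  if (!requireNamespace("knitr", quietly = TRUE)) {
    stop("Install the knitr package first.")
  }
  knitr::kable(df, digits = 2, caption = caption)
}

# Hypothetical usage inside an R Markdown chunk:
# make_report_table(head(mtcars[, 1:6], 5), caption = "Table 1")
```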

Frequency Table

           Frequency Percent
setosa            50      33
versicolor        50      33
virginica         50      33
Total            150     100
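
A frequency table like the one above can be sketched in base R with table(), here using the built-in iris data. The object names counts and freq are assumptions:

```r
# Count observations per species
counts <- table(iris$Species)

# Frequencies and whole-number percentages, one row per species
freq <- data.frame(
  Frequency = as.vector(counts),
  Percent   = round(100 * as.vector(counts) / sum(counts)),
  row.names = names(counts)
)

# Add a total row (the percentage total is 100 by definition)
freq["Total", ] <- c(sum(counts), 100)
freq
```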

Crosstabulation

Cells show N and Row(%)

             Sepal.Width > 3
Species      FALSE         TRUE          Total
setosa        8 (16.0%)    42 (84.0%)     50 (33.3%)
versicolor   42 (84.0%)     8 (16.0%)     50 (33.3%)
virginica    33 (66.0%)    17 (34.0%)     50 (33.3%)
Total        83            67            150
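
The counts and row percentages in this crosstabulation can be reproduced in base R with table(), addmargins(), and prop.table(). This is a sketch of the calculation only, not the package that produced the formatted table on the slide:

```r
# Cross-classify species by whether sepal width exceeds 3
tab <- table(Species = iris$Species,
             "Sepal.Width > 3" = iris$Sepal.Width > 3)

# Counts with row and column totals
counts <- addmargins(tab)

# Row percentages: share of each species below/above the cutoff
row_pct <- round(100 * prop.table(tab, margin = 1), 1)

counts
row_pct
```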

Descriptives for Multiple Variables

Variable n Mean SD Median IQR Min Max
Petal.Length 150 3.8 1.8 4.35 3.5 1.0 6.9
Petal.Width 150 1.2 0.8 1.30 1.5 0.1 2.5
Sepal.Length 150 5.8 0.8 5.80 1.3 4.3 7.9
Sepal.Width 150 3.1 0.4 3.00 0.5 2.0 4.4
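
One way to build a table like this is to apply a small summary function to each column. A minimal base R sketch; the helper name describe is an assumption:

```r
# Summary statistics for one numeric vector
describe <- function(x) {
  c(n = length(x), Mean = mean(x), SD = sd(x), Median = median(x),
    IQR = IQR(x), Min = min(x), Max = max(x))
}

vars <- c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width")

# One row per variable, one column per statistic
desc <- t(sapply(iris[vars], describe))
round(desc, 2)
```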

Descriptives By Grouping Variable

Variable Species n Mean SD Median IQR Min Max
Petal.Length setosa 50 1.5 0.2 1.50 0.175 1.0 1.9
Petal.Length versicolor 50 4.3 0.5 4.35 0.600 3.0 5.1
Petal.Length virginica 50 5.6 0.6 5.55 0.775 4.5 6.9
Petal.Width setosa 50 0.2 0.1 0.20 0.100 0.1 0.6
Petal.Width versicolor 50 1.3 0.2 1.30 0.300 1.0 1.8
Petal.Width virginica 50 2.0 0.3 2.00 0.500 1.4 2.5
Sepal.Length setosa 50 5.0 0.4 5.00 0.400 4.3 5.8
Sepal.Length versicolor 50 5.9 0.5 5.90 0.700 4.9 7.0
Sepal.Length virginica 50 6.6 0.6 6.50 0.675 4.9 7.9
Sepal.Width setosa 50 3.4 0.4 3.40 0.475 2.3 4.4
Sepal.Width versicolor 50 2.8 0.3 2.80 0.475 2.0 3.4
Sepal.Width virginica 50 3.0 0.3 3.00 0.375 2.2 3.8
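
The grouped version follows the same idea after splitting the data by group. A minimal base R sketch for one variable; the helper name summarise_group is an assumption:

```r
# Summary statistics for one variable within each level of a grouping variable
summarise_group <- function(values, group) {
  pieces <- lapply(split(values, group), function(x) {
    data.frame(n = length(x), Mean = mean(x), SD = sd(x),
               Median = median(x), IQR = IQR(x),
               Min = min(x), Max = max(x))
  })
  out <- do.call(rbind, pieces)
  # Move the group labels from row names into a regular column
  data.frame(Group = rownames(out), out, row.names = NULL)
}

petal_length <- summarise_group(iris$Petal.Length, iris$Species)
petal_length
```

Looping this helper over several variables and stacking the results gives a table with one row per variable and group.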

Follow-Up Rates

  • To summarize during the study, need to define when each participant is:
    • not yet eligible for FU
    • in the window period for FU
    • time for FU is up
  • Rate of greatest interest is the follow-up rate when the window period is over
  • Participants move through table from top to bottom
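
The three states can be assigned from the number of days since baseline. In this minimal sketch the window limits (150 and 210 days), the dates, and the completion flags are all made up for illustration; a real study would use its own protocol values.

```r
# Classify follow-up (FU) status from days since baseline.
# window_open and window_close are hypothetical protocol values.
fu_status <- function(baseline, as_of, window_open = 150, window_close = 210) {
  days <- as.numeric(as_of - baseline)
  cut(days,
      breaks = c(-Inf, window_open, window_close, Inf),
      labels = c("Not yet eligible", "In window", "Window closed"),
      right = FALSE)
}

# Made-up participants
baseline  <- as.Date(c("2018-08-01", "2018-04-15", "2018-01-10", "2017-11-20"))
completed <- c(FALSE, FALSE, TRUE, FALSE)
status    <- fu_status(baseline, as_of = as.Date("2018-09-26"))

table(status)

# Follow-up rate among participants whose window is over
closed_rate <- mean(completed[status == "Window closed"])
closed_rate
```

Passing the report date as as_of (for example Sys.Date()) lets the same code move participants through the table each time the report is rebuilt.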

Further Customizing Tables

Table 1

mpg cyl disp hp drat wt
Mazda RX4 21.0 6 160 110 3.90 2.620
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
Datsun 710 22.8 4 108 93 3.85 2.320
Hornet 4 Drive 21.4 6 258 110 3.08 3.215
Hornet Sportabout 18.7 8 360 175 3.15 3.440

Table 2

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa

CONSORT Flow Diagram

CONSORT Flow Diagram Result

Scoring Instruments

  • Common task is scoring an instrument composed of multiple Likert-type items
  • Data are often imported with each item in a separate column and labels rather than numbers
  • Below are a few lines of hypothetical item data:
  V1 V2 V3 V4 V5 V6 V7 V8 V9 PID
1  A  N  D  A  D  N  D  N  N   1
2  N SA  D  N SD  A SD  A  N   2
3  N  N  N  N  N  A  D  A  N   3
4 SA SA  D SA SD SA SD SA SA   4
5  A  A  D  N  D  A  D  A  A   5
6  A  N SD  A  D SA  D  A  A   6

Wrangling Instrument Items

  PID V1 V2 V3 V4 V5 V6 V7 V8 V9
1   1  3  2  3  3  3  2  3  2  2
2   2  2  4  3  2  4  3  4  3  2
3   3  2  2  2  2  2  3  3  3  2
4   4  4  4  3  4  4  4  4  4  4
5   5  3  3  3  2  3  3  3  3  3
6   6  3  2  4  3  3  4  3  3  3
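
The step from labels to numbers can be sketched in base R. The 0 to 4 coding and the reverse scoring of V3, V5, and V7 are inferred by comparing the two tables above; the slide does not state them.

```r
# Hypothetical item data as imported (labels rather than numbers)
items <- read.table(text = "
V1 V2 V3 V4 V5 V6 V7 V8 V9 PID
A  N  D  A  D  N  D  N  N  1
N  SA D  N  SD A  SD A  N  2
N  N  N  N  N  A  D  A  N  3
SA SA D  SA SD SA SD SA SA 4
A  A  D  N  D  A  D  A  A  5
A  N  SD A  D  SA D  A  A  6
", header = TRUE, colClasses = c(rep("character", 9), "integer"))

# Inferred coding: Strongly Disagree = 0 ... Strongly Agree = 4
codes <- c(SD = 0, D = 1, N = 2, A = 3, SA = 4)
item_names <- paste0("V", 1:9)

# Look up the number for each label, column by column
scored <- as.data.frame(lapply(items[item_names], function(x) unname(codes[x])))

# Reverse-score the negatively worded items (inferred: V3, V5, V7)
reversed <- c("V3", "V5", "V7")
scored[reversed] <- 4 - scored[reversed]

scored <- cbind(PID = items$PID, scored)
scored
```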

Calculate Reliability
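
The code for this slide is not reproduced in this text; the output on the next slide has the layout produced by psych::alpha(). As an illustration of the underlying calculation only, here is a base R sketch of raw Cronbach's alpha, k/(k - 1) × (1 - sum of item variances / variance of the total score). The function name cronbach_alpha and the three-item data are assumptions for illustration.

```r
# Raw Cronbach's alpha from a data frame or matrix of numeric item scores
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                      # number of items
  item_vars <- apply(items, 2, var)     # variance of each item
  total_var <- var(rowSums(items))      # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Three hypothetical items scored 0 to 4 for six participants
toy <- data.frame(V1 = c(3, 2, 2, 4, 3, 3),
                  V2 = c(2, 4, 2, 4, 3, 2),
                  V4 = c(3, 2, 2, 4, 2, 3))
cronbach_alpha(toy)
```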

Show Overall Reliability and Item Statistics

  • total:

    raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
    0.7937 0.7939 0.778 0.2997 3.851 0.01511 2.572 0.4559 0.3051
    • alpha.drop:

        raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
      V1 0.7751 0.7754 0.7548 0.3014 3.452 0.01659 0.001268 0.3083
      V2 0.7724 0.7726 0.7517 0.298 3.397 0.0168 0.001154 0.3051
      V3 0.771 0.7714 0.7503 0.2967 3.375 0.0169 0.001196 0.3051
      V4 0.773 0.7732 0.7527 0.2988 3.408 0.01675 0.001253 0.3022
      V5 0.7678 0.7679 0.7477 0.2926 3.308 0.01714 0.001282 0.2903
      V6 0.7746 0.7749 0.7545 0.3008 3.442 0.01664 0.001188 0.3051
      V7 0.7756 0.7758 0.7556 0.302 3.461 0.01656 0.001299 0.305
      V8 0.7749 0.775 0.7548 0.301 3.445 0.01661 0.001284 0.3083
      V9 0.7788 0.779 0.7599 0.3058 3.525 0.01633 0.001397 0.3114

Sharing the Report Via GitHub

  • Git and GitHub are mainly for version control and collaboration
  • GitHub provides a fairly easy and free way to host static web pages (not interactive, no embedded apps)
  • Work organized into repositories of files, including HTML
  • Repository can be set up to show web pages

Repositories

One Repository

Upload Files and Commit Changes

Interactive Reports

  • R has ways of building web pages that allow visitors to do more than simply scroll
    • Leaflet for dynamic maps
    • flexdashboard to publish groups of related data visualizations
    • DataTables displays R matrices or data frames as interactive HTML tables
    • Plotly to easily translate your ggplot2 graphics to an interactive web-based version
    • Shiny makes it easy to build interactive web apps straight from R
  • For other tools see https://www.htmlwidgets.org/

Plotly

DataTable

Can I Do This with Other Tools?

  • Can you easily get data from its cloud?
  • Can you write code to wrangle the data into useful summaries which can be run as often as needed?
  • Can you blend data analysis code with other report elements?
  • Do you have a good way of sharing the result across the research project team?

Resources

I used reveal.js to create this presentation

Questions

Feel free to email me with any questions (cmc13@nyu.edu)