Introduction

Data can be represented at various levels of explicitness. A “variable name” is a short token used in programming to refer to a quantity, outcome, or event of interest.

A data dictionary maps variable names to more explicit definitions of quantities, outcomes, and events, and may include more information on the context of measurement.
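The mapping can be pictured as a small lookup table. The sketch below uses base R only. The variable names, the definitions, and the helper `define` are made up for illustration and are not taken from the CDC workbook.

```r
# Hypothetical data dictionary: names and definitions are illustrative only
# and do not come from the CDC workbook.
dict <- data.frame(
  variable = c("admin_dose1", "pop_pct", "rpt_date"),
  definition = c(
    "Count of people who received at least one dose",
    "Percentage of the population covered",
    "Date on which the record was reported"
  ),
  stringsAsFactors = FALSE
)

# Look up the explicit definition behind a short variable name.
# Returns NA when the name is not in the dictionary.
define <- function(name, dictionary = dict) {
  hit <- dictionary$definition[dictionary$variable == name]
  if (length(hit) == 0) NA_character_ else hit
}

define("rpt_date")
```

A real dictionary would typically carry further columns, such as units or the context of measurement.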

The teachCovidData package includes a version of a CDC data dictionary, published as a multisheet Excel workbook.

The sheet names can be retrieved with excel_sheets from the readxl package:

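library(readxl)
# locate the data dictionary workbook shipped with teachCovidData
pa = system.file("cdc/VACCDataDictionary_v36_12082022.xlsx", package="teachCovidData")
# list the names of all sheets in the workbook
shn = excel_sheets(pa)
shn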
##  [1] "0. Notes"                        "1. Vaccinations_US_Jurisdiction"
##  [3] "2. Vaccinations_US_Trends"       "3. Vaccinations_US_Demograp"    
##  [5] "4. Vaccination_Age_Sex_Trends"   "5. Vaccinations_US_County"      
##  [7] "6. Vaccination_CaseTrends_AgeGp" "7. Booster Dose Eligibility"    
##  [9] "8. Primary and Booster Chart"    "9. Jurisdiction Abbreviations"
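Because the sheet names are plain character strings, a sheet of interest can be found by keyword rather than by position. The following is a minimal sketch in base R. The vector is copied from the output above so that the block runs on its own, and `find_sheet` is a hypothetical helper, not part of readxl or teachCovidData.

```r
# Sheet names copied from the excel_sheets output shown above.
shn <- c(
  "0. Notes", "1. Vaccinations_US_Jurisdiction",
  "2. Vaccinations_US_Trends", "3. Vaccinations_US_Demograp",
  "4. Vaccination_Age_Sex_Trends", "5. Vaccinations_US_County",
  "6. Vaccination_CaseTrends_AgeGp", "7. Booster Dose Eligibility",
  "8. Primary and Booster Chart", "9. Jurisdiction Abbreviations"
)

# Hypothetical helper: return the sheet names that mention a keyword,
# ignoring case.
find_sheet <- function(keyword, sheets = shn) {
  grep(keyword, sheets, ignore.case = TRUE, value = TRUE)
}

find_sheet("county")
find_sheet("trends")
```

A matching name can then be passed as the sheet argument of read_xlsx in place of a numeric index.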

The first sheet is an overview:

p1 = read_xlsx(pa, 1)
knitr::kable(head(as.data.frame(p1)))
CDC COVID-19 Vaccine Administration and Distribution data …2 …3
Recent as of 11/17/2022 @ 8:00 AM ET NA NA
Historical data available for download: NA Associated CDC COVID Data Tracker Site:
COVID-19 Vaccinations in the United States, Jurisdiction Vaccinations in the United States
COVID-19 Vaccination Trends in the United States, National and Jurisdictional Vaccination Trends
COVID-19 Vaccination Age and Sex Trends in the United States, National and Jurisdictional Vaccination Demographic Trends
COVID-19 Vaccinations in the United States, County level Vaccinations by County

We process all sheets using process_datadict, storing the result as pd, a list with one element per sheet.
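The internals of process_datadict are not described in this text. As a rough, independent sketch of the general idea (reading every sheet of a workbook into a named list), the following uses only readxl. Here `read_all_sheets` is a hypothetical helper and not part of teachCovidData, and the workbook is the datasets.xlsx example bundled with readxl, used as a stand-in for the CDC file.

```r
library(readxl)

# Hypothetical helper (not part of teachCovidData): read every sheet of a
# workbook into a list of data frames, named by sheet.
read_all_sheets <- function(path) {
  sheets <- excel_sheets(path)
  out <- lapply(sheets, function(s) as.data.frame(read_excel(path, sheet = s)))
  names(out) <- sheets
  out
}

# Stand-in workbook shipped with readxl; substitute the dictionary path
# to apply the same idea to the CDC file.
demo_path <- readxl_example("datasets.xlsx")
all_sheets <- read_all_sheets(demo_path)
names(all_sheets)
```

With this layout, all_sheets[[2]] is the second sheet, which mirrors the pd[[2]] indexing used in this document. Any additional cleaning that process_datadict may perform is not reproduced by this sketch.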

An interactive view of the second sheet:

library(DT)
datatable(pd[[2]])

To survey the dictionary interactively in R, run datadict_app(pd) after performing the calculations above, or run example(datadict_app).