Publication

Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C) [preprint]

Thomas, Jason A.
Foraker, Randi E.
Zamstein, Noa
Payne, Philip R.O.
Wilcox, Adam B.
Abstract

Objective To evaluate whether synthetic data derived from a national COVID-19 data set could be used for geospatial and temporal epidemic analyses.

Materials and Methods Using an original data set (n=1,854,968 SARS-CoV-2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip-code level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated.
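
As a rough illustration of the epidemic-curve comparison described above, the following minimal Python sketch counts positive tests per month in an original and a synthetic data set and reports the absolute difference for each month. The record layout `(zip_code, test_date, is_positive)`, the function names, and the toy records are assumptions made for this example; the abstract does not state which statistical comparisons the authors used, so this is not their method.

```python
from collections import Counter
from datetime import date


def monthly_positive_counts(tests):
    """Count positive tests per (year, month).

    `tests` is an iterable of (zip_code, test_date, is_positive) tuples;
    this layout is a hypothetical one chosen for the example.
    """
    counts = Counter()
    for _zip_code, test_date, is_positive in tests:
        if is_positive:
            counts[(test_date.year, test_date.month)] += 1
    return counts


def curve_differences(original, synthetic):
    """Return the absolute difference in positive counts for each month."""
    orig = monthly_positive_counts(original)
    synth = monthly_positive_counts(synthetic)
    months = sorted(set(orig) | set(synth))
    # Counter returns 0 for months missing from one of the data sets.
    return {m: abs(orig[m] - synth[m]) for m in months}


# Toy records, invented purely for illustration.
original = [
    ("zip_A", date(2020, 3, 5), True),
    ("zip_A", date(2020, 3, 9), True),
    ("zip_A", date(2020, 4, 1), False),
    ("zip_B", date(2020, 4, 2), True),
]
synthetic = [
    ("zip_A", date(2020, 3, 6), True),
    ("zip_A", date(2020, 4, 3), True),
    ("zip_B", date(2020, 4, 8), True),
]

diffs = curve_differences(original, synthetic)
```

The same counting could be restricted to a single zip code to compare zip-code level curves rather than the aggregate curve.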

Results In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean=2.9±2.4; max=16 tests; 66% reduction of unique zip codes). Epidemic curves were similar between synthetic and original data in a random sample of the most-tested zip codes (top 1%; n=171), and monthly indicator counts were similar for all unsuppressed zip codes (n=5,819). In small sample sizes, synthetic data utility was notably decreased.
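
The suppression figures reported above (tests per suppressed zip code and the reduction in unique zip codes) are the kind of summary that could be computed as in the hedged sketch below. It assumes each data set is available as one zip-code label per test, with `None` marking a suppressed label in the synthetic data; that representation, the function name, and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter
from statistics import mean


def suppression_summary(original_zips, synthetic_zips):
    """Summarize which original zip codes no longer appear in synthetic data.

    Both arguments hold one zip-code label per test. A label of None in
    `synthetic_zips` is assumed (for this example) to mean "suppressed".
    """
    orig_counts = Counter(original_zips)
    kept = {z for z in synthetic_zips if z is not None}
    # Original zip codes whose label never appears in the synthetic data.
    suppressed = {z: n for z, n in orig_counts.items() if z not in kept}
    return {
        "unique_original": len(orig_counts),
        "unique_kept": len(kept & set(orig_counts)),
        "pct_reduction": 100 * len(suppressed) / len(orig_counts),
        "mean_tests_suppressed": mean(suppressed.values()) if suppressed else 0,
        "max_tests_suppressed": max(suppressed.values(), default=0),
    }


# Toy labels, invented purely for illustration.
original_zips = ["A"] * 5 + ["B"] * 2 + ["C"] * 1 + ["D"] * 4
synthetic_zips = ["A"] * 5 + [None] * 3 + ["D"] * 4

summary = suppression_summary(original_zips, synthetic_zips)
```

A summary like this makes it easy to see whether suppression is concentrated in sparsely tested zip codes, which is the pattern the results describe.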

Discussion Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived data sets. Analyses of sparsely tested populations were less similar and had more data suppression.

Conclusion In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.

Source

medRxiv 2021.07.06.21259051; doi: https://doi.org/10.1101/2021.07.06.21259051

DOI
10.1101/2021.07.06.21259051
PubMed ID
34268525
Notes

This article is a preprint. Preprints are preliminary reports of work that have not been certified by peer review.

The UMass Center for Clinical and Translational Science (UMCCTS), UL1TR001453, helped fund this study.

Related Resources

Now published in the Journal of the American Medical Informatics Association, doi: 10.1093/jamia/ocac045

Rights
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.