Publication

Causal Inference via Electronic Health Records in the National Clinical Cohort Collaborative: Challenges and Solutions in Long COVID Research [preprint]

Butzin-Dozier, Zachary
Ji, Yunwen
Wang, Lin-Chiun
Anzalone, A Jerrod
Hurwitz, Eric
Patel, Rena C
van der Laan, Mark J
Colford, John M
Hubbard, Alan E
Abstract

Observational analyses of electronic health record (EHR) data using databases such as the National Clinical Cohort Collaborative pose unique challenges for researchers seeking causal inferences, particularly when evaluating subjectively defined outcomes like Long COVID. We explore several challenges and describe potential solutions.
1. Lack of true negatives: Many diagnoses and conditions either have a positive indicator or a missing status, requiring investigators to carefully consider which patients are likely negative for this condition.
2. Differential monitoring: EHR data include nonrandom missingness driven by patients engaging with the healthcare system at different rates, which is often related to both the exposure and outcome of interest.
3. Bias: EHR data sources face many biases, but are particularly vulnerable to informative missingness, differential monitoring, and model misspecification.
4. Large sample size: High precision (i.e., narrow confidence intervals) paired with potential bias leads to a high risk of incorrectly rejecting the null hypothesis.
5. Defining index time: It is important that investigators deliberately define index time (i.e., baseline) to ensure that they only adjust for baseline confounders and do not adjust for (or condition on) factors that are affected by the exposure of interest (i.e., colliders or mediators).
6. Parameter selection: Investigators should only select parameters that are supported by the data distribution.
This manuscript provides an overview of these challenges and solutions, using both simulated data and real-world data, with the outcome of Long COVID as the running example.
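To make challenges 1, 2, and 4 above concrete, the following is a minimal simulation sketch and not the authors' code. It assumes a true null effect (the outcome is independent of exposure), an outcome that is only recorded when a patient happens to be monitored, and unrecorded outcomes treated as negatives. The function name `simulate_risk_difference` and all probabilities are hypothetical values chosen for illustration.

```python
import math
import random


def simulate_risk_difference(n, seed=0, p_outcome=0.10,
                             p_seen_exposed=0.6, p_seen_unexposed=0.5):
    """Naive risk difference and 95% Wald CI under a true null effect.

    All parameter values are illustrative assumptions, not estimates
    from the preprint.
    """
    rng = random.Random(seed)
    # arm -> [number of patients, number of recorded outcomes]
    counts = {0: [0, 0], 1: [0, 0]}
    for _ in range(n):
        exposed = 1 if rng.random() < 0.5 else 0
        # The true outcome does not depend on exposure (null hypothesis holds).
        has_outcome = rng.random() < p_outcome
        # Differential monitoring: exposed patients are seen more often.
        p_seen = p_seen_exposed if exposed else p_seen_unexposed
        # No true negatives: an unrecorded outcome is treated as "negative".
        recorded = has_outcome and rng.random() < p_seen
        counts[exposed][0] += 1
        counts[exposed][1] += int(recorded)

    n1, y1 = counts[1]
    n0, y0 = counts[0]
    r1, r0 = y1 / n1, y0 / n0
    rd = r1 - r0
    se = math.sqrt(r1 * (1 - r1) / n1 + r0 * (1 - r0) / n0)
    return rd, rd - 1.96 * se, rd + 1.96 * se


if __name__ == "__main__":
    for n in (1_000, 400_000):
        rd, low, high = simulate_risk_difference(n)
        print(f"n={n}: RD={rd:.4f}, 95% CI=({low:.4f}, {high:.4f})")
```

Under these assumptions the naive estimate is biased toward roughly 0.10 × (0.6 − 0.5) = 0.01 even though the true effect is zero. As the sample grows, the confidence interval narrows around that biased value rather than around zero, which is the mechanism behind the elevated risk of incorrectly rejecting the null hypothesis.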

Source

Butzin-Dozier Z, Ji Y, Wang LC, Anzalone AJ, Hurwitz E, Patel RC, van der Laan MJ, Colford JM Jr, Hubbard AE. Causal Inference via Electronic Health Records in the National Clinical Cohort Collaborative: Challenges and Solutions in Long COVID Research. medRxiv [Preprint]. 2025 Jun 11:2025.06.06.25329168. doi: 10.1101/2025.06.06.25329168. PMID: 40502605; PMCID: PMC12155030.

DOI
10.1101/2025.06.06.25329168
PubMed ID
40502605
Notes

This article is a preprint. Preprints are preliminary reports of work that have not been certified by peer review.

Funding and Acknowledgements
The UMass Center for Clinical and Translational Science (UMCCTS), UL1TR001453, helped fund this study.
Rights
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.