Authors
Pfaff, Emily R.
Girvin, Andrew T.
Bennett, Tellen D.
Bhatia, Abhishek
Brooks, Ian M.
Deer, Rachel R.
Dekermanjian, Jonathan P.
Jolley, Sarah Elizabeth
Kahn, Michael G.
Kostka, Kristin
McMurry, Julie A.
Moffitt, Richard
Walden, Anita
Chute, Christopher G.
Haendel, Melissa A.
The N3C Consortium
UMass Chan Affiliations
UMass Center for Clinical and Translational Science
Document Type
Preprint
Publication Date
2021-10-22
Keywords
Infectious Diseases
long COVID
Post-acute sequelae of SARS-CoV-2 infection
PASC
electronic health records
diagnosis
Computer Sciences
Data Science
Diagnosis
Health Information Technology
Infectious Disease
Translational Medical Research
Virus Diseases
Abstract
Background: Post-acute sequelae of SARS-CoV-2 infection (PASC), otherwise known as long-COVID, have severely impacted recovery from the pandemic for patients and society alike. This new disease is characterized by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous long-COVID definition. Electronic health record (EHR) studies are a critical element of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, which is addressing the urgent need to understand PASC, accurately identify who has PASC, and identify treatments.
Methods: Using the National COVID Cohort Collaborative’s (N3C) EHR repository, we developed XGBoost machine learning (ML) models to identify potential long-COVID patients. We examined demographics, healthcare utilization, diagnoses, and medications for 97,995 adult COVID-19 patients. We used these features and 597 long-COVID clinic patients to train three ML models to identify potential long-COVID patients among (1) all COVID-19 patients, (2) patients hospitalized with COVID-19, and (3) patients who had COVID-19 but were not hospitalized.
Findings: Our models identified potential long-COVID patients with high accuracy, achieving areas under the receiver operator characteristic curve of 0.91 (all patients), 0.90 (hospitalized), and 0.85 (non-hospitalized). Important features include rate of healthcare utilization, patient age, dyspnea, and other diagnosis and medication information available within the EHR. Applying the “all patients” model to the larger N3C cohort identified 100,263 potential long-COVID patients.
Interpretation: Patients flagged by our models can be interpreted as “patients likely to be referred to or seek care at a long-COVID specialty clinic,” an essential proxy for long-COVID diagnosis in the current absence of a definition. We also achieve the urgent goal of identifying potential long-COVID patients for clinical trials. As more data sources are identified, the models can be retrained and tuned based on study needs.
Funding: This study was funded by NCATS and NIH through the RECOVER Initiative.
Source
medRxiv 2021.10.18.21265168; doi: https://doi.org/10.1101/2021.10.18.21265168. Link to preprint on medRxiv
DOI
10.1101/2021.10.18.21265168
Permanent Link to this Item
http://hdl.handle.net/20.500.14038/50431
Notes
This article is a preprint. Preprints are preliminary reports of work that have not been certified by peer review.
The UMass Center for Clinical and Translational Science (UMCCTS), UL1TR001453, helped fund this study.
Rights
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Distribution License
http://creativecommons.org/licenses/by/4.0/