Integrating Symbolic Regression for Generalizable and Interpretable Machine Learning in Cardiovascular Risk Prediction
Ferguson, Michael
Abstract
Introduction: Cardiovascular disease (CVD) risk prediction is crucial for timely, targeted risk factor modification. Machine learning (ML) is often applied to CVD risk prediction to improve diagnosis and treatment, but performance has been modest [1] and interpretability limited, and models often require large datasets and risk overfitting. Enhancing the generalizability and interpretability of ML models for CVD risk prediction is critical for their broad adoption and clinical deployment. We explored the integration of symbolic regression (SR) with random forests (RF) to address this challenge.
Methods: We analyzed 518,389 patients in the UMass Memorial Clarity data lake research database. These patients had at least one clinical encounter between Oct 1, 2017 and Nov 1, 2018, with no indication of death or CVD during that time based on ICD-10 codes [2]. Demographic features (e.g., age, race/ethnicity) and known risk factors (e.g., diagnoses, lab results) were extracted from the electronic health records. Pre-processing included examination of missing/spurious data, imputation using MissForest, one-hot encoding, and aggregation of laboratory and diagnostic features by condition and measured quantity. We developed and validated a GPU-accelerated SR-enhanced RF (SReRF), motivated by prior work hybridizing SR with decision trees [3], to predict 5-year CVD risk. Performance metrics included precision, recall, F1, area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC). We used a 75/25 train/test split and tuned an RF using grid search with cross-validation.
Results: The patient cohort had a mean (SD) age of 48.6 (18.7) years, with a 5.8% CVD event rate observed in the 5-year window. SReRF yielded superior performance compared with classical RF. Specifically, classical RF had an AUROC (SD) of 0.79 (0.0027), AUPRC of 0.18 (0.0027), and F1 of 0.27 (0.0032), compared with SReRF’s AUROC of 0.82 (0.0023), AUPRC of 0.20 (0.0034), and F1 of 0.29 (0.0034).
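The SDs in parentheses are bootstrap estimates over the test set. As a rough illustration of that kind of calculation, the sketch below estimates the SD of AUROC by resampling a test set with replacement. It is not the study's code: the synthetic labels and scores, the function names `auroc` and `bootstrap_sd`, and the choice to skip resamples that contain only one class are all assumptions made for illustration.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC is undefined when only one class is present")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_sd(labels, scores, metric, n_boot=500, seed=0):
    """SD of `metric` over `n_boot` resamples of the test set, drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # assumption: skip resamples with a single class
            values.append(metric(y, [scores[i] for i in idx]))
    return statistics.stdev(values)


# Hypothetical test set: rare positives that tend to receive higher scores.
rng = random.Random(42)
labels = [1 if rng.random() < 0.06 else 0 for _ in range(1000)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

point = auroc(labels, scores)
sd = bootstrap_sd(labels, scores, auroc, n_boot=500)
print(f"AUROC = {point:.3f}, bootstrap SD = {sd:.4f}")
```

AUPRC or F1 could be handled the same way by passing a different metric function to `bootstrap_sd`.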
The SDs were computed from 500 bootstrap resamples of the test set. Visual examination of the PR curves indicated that SReRF’s performance improvement was substantial at thresholds associated with lower recall values but modest at higher-recall thresholds. Also, whereas the classical RF overfit the training set, SReRF had nearly identical performance on the training and test sets. The main predictors in SReRF (those appearing higher in the decision trees) were age, HDL and LDL cholesterol, hypertension, atrial fibrillation, chronic obstructive pulmonary disease, obesity, albumin, creatinine, and self-reported Hispanic/Latinx ethnicity.
Discussion and Conclusion: This study demonstrates promising results from integrating symbolic regression with random forests for CVD risk prediction. We found that SReRF outperformed classical RF. The top SReRF predictors align with known risk factors for CVD, adding face validity and reinforcing the model's potential clinical utility. Further, the symbolic expressions that determined splits produced compact and meaningful relationships between predictors, improving model interpretability. Expressions revealed the simultaneous presence of related comorbidities (e.g., hyperlipidemia and hypertension) and non-linear relationships between numerical predictors (e.g., age and LDL). Of note, the presence of Hispanic/Latinx ethnicity among the main predictors may reflect underlying disparities in social determinants of health or biological risk factors. Despite the recognized advantages of SR, limitations exist, including long training time, difficulty handling missing data, and limited scalability to high-dimensional feature sets. Future work will address these issues through enhanced pre-processing and computational optimization, incorporate additional clinical features (e.g., medications), and evaluate the model against other contemporary ML techniques.
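To make the idea of a split determined by a symbolic expression more concrete, the toy sketch below compares the best threshold split on a single feature with the best threshold split on a hand-written expression of two features. Everything in it is an assumption for illustration: the expression `age * ldl`, the cutoff of 9000, the synthetic data, and the function names are not taken from the SReRF model. No symbolic regression search is performed; the candidate expression is simply supplied by hand.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold_split(values, labels):
    """Return (weighted Gini impurity, threshold) for the best split `value <= t`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(y for _, y in pairs)
    best = (gini(labels), None)  # impurity with no split at all
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot place a threshold between equal values
        p_left = left_pos / i
        p_right = (total_pos - left_pos) / (n - i)
        impurity = (i * 2 * p_left * (1 - p_left)
                    + (n - i) * 2 * p_right * (1 - p_right)) / n
        if impurity < best[0]:
            best = (impurity, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best


# Hypothetical cohort: the outcome depends on an interaction between two features.
rng = random.Random(0)
age = [rng.uniform(20, 90) for _ in range(500)]
ldl = [rng.uniform(50, 200) for _ in range(500)]
outcome = [1 if a * l > 9000 else 0 for a, l in zip(age, ldl)]

# Ordinary decision-tree candidates: one feature, one threshold.
imp_age, _ = best_threshold_split(age, outcome)
imp_ldl, _ = best_threshold_split(ldl, outcome)

# Expression-based candidate: threshold applied to a function of both features.
expr = [a * l for a, l in zip(age, ldl)]
imp_expr, t_expr = best_threshold_split(expr, outcome)

print(f"age only:  impurity = {imp_age:.3f}")
print(f"ldl only:  impurity = {imp_ldl:.3f}")
print(f"age * ldl: impurity = {imp_expr:.3f} at threshold {t_expr:.0f}")
```

In this toy setting a single node that thresholds the expression separates the classes, whereas single-feature splits would need several levels to approximate the same boundary. That is the sense in which an expression-based split can be more compact.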
In conclusion, SReRF represents a promising, interpretable, and generalizable approach for CVD risk prediction, with strong potential for implementation in clinical settings, especially those with a high demand for transparency in predictive modeling.
References
[1] Soares C, Kwok M, Boucher K, et al. Performance of Cardiovascular Risk Prediction Models Among People Living With HIV: A Systematic Review and Meta-analysis. JAMA Cardiol. 2023;8(2):139–149.
[2] https://vsac.nlm.nih.gov/valueset/2.16.840.1.113762.1.4.1078.90/expansion/Latest
[3] Fong KS, Motani M. Symbolic Regression Enhanced Decision Trees for Classification Tasks. Proceedings of the AAAI Conference on Artificial Intelligence. 2024;38(11):12033–12042.