Knowledge-Guided Machine Learning for Precision Medicine: New Methods for Robust Biomarker Discovery and Interpretable Risk Stratification from High-Dimensional Omics Data
Files
- Embargoed until 2027-11-11
Abstract
Precision medicine aims to integrate individual molecular information with clinical profiles to improve disease prediction and treatment. However, biomedical omics data are typically both high-dimensional and limited in sample size (𝑝≫𝑛), creating major analytical challenges. Naïve application of "black box" machine learning models to such data can lead to overfitting and statistical instability. These problems, together with the models' inherent lack of biological interpretability, limit their value in clinical translation.
This work explores the value of integrating biological domain knowledge as a structural constraint to address these challenges. This knowledge-guided paradigm enhances data efficiency, improves statistical robustness, and ensures mechanistic interpretability. To validate this approach, two knowledge-guided computational methods were developed to address critical tasks in precision medicine: biomarker discovery and disease risk prediction.
For biomarker discovery, BAMBI (Biostatistical and Artificial-intelligence Methods for Biomarker Identification) was developed. By combining biologically informed statistical filtering with robust machine learning-based feature selection, BAMBI identifies parsimonious yet highly predictive RNA biomarker panels, outperforming existing methods in small-sample scenarios.
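The abstract does not describe BAMBI's internals, so the following is only a minimal sketch of the general filter-then-select idea in a 𝑝≫𝑛 setting, not BAMBI itself. It assumes a univariate Welch t-statistic filter and uses a bootstrap stability check as a stand-in for the machine-learning selection step. All function names, thresholds, and the synthetic data are hypothetical.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two samples (statistic only, no p-value)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def filter_features(X, y, keep):
    """Stage 1 (assumed): rank features by |t| between classes, keep the top `keep`."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(pos, neg)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:keep]]


def stable_features(X, y, keep, rounds, min_freq, rng):
    """Stage 2 (assumed): rerun the filter on bootstrap resamples and keep
    only features selected in at least `min_freq` of the usable rounds."""
    counts = {}
    usable = 0
    n = len(X)
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        # Each class needs at least two samples for a variance estimate.
        if yb.count(1) < 2 or yb.count(0) < 2:
            continue
        usable += 1
        for j in filter_features(Xb, yb, keep):
            counts[j] = counts.get(j, 0) + 1
    if usable == 0:
        return []
    return sorted(j for j, c in counts.items() if c / usable >= min_freq)


# Synthetic example: 20 samples, 100 features, only features 0 and 1 carry signal.
rng = random.Random(0)
y = [1] * 10 + [0] * 10
X = []
for label in y:
    row = [rng.gauss(0.0, 1.0) for _ in range(100)]
    if label == 1:
        row[0] += 4.0
        row[1] += 4.0
    X.append(row)

selected = stable_features(X, y, keep=5, rounds=30, min_freq=0.8, rng=rng)
print(selected)
```

Requiring a feature to survive most resamples is one simple way to trade panel size for stability when samples are scarce. A real pipeline would also need to run the selection inside cross-validation to avoid optimistic bias.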
For disease risk stratification, a pathway-guided Graph Neural Network was constructed to predict 10-year cardiovascular disease risk. By embedding pathway structures into the model architecture and incorporating a demographic-guided attention mechanism, this method substantially outperformed the Framingham Risk Score with respect to balanced accuracy and sensitivity while identifying biologically relevant central genes as key risk drivers.
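For reference, the two metrics named in this comparison can be computed from binary labels as sketched below. Sensitivity is the fraction of true cases that are detected, and balanced accuracy is the mean of sensitivity and specificity, which makes it less flattering than plain accuracy when cases are rare. The function name and the toy labels are illustrative assumptions, not data from this work.

```python
def sensitivity_and_balanced_accuracy(y_true, y_pred):
    """Return (sensitivity, balanced accuracy) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity, (sensitivity + specificity) / 2


# Toy labels (hypothetical): 4 cases, 6 non-cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, bal_acc = sensitivity_and_balanced_accuracy(y_true, y_pred)
print(f"sensitivity={sens:.3f}, balanced accuracy={bal_acc:.3f}")
```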
This work demonstrates that integrating biological knowledge into machine learning is a powerful strategy for translating complex omics data into robust, interpretable, and clinically actionable insights.