Genomics and Computational Biology Publications
ABOUT THIS COLLECTION
The Department of Genomics and Computational Biology (GCB) at UMass Chan Medical School was originally established as the Program in Bioinformatics and Integrative Biology in 2008. The group evolved into a full-fledged department in 2023, reflecting its growth and the expanding scope of its research. The department embodies the convergence of Computational Biology, Evolutionary Biology, and Genomics and is committed to advancing understanding of biological complexity through cutting-edge computational methods, evolutionary theory, and genomic technologies. This collection showcases journal articles and other publications produced by faculty and researchers of the Department of Genomics and Computational Biology.
Recently Published
-
Single cell RNA-sequencing reveals molecular signatures that distinguish allergic from irritant contact dermatitis
Allergic contact dermatitis (ACD) is a pruritic skin disease caused by environmental chemicals that induce cell-mediated skin inflammation within susceptible individuals. Irritant contact dermatitis (ICD) is caused by direct damage to the skin barrier by environmental insults. Diagnosis can be challenging as both types of contact dermatitis can appear similar by visual exam, and histopathological analysis does not reliably distinguish ACD from ICD. To discover specific biomarkers of ACD and ICD, we characterized the transcriptomic and proteomic changes that occur within the skin during each type of contact dermatitis. We induced ACD and ICD in healthy human volunteers and sampled skin using a non-scarring suction blister biopsy method that collects interstitial fluid and cellular infiltrate. Single cell RNA-sequencing analysis revealed that cell-specific transcriptome differences rather than cell type proportions best distinguished ACD from ICD. Allergy-specific genes were associated with upregulation of IFNG, and cell signaling network analysis implicated several other genes such as IL4, despite their low expression levels. We validated transcriptomic differences with proteomic assays on blister fluid and trained a logistic regression model on skin interstitial fluid proteins that could distinguish ACD from ICD and healthy control skin with 93% sensitivity and 93% specificity.
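For readers unfamiliar with the two performance figures quoted at the end of this abstract, the sketch below shows how sensitivity and specificity are computed from binary predictions. It is a minimal illustration only: the function name and the example labels are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    sensitivity = tp / (tp + fn)  # fraction of true positives detected
    specificity = tn / (tn + fp)  # fraction of true negatives detected
    return sensitivity, specificity


# Hypothetical labels: 1 = ACD, 0 = not ACD (made-up values, not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```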
-
Cohesin-mediated chromatin remodeling controls the differentiation and function of conventional dendritic cells [preprint]
The cohesin protein complex extrudes chromatin loops, stopping at CTCF-bound sites, to organize chromosomes into topologically associated domains, yet the biological implications of this process are poorly understood. We show that cohesin is required for the post-mitotic differentiation and function of antigen-presenting dendritic cells (DCs), particularly for antigen cross-presentation and IL-12 secretion by type 1 conventional DCs (cDC1s) in vivo. The chromatin organization of DCs was shaped by cohesin and the DC-specifying transcription factor IRF8, which controlled chromatin looping and chromosome compartmentalization, respectively. Notably, optimal expression of IRF8 itself required CTCF/cohesin-binding sites demarcating the Irf8 gene. During DC activation, cohesin was required for the induction of a subset of genes with distal enhancers. Accordingly, the deletion of CTCF sites flanking the Il12b gene reduced IL-12 production by cDC1s. Our data reveal an essential role of cohesin-mediated chromatin regulation in cell differentiation and function in vivo, and its bi-directional crosstalk with lineage-specifying transcription factors.
-
Biomarker Trajectory Prediction and Causal Analysis of the Impact of the Covid-19 Pandemic on CVD Patients using Machine Learning
Background: The COVID-19 pandemic disrupted healthcare services, increasing the susceptibility of high-risk patients, including those with cardiovascular diseases (CVDs), to adverse outcomes. Biomarkers provide insights into patients' underlying health status. However, few studies have investigated the effects of the COVID-19 pandemic on CVD biomarker trajectories using predictive modeling and causal analysis frameworks. Prior research explored the impacts of the COVID-19 pandemic on CVD severity and prognosis but did not investigate biomarker trajectories using Machine Learning (ML), which can discover complex multivariate relationships in multi-modal data. Objective: This study aimed to compare six ML regression models to select the best performing models for predicting biomarker trajectories in CVD patients using retrospective data. Subsequently, these models were used to assess the COVID-19 pandemic's impact on CVD patients and for causal analyses. Approach: Using ML regression and causal inference, this study investigated the pandemic's impact on biomarker values of 80,917 CVD patients and 77,332 non-CVD controls, treated at two hospitals in Central Massachusetts between May 2018 and December 2021. ML regression algorithms, including Neural Networks (NN), Decision Trees (DT), Random Forests (RF), XGBoost, CATBoost and ADABoost, were trained and compared. Important CVD biomarkers (HbA1c, LDL cholesterol, BMI, and BP) were predicted as outcome variables with patients' risk factors (age, race, gender, socioeconomic status) as input variables. Shapley feature importance analyses identified the most predictive features, which were then utilized in Causal Analysis. A Difference-in-Differences (DID) approach within a Double/Debiased Machine Learning (DML) method isolated the pandemic's impact on biomarkers, while minimizing the effects of confounding factors. Results: CATBoost and XGBoost were the most predictive ML models for LDL cholesterol and HbA1c, yielding R² values of 0.13 and 0.10, respectively. RF outperformed other models for BMI and BP, achieving R² values of 0.192 and 0.071. The small R² values were due to the prevalence of categorical features in the data with substantial variation in biomarker values. Feature importance analysis determined age, socioeconomic status, and race/ethnicity to be important drivers of biomarker changes, highlighting the role of social determinants of health. DML with DID analysis revealed a statistically significant increase (p-value <0.05) in BMI and systolic BP values for CVD patients during the COVID-19 pandemic compared to the control group, whereas their HbA1c and LDL cholesterol values actually improved during the pandemic, suggesting differential effects of the pandemic on key CVD biomarkers. Conclusion: Our proposed ML biomarker prediction models can facilitate personalized interventions and advance risk assessment for CVD patients. The predictive importance of factors such as age, socioeconomic status, and race highlights the need to address health disparities.
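The Difference-in-Differences idea mentioned in this abstract can be illustrated with the classic two-group, two-period estimator. The sketch below is a plain DID on group means, not the Double/Debiased Machine Learning variant the study used; the function name and all biomarker values are hypothetical.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Plain 2x2 difference-in-differences on group means.

    Returns the change in the treated group minus the change in the
    control group, which removes trends shared by both groups.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical BMI values (made up for illustration, not study data).
cvd_pre, cvd_post = [28.0, 30.0], [29.0, 31.0]
ctrl_pre, ctrl_post = [25.0, 27.0], [25.5, 27.5]

effect = did_estimate(cvd_pre, cvd_post, ctrl_pre, ctrl_post)
print(f"estimated effect on BMI: {effect:+.2f}")
```

A positive estimate here means the treated group's biomarker rose more than the control group's over the same period; it says nothing about confounders, which is the gap a method such as DML is meant to address.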
-
Impact of preanalytical factors on liquid biopsy in the canine cancer model [preprint]
While liquid biopsy has potential to transform cancer diagnostics through minimally-invasive detection and monitoring of tumors, the impact of preanalytical factors such as the timing and anatomical location of blood draw is not well understood. To address this gap, we leveraged pet dogs with spontaneous cancer as a model system, as their compressed disease timeline facilitates rapid diagnostic benchmarking. Key liquid biopsy metrics from dogs were consistent with existing reports from human patients. The tumor content of samples was higher from venipuncture sites closer to the tumor and from a central vein. Metrics also differed between lymphoma and non-hematopoietic cancers, urging cancer-type-specific interpretation. Liquid biopsy was highly sensitive to disease status, with changes identified soon after chemotherapy administration, and trends of increased tumor fraction and other metrics observed prior to clinical relapse in dogs with lymphoma or osteosarcoma. These data support the utility of pet dogs with cancer as a relevant system for advancing liquid biopsy platforms.
-
Systemic and skin-limited delayed-type drug hypersensitivity reactions associate with distinct resident and recruited T cell subsets
Delayed-type drug hypersensitivity reactions are major causes of morbidity and mortality. The origin, phenotype and function of pathogenic T cells across the spectrum of severity require investigation. We leveraged recent technical advancements to study skin-resident memory T cells (TRM) versus recruited T cell subsets in the pathogenesis of severe systemic forms of disease, SJS/TEN and DRESS, and skin-limited disease, morbilliform drug eruption (MDE). In SJS/TEN, microscopy, bulk transcriptional profiling and scRNAseq + CITEseq + TCRseq supported clonal expansion and recruitment of cytotoxic CD8+ T cells from circulation into skin, along with expanded and non-expanded cytotoxic CD8+ skin TRM. Comparatively, MDE displayed a cytotoxic T cell profile in skin without appreciable expansion and recruitment of cytotoxic CD8+ T cells from circulation, implicating TRM as potential protagonists in skin-limited disease. Mechanistic interrogation in patients unable to recruit T cells from circulation into skin and in a parallel mouse model supported that skin TRM were sufficient to mediate MDE. Concomitantly, SJS/TEN displayed a reduced regulatory T cell (Treg) signature compared to MDE. DRESS demonstrated recruitment of cytotoxic CD8+ T cells into skin like SJS/TEN, yet a pro-Treg signature like MDE. These findings have important implications for fundamental skin immunology and clinical care.
-
Bigtools: a high-performance BigWig and BigBed library in Rust
Motivation: The BigWig and BigBed file formats were originally designed for the visualization of next-generation sequencing data through a genome browser. Due to their versatility, these formats have long since become ubiquitous for the storage of processed sequencing data and regularly serve as the basis for downstream data analysis. As the number and size of sequencing experiments continue to grow, there is an increasing demand to efficiently generate and query BigWig and BigBed files in a scalable and robust manner, and to efficiently integrate these functionalities into data analysis environments and third-party applications. Results: Here, we present Bigtools, a feature-complete, high-performance, and integrable software library for generating and querying both BigWig and BigBed files. Bigtools is written in the Rust programming language and includes a flexible suite of command line tools as well as bindings to Python. Availability and implementation: Bigtools is cross-platform and released under the MIT license. It is distributed on Crates.io, Bioconda, and the Python Package Index, and the source code is available at https://github.com/jackh726/bigtools.
-
Pairtools: From sequencing data to chromosome contacts
The field of 3D genome organization produces large amounts of sequencing data from Hi-C and a rapidly-expanding set of other chromosome conformation protocols (3C+). Massive and heterogeneous 3C+ data require high-performance and flexible processing of sequenced reads into contact pairs. To meet these challenges, we present pairtools, a flexible suite of tools for contact extraction from sequencing data. Pairtools provides modular command-line interface (CLI) tools that can be flexibly chained into data processing pipelines. The core operations provided by pairtools are parsing of .sam alignments into Hi-C pairs, sorting and removal of PCR duplicates. In addition, pairtools provides auxiliary tools for building feature-rich 3C+ pipelines, including contact pair manipulation, filtration, and quality control. Benchmarking pairtools against popular 3C+ data pipelines shows advantages of pairtools for high-performance and flexible 3C+ analysis. Finally, pairtools provides protocol-specific tools for restriction-based protocols, haplotype-resolved contacts, and single-cell Hi-C. The combination of CLI tools and tight integration with Python data analysis libraries makes pairtools a versatile foundation for a broad range of 3C+ pipelines.
-
Single-cell genomics and regulatory networks for 388 human brains
Single-cell genomics is a powerful tool for studying heterogeneous tissues such as the brain. Yet little is understood about how genetic variants influence cell-level gene expression. Addressing this, we uniformly processed single-nuclei, multiomics datasets into a resource comprising >2.8 million nuclei from the prefrontal cortex across 388 individuals. For 28 cell types, we assessed population-level variation in expression and chromatin across gene families and drug targets. We identified >550,000 cell type-specific regulatory elements and >1.4 million single-cell expression quantitative trait loci, which we used to build cell-type regulatory and cell-to-cell communication networks. These networks manifest cellular changes in aging and neuropsychiatric disorders. We further constructed an integrative model accurately imputing single-cell expression and simulating perturbations; the model prioritized ~250 disease-risk genes and drug targets with associated cell types.
-
Cross-ancestry atlas of gene, isoform, and splicing regulation in the developing human brain
Neuropsychiatric genome-wide association studies (GWASs), including those for autism spectrum disorder and schizophrenia, show strong enrichment for regulatory elements in the developing brain. However, prioritizing risk genes and mechanisms is challenging without a unified regulatory atlas. Across 672 diverse developing human brains, we identified 15,752 genes harboring gene, isoform, and/or splicing quantitative trait loci, mapping 3739 to cellular contexts. Gene expression heritability drops during development, likely reflecting both increasing cellular heterogeneity and the intrinsic properties of neuronal maturation. Isoform-level regulation, particularly in the second trimester, mediated the largest proportion of GWAS heritability. Through colocalization, we prioritized mechanisms for about 60% of GWAS loci across five disorders, exceeding adult brain findings. Finally, we contextualized results within gene and isoform coexpression networks, revealing the comprehensive landscape of transcriptome regulation in development and disease.
-
Using a comprehensive atlas and predictive models to reveal the complexity and evolution of brain-active regulatory elements
Most genetic variants associated with psychiatric disorders are located in noncoding regions of the genome. To investigate their functional implications, we integrate epigenetic data from the PsychENCODE Consortium and other published sources to construct a comprehensive atlas of candidate brain cis-regulatory elements. Using deep learning, we model these elements' sequence syntax and predict how binding sites for lineage-specific transcription factors contribute to cell type-specific gene regulation in various types of glia and neurons. The elements' evolutionary history suggests that new regulatory information in the brain emerges primarily via smaller sequence mutations within conserved mammalian elements rather than entirely new human- or primate-specific sequences. However, primate-specific candidate elements, particularly those active during fetal brain development and in excitatory neurons and astrocytes, are implicated in the heritability of brain-related human traits. Additionally, we introduce PsychSCREEN, a web-based platform offering interactive visualization of PsychENCODE-generated genetic and epigenetic data from diverse brain cell types in individuals with psychiatric disorders and healthy controls.
-
Inferring causal cell types of human diseases and risk variants from candidate regulatory elements [preprint]
The heritability of human diseases is extremely enriched in candidate regulatory elements (cRE) from disease-relevant cell types. Critical next steps are to infer which and how many cell types are truly causal for a disease (after accounting for co-regulation across cell types), and to understand how individual variants impact disease risk through single or multiple causal cell types. Here, we propose CT-FM and CT-FM-SNP, two methods that leverage cell-type-specific cREs to fine-map causal cell types for a trait and for its candidate causal variants, respectively. We applied CT-FM to 63 GWAS summary statistics (average N = 417K) using nearly one thousand cRE annotations, primarily coming from ENCODE4. CT-FM inferred 81 causal cell types with corresponding SNP-annotations explaining a high fraction of trait SNP-heritability (~2/3 of the SNP-heritability explained by existing cREs), identified 16 traits with multiple causal cell types, highlighted cell-disease relationships consistent with known biology, and uncovered previously unexplored cellular mechanisms in psychiatric and immune-related diseases. Finally, we applied CT-FM-SNP to 39 UK Biobank traits and predicted high confidence causal cell types for 2,798 candidate causal non-coding SNPs. Our results suggest that most SNPs impact a phenotype through a single cell type, and that pleiotropic SNPs target different cell types depending on the phenotype context. Altogether, CT-FM and CT-FM-SNP shed light on how genetic variants act collectively and individually at the cellular level to impact disease risk.
-
Cooltools: Enabling high-resolution Hi-C analysis in Python
Chromosome conformation capture (3C) technologies reveal the incredible complexity of genome organization. Maps of increasing size, depth, and resolution are now used to probe genome architecture across cell states, types, and organisms. Larger datasets add challenges at each step of computational analysis, from storage and memory constraints to researchers' time; however, analysis tools that meet these increased resource demands have not kept pace. Furthermore, existing tools offer limited support for customizing analysis for specific use cases or new biology. Here we introduce cooltools (https://github.com/open2c/cooltools), a suite of computational tools that enables flexible, scalable, and reproducible analysis of high-resolution contact frequency data. Cooltools leverages the widely-adopted cooler format which handles storage and access for high-resolution datasets. Cooltools provides a paired command line interface (CLI) and Python application programming interface (API), which respectively facilitate workflows on high-performance computing clusters and in interactive analysis environments. In short, cooltools enables the effective use of the latest and largest genome folding datasets.
-
A single cell atlas of the mouse seminal vesicle [preprint]
During mammalian reproduction, sperm are delivered to the female reproductive tract bathed in a complex medium known as seminal fluid, which plays key roles in signaling to the female reproductive tract and in nourishing sperm for their onwards journey. Along with minor contributions from the prostate and the epididymis, the majority of seminal fluid is produced by a somewhat understudied organ known as the seminal vesicle. Here, we report the first single-cell RNA-seq atlas of the mouse seminal vesicle, generated using tissues obtained from 23 mice of varying ages, exposed to a range of dietary challenges. We define the transcriptome of the secretory cells in this tissue, identifying a relatively homogeneous population of the epithelial cells which are responsible for producing the majority of seminal fluid. We also define the immune cell populations - including large populations of macrophages, dendritic cells, T cells, and NKT cells - which have the potential to play roles in producing various immune mediators present in seminal plasma. Together, our data provide a resource for understanding the composition of an understudied reproductive tissue with potential implications for paternal control of offspring development and metabolism.
-
Single-cell genomics and regulatory networks for 388 human brains [preprint]
Single-cell genomics is a powerful tool for studying heterogeneous tissues such as the brain. Yet, little is understood about how genetic variants influence cell-level gene expression. Addressing this, we uniformly processed single-nuclei, multi-omics datasets into a resource comprising >2.8M nuclei from the prefrontal cortex across 388 individuals. For 28 cell types, we assessed population-level variation in expression and chromatin across gene families and drug targets. We identified >550K cell-type-specific regulatory elements and >1.4M single-cell expression-quantitative-trait loci, which we used to build cell-type regulatory and cell-to-cell communication networks. These networks manifest cellular changes in aging and neuropsychiatric disorders. We further constructed an integrative model accurately imputing single-cell expression and simulating perturbations; the model prioritized ~250 disease-risk genes and drug targets with associated cell types.
-
Vocal learning-associated convergent evolution in mammalian proteins and regulatory elements
Vocal production learning ("vocal learning") is a convergently evolved trait in vertebrates. To identify brain genomic elements associated with mammalian vocal learning, we integrated genomic, anatomical, and neurophysiological data from the Egyptian fruit bat (Rousettus aegyptiacus) with analyses of the genomes of 215 placental mammals. First, we identified a set of proteins evolving more slowly in vocal learners. Then, we discovered a vocal motor cortical region in the Egyptian fruit bat, an emergent vocal learner, and leveraged that knowledge to identify active cis-regulatory elements in the motor cortex of vocal learners. Machine learning methods applied to motor cortex open chromatin revealed 50 enhancers robustly associated with vocal learning whose activity tended to be lower in vocal learners. Our research implicates convergent losses of motor cortex regulatory elements in mammalian vocal learning evolution.
-
Expression of ALS-PFN1 impairs vesicular degradation in iPSC-derived microglia
Microglia play a pivotal role in neurodegenerative disease pathogenesis, but the mechanisms underlying microglia dysfunction and toxicity remain to be elucidated. To investigate the effect of neurodegenerative disease-linked genes on the intrinsic properties of microglia, we studied microglia-like cells derived from human induced pluripotent stem cells (iPSCs), termed iMGs, harboring mutations in profilin-1 (PFN1) that are causative for amyotrophic lateral sclerosis (ALS). ALS-PFN1 iMGs exhibited evidence of lipid dysmetabolism, autophagy dysregulation and deficient phagocytosis, a canonical microglia function. Mutant PFN1 also displayed enhanced binding affinity for PI3P, a critical signaling molecule involved in autophagic and endocytic processing. Our cumulative data implicate a gain-of-toxic function for mutant PFN1 within the autophagic and endo-lysosomal pathways, as administration of rapamycin rescued phagocytic dysfunction in ALS-PFN1 iMGs. These outcomes demonstrate the utility of iMGs for neurodegenerative disease research and implicate microglial vesicular degradation pathways in the pathogenesis of these disorders.
-
Multicenter integrated analysis of noncoding CRISPRi screens
The ENCODE Consortium's efforts to annotate noncoding cis-regulatory elements (CREs) have advanced our understanding of gene regulatory landscapes. Pooled, noncoding CRISPR screens offer a systematic approach to investigate cis-regulatory mechanisms. The ENCODE4 Functional Characterization Centers conducted 108 screens in human cell lines, comprising >540,000 perturbations across 24.85 megabases of the genome. Using 332 functionally confirmed CRE-gene links in K562 cells, we established guidelines for screening endogenous noncoding elements with CRISPR interference (CRISPRi), including accurate detection of CREs that exhibit variable, often low, transcriptional effects. Benchmarking five screen analysis tools, we find that CASA produces the most conservative CRE calls and is robust to artifacts of low-specificity single guide RNAs. We uncover a subtle DNA strand bias for CRISPRi in transcribed regions with implications for screen design and analysis. Together, we provide an accessible data resource, predesigned single guide RNAs for targeting 3,275,697 ENCODE SCREEN candidate CREs with CRISPRi and screening guidelines to accelerate functional characterization of the noncoding genome.
-
Genome-wide association study identifies 30 obsessive-compulsive disorder associated loci [preprint]
Obsessive-compulsive disorder (OCD) affects ~1% of the population and exhibits a high SNP-heritability, yet previous genome-wide association studies (GWAS) have provided limited information on the genetic etiology and underlying biological mechanisms of the disorder. We conducted a GWAS meta-analysis combining 53,660 OCD cases and 2,044,417 controls from 28 European-ancestry cohorts revealing 30 independent genome-wide significant SNPs and a SNP-based heritability of 6.7%. Separate GWAS for clinical, biobank, comorbid, and self-report sub-groups found no evidence of sample ascertainment impacting our results. Functional and positional QTL gene-based approaches identified 249 significant candidate risk genes for OCD, of which 25 were identified as putatively causal, highlighting WDR6, DALRD3, CTNND1 and genes in the MHC region. Tissue and single-cell enrichment analyses highlighted hippocampal and cortical excitatory neurons, along with D1- and D2-type dopamine receptor-containing medium spiny neurons, as playing a role in OCD risk. OCD displayed significant genetic correlations with 65 out of 112 examined phenotypes. Notably, it showed positive genetic correlations with all included psychiatric phenotypes, in particular anxiety, depression, anorexia nervosa, and Tourette syndrome, and negative correlations with a subset of the included autoimmune disorders, educational attainment, and body mass index. This study marks a significant step toward unraveling the genetic landscape of OCD, providing a foundation for future interventions to address this debilitating disorder.
-
Evaluating the spike in the symptomatic proportion of SARS-CoV-2 in China in 2022 with variolation effects: a modeling analysis
Despite most COVID-19 infections being asymptomatic, mainland China had a sharp increase in symptomatic cases at the end of 2022. In this study, we examine China's sudden COVID-19 symptomatic surge using a conceptual SIR-based model. Our model considers the epidemiological characteristics of SARS-CoV-2, particularly variolation, from non-pharmaceutical interventions (facial masking and social distancing), demography, and disease mortality in mainland China. The increase in symptomatic proportions in China may be attributable to (1) higher sensitivity and vulnerability during winter and (2) enhanced viral inhalation due to spikes in SARS-CoV-2 infections (high transmissibility). These two reasons could explain China's high symptomatic proportion of COVID-19 in December 2022. Our study, therefore, can serve as a decision-support tool to enhance SARS-CoV-2 prevention and control efforts. Thus, we highlight that facemask-induced variolation could potentially reduce transmissibility rather than severity in infected individuals. However, further investigation is required to understand the variolation effect on disease severity.
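As background for the "SIR-based model" named in this abstract, the sketch below integrates the textbook susceptible-infected-recovered equations with a simple Euler step. It is a generic SIR only: it omits the variolation, demography, and mortality terms the abstract describes, and the function name and parameter values are hypothetical.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Euler integration of the basic SIR model.

    beta  : transmission rate per day
    gamma : recovery rate per day
    Returns a list of (S, I, R) tuples, one per time step (including t = 0).
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population, constant in the basic model
    history = [(s, i, r)]
    for _ in range(int(round(days / dt))):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical parameters chosen only to produce a visible epidemic curve.
trajectory = simulate_sir(beta=0.5, gamma=0.1, s0=990, i0=10, r0=0, days=100)
peak_infected = max(i for _, i, _ in trajectory)
print(f"peak infected: {peak_infected:.0f} of 1000")
```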
-
Genome-wide association study identifies new loci associated with OCD [preprint]
To date, four genome-wide association studies (GWAS) of obsessive-compulsive disorder (OCD) have been published, reporting a high single-nucleotide polymorphism (SNP)-heritability of 28% but finding only one significant SNP. A substantial increase in sample size will likely lead to further identification of SNPs, genes, and biological pathways mediating the susceptibility to OCD. We conducted a GWAS meta-analysis with a 2-3-fold increase in case sample size (OCD cases: N = 37,015, controls: N = 948,616) compared to the last OCD GWAS, including six previously published cohorts (OCGAS, IOCDF-GC, IOCDF-GC-trio, NORDiC-nor, NORDiC-swe, and iPSYCH) and unpublished self-report data from 23andMe Inc. We explored the genetic architecture of OCD by conducting gene-based tests, tissue and cell type enrichment analyses, and estimating heritability and genetic correlations with 74 phenotypes. To examine potential heterogeneity in our data, we conducted multivariable GWASs with MTAG. We found support for 15 independent genome-wide significant loci (14 new) and 79 protein-coding genes. Tissue enrichment analyses implicate multiple cortical regions, the amygdala, and hypothalamus, while cell type analyses yielded 12 cell types linked to OCD (all neurons). The SNP-based heritability of OCD was estimated to be 0.08. Using MTAG we found evidence for specific genetic underpinnings characteristic of different cohort-ascertainment and identified additional significant SNPs. OCD was genetically correlated with 40 disorders or traits: positively with all psychiatric disorders and negatively with BMI, age at first birth and multiple autoimmune diseases. The GWAS meta-analysis identified several biologically informative genes as important contributors to the aetiology of OCD. Overall, we have begun laying the groundwork through which the biology of OCD will be understood and described.