Genomics and Computational Biology Publications

ABOUT THIS COLLECTION

The Department of Genomics and Computational Biology (GCB) at UMass Chan Medical School was originally established as the Program in Bioinformatics and Integrative Biology in 2008. The group became a full-fledged department in 2023, reflecting its growth and the expanding scope of its research. The department embodies the convergence of Computational Biology, Evolutionary Biology, and Genomics and is committed to advancing the understanding of biological complexity through cutting-edge computational methods, evolutionary theory, and genomic technologies. This collection showcases journal articles and other publications produced by faculty and researchers of the Department of Genomics and Computational Biology.

Recent Publications

  • Publication
    An Expanded Registry of Candidate cis-Regulatory Elements for Studying Transcriptional Regulation [preprint]
    (2024-12-26) Moore, Jill E; Pratt, Henry E; Fan, Kaili; Phalke, Nishigandha; Fisher, Jonathan; Elhajjajy, Shaimae I; Andrews, Gregory; Gao, Mingshi; Shedd, Nicole; Fu, Yu; Lacadie, Matthew C; Meza, Jair; Ganna, Mohit; Choudhury, Eva; Swofford, Ross; Farrell, Nina P; Pampari, Anusri; Ramalingam, Vivekanandan; Reese, Fairlie; Borsari, Beatrice; Yu, Michelle; Wattenberg, Eve; Ruiz-Romero, Marina; Razavi-Mohseni, Milad; Xu, Jinrui; Galeev, Timur; Beer, Michael A; Guigó, Roderic; Gerstein, Mark; Engreitz, Jesse; Ljungman, Mats; Reddy, Timothy E; Snyder, Michael P; Epstein, Charles B; Gaskell, Elizabeth; Bernstein, Bradley E; Dickel, Diane E; Visel, Axel; Pennacchio, Len A; Mortazavi, Ali; Kundaje, Anshul; Weng, Zhiping; Genomics and Computational Biology
    Mammalian genomes contain millions of regulatory elements that control the complex patterns of gene expression. Previously, the ENCODE Consortium mapped biochemical signals across many cell types and tissues and integrated these data to develop a Registry of 0.9 million human and 300 thousand mouse candidate cis-Regulatory Elements (cCREs) annotated with potential functions. We have expanded the Registry to include 2.35 million human and 927 thousand mouse cCREs, leveraging new ENCODE datasets and enhanced computational methods. This expanded Registry covers hundreds of unique cell and tissue types, providing a comprehensive understanding of gene regulation. Functional characterization data from assays like STARR-seq, MPRA, CRISPR perturbation, and transgenic mouse assays now cover over 90% of human cCREs, revealing complex regulatory functions. We identified thousands of novel silencer cCREs and demonstrated their dual enhancer/silencer roles in different cellular contexts. Integrating the Registry with other ENCODE annotations facilitates genetic variation interpretation and trait-associated gene identification, exemplified by the discovery of a novel causal gene for red blood cell traits. This expanded Registry is a valuable resource for studying the regulatory genome and its impact on health and disease.
  • Publication
    The Trajectory of KoRV-A Evolution Indicates Initial Integration into the Koala Germline Genome Near Coffs Harbour [preprint]
    (2024-12-23) Yu, Tianxiong; Blyton, Michaela B J; Koppetsch, Birgit S; Abajorga, Milky; Luban, Jeremy; Chappell, Keith; Theurkauf, William E; Weng, Zhiping; Program in Molecular Medicine; Genomics and Computational Biology
    Background: Koala retrovirus-A (KoRV-A) is a gamma-retrovirus that is spreading across wild koala populations through horizontal and vertical transmission, contributing significantly to genomic diversity across and even within koala populations. Previous studies have estimated that KoRV-A initially integrated into the koala genome less than 50,000 years ago, but the precise origins and the patterns of spread after its endogenization remain unclear. Results: In this study, we analyzed germline insertions of KoRV-A using whole-genome sequencing data from 405 wild koalas, representing nearly the species' entire geographic range. Our findings reveal an evolutionary trajectory for KoRV-A, suggesting that the initial endogenization might have occurred near Coffs Harbour on the Mid North Coast of NSW, around the middle of the koala's range. As KoRV-A spread, certain subtypes emerged and became prevalent, two of which recombined with an ancient endogenous retrovirus, PhER, resulting in distinct recombination variants in northern and southern koala populations. Additionally, we identified a geographic barrier north of Sydney, which may have slowed the southward spread of KoRV-A into Sydney and beyond. Conclusions: Our study proposes a comprehensive evolutionary pathway for KoRV-A, beginning with its initial endogenization near Coffs Harbour and highlighting barriers and diversification events that have shaped its distribution and impact on koala populations.
  • Publication
    An epidemiological modeling investigation of the long-term changing dynamics of the plague epidemics in Hong Kong
    (2024-10-28) Musa, Salihu S; Zhao, Shi; Mkandawire, Winnie; Colubri, Andrés; He, Daihai; Genomics and Computational Biology
    Identifying epidemic-driving factors through epidemiological modeling is a crucial public health strategy that has substantial policy implications for control and prevention initiatives. In this study, we employ dynamic modeling to investigate the transmission dynamics of pneumonic plague epidemics in Hong Kong from 1902 to 1904. Through the integration of human, flea, and rodent populations, we analyze the long-term changing trends and identify the epidemic-driving factors that influence pneumonic plague outbreaks. We examine the dynamics of the model and derive epidemic metrics, such as reproduction numbers, that are used to assess the effectiveness of intervention. By fitting our model to historical pneumonic plague data, we accurately capture the incidence curves observed during the epidemic periods. The fit reveals crucial insights into the dynamics of pneumonic plague transmission by identifying the epidemic-driving factors and quantities such as the lifespan of flea vectors, the rate of rodent spread, and demographic parameters. We emphasize that effective control measures must prioritize the elimination of flea and rodent vectors to mitigate future plague outbreaks. These findings underscore the significance of proactive intervention strategies in managing infectious diseases and informing public health policies.
  • Publication
    Biomarker Trajectory Prediction and Causal Analysis of the Impact of the Covid-19 Pandemic on CVD Patients using Machine Learning
    (2024-08-05) Inekwe, Trusting; Mkandawire, Winnie; Wee, Brian; Agu, Emmanuel; Colubri, Andres; Center for Accelerating Practices to End Suicide (CAPES); Genomics and Computational Biology; Morningside Graduate School of Biomedical Sciences; Winnie Mkandawire
    Background: The COVID-19 pandemic disrupted healthcare services, increasing the susceptibility of high-risk patients, including those with cardiovascular diseases (CVDs), to adverse outcomes. Biomarkers provide insights into patients' underlying health status. However, few studies have investigated the effects of the COVID-19 pandemic on CVD biomarker trajectories using predictive modeling and causal analysis frameworks. Prior research explored the impacts of the COVID-19 pandemic on CVD severity and prognosis but did not investigate biomarker trajectories using Machine Learning (ML), which can discover complex multivariate relationships in multi-modal data. Objective: This study aimed to compare six ML regression models to select the best performing models for predicting biomarker trajectories in CVD patients using retrospective data. Subsequently, these models were used to assess the COVID-19 pandemic's impact on CVD patients and for causal analyses. Approach: Using ML regression and causal inference, this study investigated the pandemic's impact on biomarker values of 80,917 CVD patients and 77,332 non-CVD controls, treated at two hospitals in Central Massachusetts between May 2018 and December 2021. ML regression algorithms, including Neural Networks (NN), Decision Trees (DT), Random Forests (RF), XGBoost, CatBoost, and AdaBoost, were trained and compared. Important CVD biomarkers (HbA1c, LDL cholesterol, BMI, and BP) were predicted as outcome variables with patients' risk factors (age, race, gender, socioeconomic status) as input variables. Shapley feature importance analyses identified the most predictive features, which were then utilized in causal analysis. A Difference-in-Differences (DID) approach within a Double/Debiased Machine Learning (DML) method isolated the pandemic's impact on biomarkers, while minimizing the effects of confounding factors. Results: CatBoost and XGBoost were the most predictive ML models for LDL cholesterol and HbA1c, yielding R² values of 0.13 and 0.10, respectively. RF outperformed other models for BMI and BP, achieving R² values of 0.192 and 0.071. The small R² values were due to the prevalence of categorical features in the data with substantial variation in biomarker values. Feature importance analysis determined age, socioeconomic status, and race/ethnicity to be important drivers of biomarker changes, highlighting the role of social determinants of health. DML with DID analysis revealed a statistically significant increase (p-value <0.05) in BMI and systolic BP values for CVD patients during the COVID-19 pandemic compared to the control group, whereas their HbA1c and LDL cholesterol values actually improved during the pandemic, suggesting differential effects of the pandemic on key CVD biomarkers. Conclusion: Our proposed ML biomarker prediction models can facilitate personalized interventions and advance risk assessment for CVD patients. The predictive importance of factors such as age, socioeconomic status, and race highlights the need to address health disparities.
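    The Difference-in-Differences idea named in this abstract can be illustrated with a minimal sketch. It shows only the plain two-group, two-period estimator (change in the CVD group minus change in the control group), not the Double/Debiased Machine Learning procedure the authors describe. The function name `did_estimate` and all biomarker values are hypothetical and invented for illustration; they are not study data.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Plain difference-in-differences: treated change minus control change."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical BMI values, invented for illustration only (not study data).
cvd_pre = [29.0, 31.0, 30.0]    # CVD group before the pandemic
cvd_post = [30.0, 32.0, 31.0]   # CVD group during the pandemic
ctl_pre = [26.0, 27.0, 25.0]    # control group before the pandemic
ctl_post = [26.5, 27.5, 25.5]   # control group during the pandemic

effect = did_estimate(cvd_pre, cvd_post, ctl_pre, ctl_post)
print(f"Estimated effect on BMI attributable to the period: {effect:.2f}")
```

    Subtracting the control group's change removes trends shared by both groups, which is why the estimator relies on the assumption that the two groups would otherwise have moved in parallel.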
  • Publication
    Impact of preanalytical factors on liquid biopsy in the canine cancer model [preprint]
    (2024-07-30) Megquier, Kate; Husted, Christopher; Rhoades, Justin; White, Michelle E; Genereux, Diane P; Chen, Frances L; Xiong, Kan; Kwon, Euijin; Swofford, Ross; Painter, Corrie; Adalsteinsson, Viktor; London, Cheryl A; Gardner, Heather L; Karlsson, Elinor K; Genomics and Computational Biology; Morningside Graduate School of Biomedical Sciences; Program in Molecular Medicine; Christopher Husted; Euijin Kwon
    While liquid biopsy has the potential to transform cancer diagnostics through minimally invasive detection and monitoring of tumors, the impact of preanalytical factors such as the timing and anatomical location of blood draw is not well understood. To address this gap, we leveraged pet dogs with spontaneous cancer as a model system, as their compressed disease timeline facilitates rapid diagnostic benchmarking. Key liquid biopsy metrics from dogs were consistent with existing reports from human patients. The tumor content of samples was higher from venipuncture sites closer to the tumor and from a central vein. Metrics also differed between lymphoma and non-hematopoietic cancers, urging cancer-type-specific interpretation. Liquid biopsy was highly sensitive to disease status, with changes identified soon after chemotherapy administration, and trends of increased tumor fraction and other metrics observed prior to clinical relapse in dogs with lymphoma or osteosarcoma. These data support the utility of pet dogs with cancer as a relevant system for advancing liquid biopsy platforms.
  • Publication
    Cohesin-mediated chromatin remodeling controls the differentiation and function of conventional dendritic cells [preprint]
    (2024-09-22) Adams, Nicholas M; Galitsyna, Aleksandra; Tiniakou, Ioanna; Esteva, Eduardo; Lau, Colleen M; Reyes, Jojo; Abdennur, Nezar; Shkolikov, Alexey; Yap, George S; Khodadadi-Jamayran, Alireza; Mirny, Leonid A; Reizis, Boris; Genomics and Computational Biology; Systems Biology
    The cohesin protein complex extrudes chromatin loops, stopping at CTCF-bound sites, to organize chromosomes into topologically associated domains, yet the biological implications of this process are poorly understood. We show that cohesin is required for the post-mitotic differentiation and function of antigen-presenting dendritic cells (DCs), particularly for antigen cross-presentation and IL-12 secretion by type 1 conventional DCs (cDC1s) in vivo. The chromatin organization of DCs was shaped by cohesin and the DC-specifying transcription factor IRF8, which controlled chromatin looping and chromosome compartmentalization, respectively. Notably, optimal expression of IRF8 itself required CTCF/cohesin-binding sites demarcating the Irf8 gene. During DC activation, cohesin was required for the induction of a subset of genes with distal enhancers. Accordingly, the deletion of CTCF sites flanking the Il12b gene reduced IL-12 production by cDC1s. Our data reveal an essential role of cohesin-mediated chromatin regulation in cell differentiation and function in vivo, and its bi-directional crosstalk with lineage-specifying transcription factors.
  • Publication
    Single cell RNA-sequencing reveals molecular signatures that distinguish allergic from irritant contact dermatitis
    (2024-09-26) Frisoli, Michael L; Ko, Wei-Che C; Martinez, Nuria; Afshari, Khashayar; Wang, Yuqing; Garber, Manuel; Harris, John E; Dermatology; Genomics and Computational Biology; Morningside Graduate School of Biomedical Sciences; Garber Lab; Michael L Frisoli; Yuqing Wang
    Allergic contact dermatitis (ACD) is a pruritic skin disease caused by environmental chemicals that induce cell-mediated skin inflammation within susceptible individuals. Irritant contact dermatitis (ICD) is caused by direct damage to the skin barrier by environmental insults. Diagnosis can be challenging as both types of contact dermatitis can appear similar by visual exam, and histopathological analysis does not reliably distinguish ACD from ICD. To discover specific biomarkers of ACD and ICD, we characterized the transcriptomic and proteomic changes that occur within the skin during each type of contact dermatitis. We induced ACD and ICD in healthy human volunteers and sampled skin using a non-scarring suction blister biopsy method that collects interstitial fluid and cellular infiltrate. Single cell RNA-sequencing analysis revealed that cell-specific transcriptome differences rather than cell type proportions best distinguished ACD from ICD. Allergy-specific genes were associated with upregulation of IFNG, and cell signaling network analysis implicated several other genes such as IL4, despite their low expression levels. We validated transcriptomic differences with proteomic assays on blister fluid and trained a logistic regression model on skin interstitial fluid proteins that could distinguish ACD from ICD and healthy control skin with 93% sensitivity and 93% specificity.
  • Publication
    Systemic and skin-limited delayed-type drug hypersensitivity reactions associate with distinct resident and recruited T cell subsets
    (2024-07-23) Shah, Pranali N; Romar, George A; Manukyan, Artür; Ko, Wei-Che; Hsieh, Pei-Chen; Velasquez, Gustavo A; Schunkert, Elisa M; Fu, Xiaopeng; Guleria, Indira; Bronson, Roderick T; Wei, Kevin; Waldman, Abigail H; Vleugels, Frank R; Liang, Marilyn G; Giobbie-Hurder, Anita; Mostaghimi, Arash; Schmidt, Birgitta Ar; Barrera, Victor; Foreman, Ruth K; Garber, Manuel; Divito, Sherrie J; Dermatology; Genomics and Computational Biology; Garber Lab
    Delayed-type drug hypersensitivity reactions are major causes of morbidity and mortality. The origin, phenotype, and function of pathogenic T cells across the spectrum of severity require investigation. We leveraged recent technical advancements to study skin-resident memory T cells (TRM) versus recruited T cell subsets in the pathogenesis of severe systemic forms of disease, SJS/TEN and DRESS, and skin-limited disease, morbilliform drug eruption (MDE). In SJS/TEN, microscopy, bulk transcriptional profiling, and scRNAseq + CITEseq + TCRseq supported clonal expansion and recruitment of cytotoxic CD8+ T cells from circulation into skin, along with expanded and non-expanded cytotoxic CD8+ skin TRM. Comparatively, MDE displayed a cytotoxic T cell profile in skin without appreciable expansion and recruitment of cytotoxic CD8+ T cells from circulation, implicating TRM as potential protagonists in skin-limited disease. Mechanistic interrogation in patients unable to recruit T cells from circulation into skin and in a parallel mouse model supported that skin TRM were sufficient to mediate MDE. Concomitantly, SJS/TEN displayed a reduced regulatory T cell (Treg) signature compared to MDE. DRESS demonstrated recruitment of cytotoxic CD8+ T cells into skin like SJS/TEN, yet a pro-Treg signature like MDE. These findings have important implications for fundamental skin immunology and clinical care.
  • Publication
    Bigtools: a high-performance BigWig and BigBed library in Rust
    (2024-06-03) Huey, Jack D; Abdennur, Nezar; Diabetes Center of Excellence; Genomics and Computational Biology; Morningside Graduate School of Biomedical Sciences; Program in Molecular Medicine; Systems Biology; Jack D Huey
    Motivation: The BigWig and BigBed file formats were originally designed for the visualization of next-generation sequencing data through a genome browser. Due to their versatility, these formats have long since become ubiquitous for the storage of processed sequencing data and regularly serve as the basis for downstream data analysis. As the number and size of sequencing experiments continue to grow, there is an increasing demand to efficiently generate and query BigWig and BigBed files in a scalable and robust manner, and to efficiently integrate these functionalities into data analysis environments and third-party applications. Results: Here, we present Bigtools, a feature-complete, high-performance, and integrable software library for generating and querying both BigWig and BigBed files. Bigtools is written in the Rust programming language and includes a flexible suite of command line tools as well as bindings to Python. Availability and implementation: Bigtools is cross-platform and released under the MIT license. It is distributed on Crates.io, Bioconda, and the Python Package Index, and the source code is available at https://github.com/jackh726/bigtools.
  • Publication
    Evaluating instruments for assessing healthspan: a multi-center cross-sectional study on health-related quality of life (HRQL) and frailty in the companion dog
    (2023-02-13) Chen, Frances L; Ullal, Tarini V; Graves, Jessica L; Ratcliff, Ellen R; Naka, Alexander; McKenzie, Brennen; Carttar, Tennery A; Super, Kaitlyn M; Austriaco, Jessica; Weber, Sunny Y; Vaughn, Julie; LaCroix-Fralish, Michael L; Genomics and Computational Biology
    Developing valid tools that assess key determinants of canine healthspan such as frailty and health-related quality of life (HRQL) is essential to characterizing and understanding aging in dogs. Additionally, because the companion dog is an excellent translational model for humans, such tools can be applied to evaluate gerotherapeutics and investigate mechanisms underlying longevity in both dogs and humans. In this multi-center, cross-sectional study, we investigated the use of a clinical questionnaire (Canine Frailty Index; CFI; Banzato et al., 2019) to assess frailty and an owner assessment tool (VetMetrica HRQL) to evaluate HRQL in 451 adult companion dogs. Results demonstrated validity of the tools by confirming expectations that frailty score increases and HRQL scores deteriorate with age. CFI scores were significantly higher (higher frailty) and HRQL scores significantly lower (worse HRQL) in old dogs (≥ 7 years of age) compared to young dogs (≥ 2 and < 6 years of age). Body size (small < 11.3 kg (25 lbs) or large > 22.7 kg (50 lbs)) was not associated with CFI or total HRQL score. However, older, larger dogs showed faster age-related decline in HRQL scores specific to owner-reported activity and comfort. Findings suggest that the clinician-assessed CFI and owner-reported VetMetrica HRQL are useful tools to evaluate two determinants of healthspan in dogs: the accumulation of frailty and the progressive decline in quality of life. Establishing tools that operationalize the assessment of canine healthspan is critical for the advancement of geroscience and the development of gerotherapeutics that benefit both human and veterinary medicine.
  • Publication
    Cross-ancestry atlas of gene, isoform, and splicing regulation in the developing human brain
    (2024-05-24) Wen, Cindy; Margolis, Michael; Dai, Rujia; Zhang, Pan; Przytycki, Pawel F; Vo, Daniel D; Bhattacharya, Arjun; Matoba, Nana; Tang, Miao; Jiao, Chuan; Kim, Minsoo; Tsai, Ellen; Hoh, Celine; Aygün, Nil; Walker, Rebecca L; Chatzinakos, Christos; Clarke, Declan; Pratt, Henry E; Peters, Mette A; Gerstein, Mark; Daskalakis, Nikolaos P; Weng, Zhiping; Jaffe, Andrew E; Kleinman, Joel E; Hyde, Thomas M; Weinberger, Daniel R; Bray, Nicholas J; Sestan, Nenad; Geschwind, Daniel H; Roeder, Kathryn; Gusev, Alexander; Pasaniuc, Bogdan; Stein, Jason L; Love, Michael I; Pollard, Katherine S; Liu, Chunyu; Gandal, Michael J; Genomics and Computational Biology; Program in Bioinformatics and Integrative Biology
    Neuropsychiatric genome-wide association studies (GWASs), including those for autism spectrum disorder and schizophrenia, show strong enrichment for regulatory elements in the developing brain. However, prioritizing risk genes and mechanisms is challenging without a unified regulatory atlas. Across 672 diverse developing human brains, we identified 15,752 genes harboring gene, isoform, and/or splicing quantitative trait loci, mapping 3739 to cellular contexts. Gene expression heritability drops during development, likely reflecting both increasing cellular heterogeneity and the intrinsic properties of neuronal maturation. Isoform-level regulation, particularly in the second trimester, mediated the largest proportion of GWAS heritability. Through colocalization, we prioritized mechanisms for about 60% of GWAS loci across five disorders, exceeding adult brain findings. Finally, we contextualized results within gene and isoform coexpression networks, revealing the comprehensive landscape of transcriptome regulation in development and disease.
  • Publication
    Single-cell genomics and regulatory networks for 388 human brains
    (2024-05-24) Emani, Prashant S; Liu, Jason J; Clarke, Declan; Jensen, Matthew; Warrell, Jonathan; Gupta, Chirag; Meng, Ran; Lee, Che Yu; Xu, Siwei; Dursun, Cagatay; Lou, Shaoke; Chen, Yuhang; Chu, Zhiyuan; Galeev, Timur; Hwang, Ahyeon; Li, Yunyang; Ni, Pengyu; Zhou, Xiao; Bakken, Trygve E; Bendl, Jaroslav; Bicks, Lucy; Chatterjee, Tanima; Cheng, Lijun; Cheng, Yuyan; Dai, Yi; Duan, Ziheng; Flaherty, Mary; Fullard, John F; Gancz, Michael; Garrido-Martín, Diego; Gaynor-Gillett, Sophia; Grundman, Jennifer; Hawken, Natalie; Henry, Ella; Hoffman, Gabriel E; Huang, Ao; Jiang, Yunzhe; Jin, Ting; Jorstad, Nikolas L; Kawaguchi, Riki; Khullar, Saniya; Liu, Jianyin; Liu, Junhao; Liu, Shuang; Ma, Shaojie; Margolis, Michael; Mazariegos, Samantha; Moore, Jill E; Moran, Jennifer R; Nguyen, Eric; Phalke, Nishigandha; Pjanic, Milos; Pratt, Henry E; Quintero, Diana; Rajagopalan, Ananya S; Riesenmy, Tiernon R; Shedd, Nicole; Shi, Manman; Spector, Megan; Terwilliger, Rosemarie; Travaglini, Kyle J; Wamsley, Brie; Wang, Gaoyuan; Xia, Yan; Xiao, Shaohua; Yang, Andrew C; Zheng, Suchen; Gandal, Michael J; Lee, Donghoon; Lein, Ed S; Roussos, Panos; Sestan, Nenad; Weng, Zhiping; White, Kevin P; Won, Hyejung; Girgenti, Matthew J; Zhang, Jing; Wang, Daifeng; Geschwind, Daniel; Gerstein, Mark; Genomics and Computational Biology; Morningside Graduate School of Biomedical Sciences; Nicole Shedd
    Single-cell genomics is a powerful tool for studying heterogeneous tissues such as the brain. Yet little is understood about how genetic variants influence cell-level gene expression. Addressing this, we uniformly processed single-nuclei, multiomics datasets into a resource comprising >2.8 million nuclei from the prefrontal cortex across 388 individuals. For 28 cell types, we assessed population-level variation in expression and chromatin across gene families and drug targets. We identified >550,000 cell type-specific regulatory elements and >1.4 million single-cell expression quantitative trait loci, which we used to build cell-type regulatory and cell-to-cell communication networks. These networks manifest cellular changes in aging and neuropsychiatric disorders. We further constructed an integrative model accurately imputing single-cell expression and simulating perturbations; the model prioritized ~250 disease-risk genes and drug targets with associated cell types.
  • Publication
    Using a comprehensive atlas and predictive models to reveal the complexity and evolution of brain-active regulatory elements
    (2024-05-23) Pratt, Henry E; Andrews, Gregory; Shedd, Nicole; Phalke, Nishigandha; Li, Tongxin; Pampari, Anusri; Jensen, Matthew; Wen, Cindy; Gandal, Michael J; Geschwind, Daniel H; Gerstein, Mark; Moore, Jill E; Kundaje, Anshul; Colubri, Andrés; Weng, Zhiping; Genomics and Computational Biology; Morningside Graduate School of Biomedical Sciences; Nicole Shedd
    Most genetic variants associated with psychiatric disorders are located in noncoding regions of the genome. To investigate their functional implications, we integrate epigenetic data from the PsychENCODE Consortium and other published sources to construct a comprehensive atlas of candidate brain cis-regulatory elements. Using deep learning, we model these elements' sequence syntax and predict how binding sites for lineage-specific transcription factors contribute to cell type-specific gene regulation in various types of glia and neurons. The elements' evolutionary history suggests that new regulatory information in the brain emerges primarily via smaller sequence mutations within conserved mammalian elements rather than entirely new human- or primate-specific sequences. However, primate-specific candidate elements, particularly those active during fetal brain development and in excitatory neurons and astrocytes, are implicated in the heritability of brain-related human traits. Additionally, we introduce PsychSCREEN, a web-based platform offering interactive visualization of PsychENCODE-generated genetic and epigenetic data from diverse brain cell types in individuals with psychiatric disorders and healthy controls.
  • Publication
    Cooltools: Enabling high-resolution Hi-C analysis in Python
    (2024-05-06) Abdennur, Nezar; Abraham, Sameer; Fudenberg, Geoffrey; Flyamer, Ilya M; Galitsyna, Aleksandra A; Goloborodko, Anton; Imakaev, Maxim; Akgol Oksuz, Betul; Venev, Sergey V; Xiao, Yao; Genomics and Computational Biology; Systems Biology
    Chromosome conformation capture (3C) technologies reveal the incredible complexity of genome organization. Maps of increasing size, depth, and resolution are now used to probe genome architecture across cell states, types, and organisms. Larger datasets add challenges at each step of computational analysis, from storage and memory constraints to researchers' time; however, analysis tools that meet these increased resource demands have not kept pace. Furthermore, existing tools offer limited support for customizing analysis for specific use cases or new biology. Here we introduce cooltools (https://github.com/open2c/cooltools), a suite of computational tools that enables flexible, scalable, and reproducible analysis of high-resolution contact frequency data. Cooltools leverages the widely-adopted cooler format which handles storage and access for high-resolution datasets. Cooltools provides a paired command line interface (CLI) and Python application programming interface (API), which respectively facilitate workflows on high-performance computing clusters and in interactive analysis environments. In short, cooltools enables the effective use of the latest and largest genome folding datasets.
  • Publication
    Pairtools: From sequencing data to chromosome contacts
    (2024-05-29) Abdennur, Nezar; Fudenberg, Geoffrey; Flyamer, Ilya M; Galitsyna, Aleksandra A; Goloborodko, Anton; Imakaev, Maxim; Venev, Sergey V; Genomics and Computational Biology; Program in Bioinformatics and Integrative Biology; Systems Biology
    The field of 3D genome organization produces large amounts of sequencing data from Hi-C and a rapidly expanding set of other chromosome conformation protocols (3C+). Massive and heterogeneous 3C+ data require high-performance and flexible processing of sequenced reads into contact pairs. To meet these challenges, we present pairtools, a flexible suite of tools for contact extraction from sequencing data. Pairtools provides modular command-line interface (CLI) tools that can be flexibly chained into data processing pipelines. The core operations provided by pairtools are parsing of .sam alignments into Hi-C pairs, sorting, and removal of PCR duplicates. In addition, pairtools provides auxiliary tools for building feature-rich 3C+ pipelines, including contact pair manipulation, filtration, and quality control. Benchmarking pairtools against popular 3C+ data pipelines shows advantages of pairtools for high-performance and flexible 3C+ analysis. Finally, pairtools provides protocol-specific tools for restriction-based protocols, haplotype-resolved contacts, and single-cell Hi-C. The combination of CLI tools and tight integration with Python data analysis libraries makes pairtools a versatile foundation for a broad range of 3C+ pipelines.
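    The sorting and duplicate-removal steps named in this abstract can be illustrated conceptually. The sketch below is plain Python and does not use or reproduce pairtools itself; the function `sort_and_dedup`, the tuple layout (chrom1, pos1, strand1, chrom2, pos2, strand2), and the example pairs are all hypothetical, and only exact positional duplicates are dropped.

```python
def sort_and_dedup(pairs):
    """Sort contact pairs, then keep one copy of each exact duplicate.

    After sorting, identical pairs are adjacent, so a single pass that
    compares each pair with the last kept pair is enough.
    """
    unique = []
    for pair in sorted(pairs):
        if not unique or pair != unique[-1]:
            unique.append(pair)
    return unique


# Hypothetical contact pairs: (chrom1, pos1, strand1, chrom2, pos2, strand2).
pairs = [
    ("chr1", 1500, "+", "chr2", 300, "-"),
    ("chr1", 100, "+", "chr1", 900, "-"),
    ("chr1", 1500, "+", "chr2", 300, "-"),  # exact copy of the first pair
]

deduped = sort_and_dedup(pairs)
for pair in deduped:
    print(pair)
```

    Sorting first is what makes duplicate removal a linear scan, which is one reason the two operations are commonly chained in that order.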
  • Publication
    Genome-wide association study identifies 30 obsessive-compulsive disorder associated loci [preprint]
    (2024-03-13) Strom, Nora I; Gerring, Zachary F; Galimberti, Marco; Yu, Dongmei; Halvorsen, Matthew W; Abdellaoui, Abdel; Rodriguez-Fontenla, Cristina; Sealock, Julia M; Bigdeli, Tim; Coleman, Jonathan R; Mahjani, Behrang; Thorp, Jackson G; Bey, Katharina; Burton, Christie L; Luykx, Jurjen J; Zai, Gwyneth; Alemany, Silvia; Andre, Christine; Askland, Kathleen D; Banaj, Nerisa; Barlassina, Cristina; Nissen, Judith Becker; Bienvenu, O Joseph; Black, Donald; Bloch, Michael H; Boberg, Julia; Børte, Sigrid; Bosch, Rosa; Breen, Michael; Brennan, Brian P; Brentani, Helena; Buxbaum, Joseph D; Bybjerg-Grauholm, Jonas; Byrne, Enda M; Cabana-Dominguez, Judit; Camarena, Beatriz; Camarena, Adrian; Cappi, Carolina; Carracedo, Angel; Casas, Miguel; Cavallini, Maria Cristina; Ciullo, Valentina; Cook, Edwin H; Crosby, Jesse; Cullen, Bernadette A; De Schipper, Elles J; Delorme, Richard; Djurovic, Srdjan; Elias, Jason A; Estivill, Xavier; Falkenstein, Martha J; Fundin, Bengt T; Garner, Lauryn; German, Chris; Gironda, Christina; Goes, Fernando S; Grados, Marco A; Grove, Jakob; Guo, Wei; Haavik, Jan; Hagen, Kristen; Harrington, Kelly; Havdahl, Alexandra; Höffler, Kira D; Hounie, Ana G; Hucks, Donald; Hultman, Christina; Janecka, Magdalena; Jenike, Eric; Karlsson, Elinor K; Kelley, Kara; Klawohn, Julia; Krasnow, Janice E; Krebs, Kristi; Lange, Christoph; Lanzagorta, Nuria; Levey, Daniel; Lindblad-Toh, Kerstin; Macciardi, Fabio; Maher, Brion; Mathes, Brittany; McArthur, Evonne; McGregor, Nathaniel; McLaughlin, Nicole C; Meier, Sandra; Miguel, Euripedes C; Mulhern, Maureen; Nestadt, Paul S; Nurmi, Erika L; O'Connell, Kevin S; Osiecki, Lisa; Ousdal, Olga Therese; Palviainen, Teemu; Pedersen, Nancy L; Piras, Fabrizio; Piras, Federica; Potluri, Sriramya; Rabionet, Raquel; Ramirez, Alfredo; Rauch, Scott; Reichenberg, Abraham; Riddle, Mark A; Ripke, Stephan; Rosário, Maria C; Sampaio, Aline S; Schiele, Miriam A; Skogholt, Anne Heidi; Sloofman, Laura G; Smit, Jan; Soler, Artigas María; Thomas, Laurent F; Tifft, Eric; Vallada, Homero; van Kirk, Nathanial; Veenstra-VanderWeele, Jeremy; Vulink, Nienke N; Walker, Christopher P; Wang, Ying; Wendland, Jens R; Winsvold, Bendik S; Yao, Yin; Zhou, Hang; Agrawal, Arpana; Alonso, Pino; Berberich, Götz; Bucholz, Kathleen K; Bulik, Cynthia M; Cath, Danielle; Denys, Damiaan; Eapen, Valsamma; Edenberg, Howard; Falkai, Peter; Fernandez, Thomas V; Fyer, Abby J; Gaziano, J M; Geller, Dan A; Grabe, Hans J; Greenberg, Benjamin D; Hanna, Gregory L; Hickie, Ian B; Hougaard, David M; Kathmann, Norbert; Kennedy, James; Lai, Dongbing; Landén, Mikael; Le Hellard, Stéphanie; Leboyer, Marion; Lochner, Christine; McCracken, James T; Medland, Sarah E; Mortensen, Preben B; Neale, Benjamin M; Nicolini, Humberto; Nordentoft, Merete; Pato, Michele; Pato, Carlos; Pauls, David L; Piacentini, John; Pittenger, Christopher; Posthuma, Danielle; Ramos-Quiroga, Josep Antoni; Rasmussen, Steven A; Richter, Margaret A; Rosenberg, David R; Ruhrmann, Stephan; Samuels, Jack F; Sandin, Sven; Sandor, Paul; Spalletta, Gianfranco; Stein, Dan J; Stewart, S Evelyn; Storch, Eric A; Stranger, Barbara E; Turiel, Maurizio; Werge, Thomas; Andreassen, Ole A; Børglum, Anders D; Walitza, Susanne; Hveem, Kristian; Hansen, Bjarne K; Rück, Christian P; Martin, Nicholas G; Milani, Lili; Mors, Ole; Reichborn-Kjennerud, Ted; Ribasés, Marta; Kvale, Gerd; Mataix-Cols, David; Domschke, Katharina; Grünblatt, Edna; Wagner, Michael; Zwart, John-Anker; Breen, Gerome; Nestadt, Gerald; Kaprio, Jaakko; Arnold, Paul D; Grice, Dorothy E; Knowles, James A; Ask, Helga; Verweij, Karin J; Davis, Lea K; Smit, Dirk J; Crowley, James J; Scharf, Jeremiah M; Stein, Murray B; Gelernter, Joel; Mathews, Carol A; Derks, Eske M; Mattheisen, Manuel; Genomics and Computational Biology
    Obsessive-compulsive disorder (OCD) affects ~1% of the population and exhibits a high SNP-heritability, yet previous genome-wide association studies (GWAS) have provided limited information on the genetic etiology and underlying biological mechanisms of the disorder. We conducted a GWAS meta-analysis combining 53,660 OCD cases and 2,044,417 controls from 28 European-ancestry cohorts, revealing 30 independent genome-wide significant SNPs and a SNP-based heritability of 6.7%. Separate GWAS for clinical, biobank, comorbid, and self-report sub-groups found no evidence of sample ascertainment impacting our results. Functional and positional QTL gene-based approaches identified 249 significant candidate risk genes for OCD, of which 25 were identified as putatively causal, highlighting WDR6, DALRD3, CTNND1 and genes in the MHC region. Tissue and single-cell enrichment analyses highlighted hippocampal and cortical excitatory neurons, along with D1- and D2-type dopamine receptor-containing medium spiny neurons, as playing a role in OCD risk. OCD displayed significant genetic correlations with 65 out of 112 examined phenotypes. Notably, it showed positive genetic correlations with all included psychiatric phenotypes, in particular anxiety, depression, anorexia nervosa, and Tourette syndrome, and negative correlations with a subset of the included autoimmune disorders, educational attainment, and body mass index. This study marks a significant step toward unraveling the genetic landscape of OCD and provides a foundation for future interventions to address this debilitating disorder.
  • Publication
    Inferring causal cell types of human diseases and risk variants from candidate regulatory elements [preprint]
    (2024-05-18) Kim, Artem; Zhang, Zixuan; Legros, Come; Lu, Zeyun; de Smith, Adam; Moore, Jill E; Mancuso, Nicholas; Gazal, Steven; Genomics and Computational Biology
    The heritability of human diseases is strongly enriched in candidate regulatory elements (cREs) from disease-relevant cell types. Critical next steps are to infer which and how many cell types are truly causal for a disease (after accounting for co-regulation across cell types), and to understand how individual variants impact disease risk through single or multiple causal cell types. Here, we propose CT-FM and CT-FM-SNP, two methods that leverage cell-type-specific cREs to fine-map causal cell types for a trait and for its candidate causal variants, respectively. We applied CT-FM to 63 GWAS summary statistics (average N = 417K) using nearly one thousand cRE annotations, primarily coming from ENCODE4. CT-FM inferred 81 causal cell types with corresponding SNP-annotations explaining a high fraction of trait SNP-heritability (~2/3 of the SNP-heritability explained by existing cREs), identified 16 traits with multiple causal cell types, highlighted cell-disease relationships consistent with known biology, and uncovered previously unexplored cellular mechanisms in psychiatric and immune-related diseases. Finally, we applied CT-FM-SNP to 39 UK Biobank traits and predicted high-confidence causal cell types for 2,798 candidate causal non-coding SNPs. Our results suggest that most SNPs impact a phenotype through a single cell type, and that pleiotropic SNPs target different cell types depending on the phenotype context. Altogether, CT-FM and CT-FM-SNP shed light on how genetic variants act collectively and individually at the cellular level to impact disease risk.
  • Publication
    A single cell atlas of the mouse seminal vesicle [preprint]
    (2024-04-11) Sun, Fengyun; Desevin, Kathleen; Fu, Yu; Parameswaran, Shanmathi; Mayall, Jemma; Rinaldi, Vera; Krietenstein, Nils; Manukyan, Artur; Yin, Qiangzong; Galan, Carolina; Yang, Chih-Hsiang; Shindyapina, Anastasia V; Gladyshev, Vadim N; Garber, Manuel; Schjenken, John E; Rando, Oliver J; Biochemistry and Molecular Biotechnology; Genomics and Computational Biology; Morningside Graduate School of Biomedical Sciences; Systems Biology
    During mammalian reproduction, sperm are delivered to the female reproductive tract bathed in a complex medium known as seminal fluid, which plays key roles in signaling to the female reproductive tract and in nourishing sperm for their onward journey. Along with minor contributions from the prostate and the epididymis, the majority of seminal fluid is produced by a somewhat understudied organ known as the seminal vesicle. Here, we report the first single-cell RNA-seq atlas of the mouse seminal vesicle, generated using tissues obtained from 23 mice of varying ages, exposed to a range of dietary challenges. We define the transcriptome of the secretory cells in this tissue, identifying a relatively homogeneous population of epithelial cells that are responsible for producing the majority of seminal fluid. We also define the immune cell populations (including large populations of macrophages, dendritic cells, T cells, and NKT cells), which have the potential to play roles in producing various immune mediators present in seminal plasma. Together, our data provide a resource for understanding the composition of an understudied reproductive tissue with potential implications for paternal control of offspring development and metabolism.
  • Publication
    Social determinants of health and disease in companion dogs: a cohort study from the Dog Aging Project
    (2023-05-13) McCoy, Brianah M; Brassington, Layla; Jin, Kelly; Dolby, Greer A; Shrager, Sandi; Collins, Devin; Dunbar, Matthew; Ruple, Audrey; Snyder-Mackler, Noah; Genomics and Computational Biology
    Exposure to social environmental adversity is associated with health and survival across many social species, including humans. However, little is known about how these health and mortality effects vary across the lifespan and may be differentially impacted by various components of the environment. Here, we leveraged a relatively new and powerful model for human aging, the companion dog, to investigate which components of the social environment are associated with dog health and how these associations vary across the lifespan. We drew on comprehensive survey data collected on 21,410 dogs from the Dog Aging Project and identified five factors that together explained 33.7% of the variation in a dog's social environment. Factors capturing financial and household adversity were associated with poorer health and lower physical mobility in companion dogs, while factors that captured social support, such as living with other dogs, were associated with better health when controlling for dog age and weight. Notably, the effects of each environmental component were not equal: the effect of social support was 5× stronger than that of financial factors. The strength of these associations depended on the age of the dog, including a stronger relationship between the owner's age and the dog's health in younger dogs than in older dogs. Taken together, these findings suggest the importance of income, stability, and owner's age for owner-reported health outcomes in companion dogs and point to potential behavioral and/or environmental modifiers that can be used to promote healthy aging across species.
  • Publication
    Genome-wide association study identifies new loci associated with OCD [preprint]
    (2024-03-08) Strom, Nora I; Halvorsen, Matthew W; Tian, Chao; Rück, Christian; Kvale, Gerd; Hansen, Bjarne; Bybjerg-Grauholm, Jonas; Grove, Jakob; Boberg, Julia; Nissen, Judith Becker; Damm Als, Thomas; Werge, Thomas; de Schipper, Elles; Fundin, Bengt; Hultman, Christina; Höffler, Kira D; Pedersen, Nancy; Sandin, Sven; Bulik, Cynthia; Landén, Mikael; Karlsson, Elinor K; Hagen, Kristen; Lindblad-Toh, Kerstin; Hougaard, David M; Meier, Sandra M; Hellard, Stéphanie Le; Mors, Ole; Børglum, Anders D; Haavik, Jan; Hinds, David A; Mataix-Cols, David; Crowley, James J; Mattheisen, Manuel; Genomics and Computational Biology
    To date, four genome-wide association studies (GWAS) of obsessive-compulsive disorder (OCD) have been published, reporting a high single-nucleotide polymorphism (SNP)-heritability of 28% but finding only one significant SNP. A substantial increase in sample size will likely lead to further identification of SNPs, genes, and biological pathways mediating the susceptibility to OCD. We conducted a GWAS meta-analysis with a 2-3-fold increase in case sample size (OCD cases: N = 37,015, controls: N = 948,616) compared to the last OCD GWAS, including six previously published cohorts (OCGAS, IOCDF-GC, IOCDF-GC-trio, NORDiC-nor, NORDiC-swe, and iPSYCH) and unpublished self-report data from 23andMe Inc. We explored the genetic architecture of OCD by conducting gene-based tests, tissue and cell-type enrichment analyses, and estimating heritability and genetic correlations with 74 phenotypes. To examine potential heterogeneity in our data, we conducted multivariable GWASs with MTAG. We found support for 15 independent genome-wide significant loci (14 new) and 79 protein-coding genes. Tissue enrichment analyses implicate multiple cortical regions, the amygdala, and hypothalamus, while cell-type analyses yielded 12 cell types linked to OCD (all neurons). The SNP-based heritability of OCD was estimated to be 0.08. Using MTAG, we found evidence for specific genetic underpinnings characteristic of different cohort-ascertainment and identified additional significant SNPs. OCD was genetically correlated with 40 disorders or traits: positively with all psychiatric disorders and negatively with BMI, age at first birth, and multiple autoimmune diseases. The GWAS meta-analysis identified several biologically informative genes as important contributors to the aetiology of OCD. Overall, we have begun laying the groundwork through which the biology of OCD will be understood and described.