Program in Bioinformatics and Integrative Biology Publications
ABOUT THIS COLLECTION
The Program in Bioinformatics and Integrative Biology (BIB) was established in 2008 to address one of the most dynamic and central areas in biomedical research—the ever-increasing quantity of molecular information available to scientists. Our mission is to develop and explore computational and quantitative approaches and tools to help the biomedical research community maximize their understanding of the growing volume and complexity of biomedical big data. This collection showcases journal articles and other publications produced by faculty and researchers of the Program in Bioinformatics and Integrative Biology.
Recently Published
-
Beyond genome-wide association studies: Investigating the role of noncoding regulatory elements in primary sclerosing cholangitis
Background: Genome-wide association studies (GWAS) have identified 30 risk loci for primary sclerosing cholangitis (PSC). Variants within these loci are found predominantly in noncoding regions of DNA making their mechanisms of conferring risk hard to define. Epigenomic studies have shown noncoding variants broadly impact regulatory element activity. The possible association of noncoding PSC variants with regulatory element activity has not been studied. We aimed to (1) determine if the noncoding risk variants in PSC impact regulatory element function and (2) if so, assess the role these regulatory elements have in explaining the genetic risk for PSC. Methods: Available epigenomic datasets were integrated to build a comprehensive atlas of cell type-specific regulatory elements, emphasizing PSC-relevant cell types. RNA-seq and ATAC-seq were performed on peripheral CD4+ T cells from 10 PSC patients and 11 healthy controls. Computational techniques were used to (1) study the enrichment of PSC-risk variants within regulatory elements, (2) correlate risk genotype with differences in regulatory element activity, and (3) identify regulatory elements differentially active and genes differentially expressed between PSC patients and controls. Results: Noncoding PSC-risk variants are strongly enriched within immune-specific enhancers, particularly ones involved in T-cell response to antigenic stimulation. In total, 250 genes and >10,000 regulatory elements were identified that are differentially active between patients and controls. Conclusions: Mechanistic effects are proposed for variants at 6 PSC-risk loci where genotype was linked with differential T-cell regulatory element activity. Regulatory elements are shown to play a key role in PSC pathophysiology.
-
Reliable multiplex generation of pooled induced pluripotent stem cells
Reprogramming somatic cells into pluripotent stem cells (iPSCs) enables the study of systems in vitro. To increase the throughput of reprogramming, we present induction of pluripotency from pooled cells (iPPC), an efficient, scalable, and reliable reprogramming procedure. Using our deconvolution algorithm that employs pooled sequencing of single-nucleotide polymorphisms (SNPs), we accurately estimated individual donor proportions of the pooled iPSCs. With iPPC, we concurrently reprogrammed over one hundred donor lymphoblastoid cell lines (LCLs) into iPSCs and found strong correlations of individual donors' reprogramming ability across multiple experiments. Individual donors' reprogramming ability remains consistent across both same-day replicates and multiple experimental runs, and the expression of certain immunoglobulin precursor genes may impact reprogramming ability. The pooled iPSCs were also able to differentiate into cerebral organoids. Our procedure enables a multiplex framework of using pooled libraries of donor iPSCs for downstream research and investigation of in vitro phenotypes.
-
Improving diagnosis of non-malarial fevers in Senegal: Borrelia and the contribution of tick-borne bacteria [preprint]
The worldwide decline in malaria incidence is revealing the extensive burden of non-malarial febrile illness (NMFI), which remains poorly understood and difficult to diagnose. To characterize NMFI in Senegal, we collected venous blood and clinical metadata from febrile patients and healthy controls in a low malaria burden area. Using 16S and unbiased sequencing, we detected viral, bacterial, or eukaryotic pathogens in 29% of NMFI cases. Bacteria were the most common, with relapsing fever Borrelia and spotted fever Rickettsia found in 15% and 3.7% of cases, respectively. Four viral pathogens were found in a total of 7 febrile cases (3.5%). Sequencing also detected undiagnosed Plasmodium, including one putative P. ovale infection. We developed a logistic regression model to distinguish Borrelia from NMFIs with similar presentation based on symptoms and vital signs. These results highlight the challenge and importance of improved diagnostics, especially for Borrelia, to support diagnosis and surveillance.
-
Using evolutionary constraint to define novel candidate driver genes in medulloblastoma
Current knowledge of cancer genomics remains biased against noncoding mutations. To systematically search for regulatory noncoding mutations, we assessed mutations in conserved positions in the genome under the assumption that these are more likely to be functional than mutations in positions with low conservation. To this end, we used whole-genome sequencing data from the International Cancer Genome Consortium and combined it with evolutionary constraint inferred from 240 mammals, to identify genes enriched in noncoding constraint mutations (NCCMs), mutations likely to be regulatory in nature. We compare medulloblastoma (MB), which is malignant, to pilocytic astrocytoma (PA), a primarily benign tumor, and find highly different NCCM frequencies between the two, in agreement with the fact that malignant cancers tend to have more mutations. In PA, a high NCCM frequency only affects the BRAF locus, which is the most commonly mutated gene in PA. In contrast, in MB, >500 genes have high levels of NCCMs. Intriguingly, several loci with NCCMs in MB are associated with different ages of onset, such as the HOXB cluster in young MB patients. In adult patients, NCCMs occurred in, e.g., the WASF-2/AHDC1/FGR locus. One of these NCCMs led to increased expression of the SRC kinase FGR and augmented responsiveness of MB cells to dasatinib, a SRC kinase inhibitor. Our analysis thus points to different molecular pathways in different patient groups. These newly identified putative candidate driver mutations may aid in patient stratification in MB and could be valuable for future selection of personalized treatment options.
-
Aub, Vasa and Armi localization to phase separated nuage is dispensable for piRNA biogenesis and transposon silencing in Drosophila [preprint]
From nematodes to placental mammals, key components of the germline transposon silencing piRNA pathway localize to phase separated perinuclear granules. In Drosophila, the PIWI protein Aub, DEAD box protein Vasa and helicase Armi localize to nuage granules and are required for ping-pong piRNA amplification and phased piRNA processing. Drosophila piRNA mutants lead to genome instability and Chk2 kinase DNA damage signaling. By systematically analyzing piRNA pathway organization, small RNA production, and long RNA expression in single piRNA mutants and corresponding chk2/mnk double mutants, we show that Chk2 activation disrupts nuage localization of Aub and Vasa, and that the HP1 homolog Rhino, which drives piRNA precursor transcription, is required for Aub, Vasa, and Armi localization to nuage. However, these studies also show that ping-pong amplification and phased piRNA biogenesis are independent of nuage localization of Vasa, Aub and Armi. Dispersed cytoplasmic proteins thus appear to mediate these essential piRNA pathway functions.
-
Knowledge, attitudes and practices regarding the use of mobile travel health apps
We performed a survey of U.S. international travellers to evaluate their knowledge, attitudes and practices regarding mobile technologies related to health. We found that many international travellers carry smartphones and are interested in receiving health information from a mobile app when they travel abroad.
-
Performance of Rapid Antigen Tests to Detect Symptomatic and Asymptomatic SARS-CoV-2 Infection: A Prospective Cohort Study
Background: The performance of rapid antigen tests (Ag-RDTs) for screening asymptomatic and symptomatic persons for SARS-CoV-2 is not well established. Objective: To evaluate the performance of Ag-RDTs for detection of SARS-CoV-2 among symptomatic and asymptomatic participants. Design: This prospective cohort study enrolled participants between October 2021 and January 2022. Participants completed Ag-RDTs and reverse transcriptase polymerase chain reaction (RT-PCR) testing for SARS-CoV-2 every 48 hours for 15 days. Setting: Participants were enrolled digitally throughout the mainland United States. They self-collected anterior nasal swabs for Ag-RDTs and RT-PCR testing. Nasal swabs for RT-PCR were shipped to a central laboratory, whereas Ag-RDTs were done at home. Participants: Of 7361 participants in the study, 5353 who were asymptomatic and negative for SARS-CoV-2 on study day 1 were eligible. In total, 154 participants had at least 1 positive RT-PCR result. Measurements: The sensitivity of Ag-RDTs was measured on the basis of testing once (same-day), twice (after 48 hours), and thrice (after a total of 96 hours). The analysis was repeated for different days past index PCR positivity (DPIPPs) to approximate real-world scenarios where testing initiation may not always coincide with DPIPP 0. Results were stratified by symptom status. Results: Among 154 participants who tested positive for SARS-CoV-2, 97 were asymptomatic and 57 had symptoms at infection onset. Serial testing with Ag-RDTs twice 48 hours apart resulted in an aggregated sensitivity of 93.4% (95% CI, 90.4% to 95.9%) among symptomatic participants on DPIPPs 0 to 6. When singleton positive results were excluded, the aggregated sensitivity on DPIPPs 0 to 6 for 2-time serial testing among asymptomatic participants was lower at 62.7% (CI, 57.0% to 70.5%), but it improved to 79.0% (CI, 70.1% to 87.4%) with testing 3 times at 48-hour intervals. Limitation: Participants tested every 48 hours; therefore, these data cannot support conclusions about serial testing intervals shorter than 48 hours. Conclusion: The performance of Ag-RDTs was optimized when asymptomatic participants tested 3 times at 48-hour intervals and when symptomatic participants tested 2 times separated by 48 hours. Primary funding source: National Institutes of Health RADx Tech program.
-
Modeling of mitochondrial genetic polymorphisms reveals induction of heteroplasmy by pleiotropic disease locus 10398A>G
Mitochondrial (MT) dysfunction has been associated with several neurodegenerative diseases including Alzheimer's disease (AD). While MT-copy number differences have been implicated in AD, the effect of MT heteroplasmy on AD has not been well characterized. Here, we analyzed over 1800 whole genome sequencing data from four AD cohorts in seven different tissue types to determine the extent of MT heteroplasmy present. While MT heteroplasmy was present throughout the entire MT genome for blood samples, we detected MT heteroplasmy only within the MT control region for brain samples. We observed that an MT variant 10398A>G (rs2853826) was significantly associated with overall MT heteroplasmy in brain tissue while also being linked with the largest number of distinct disease phenotypes of all annotated MT variants in MitoMap. Using gene-expression data from our brain samples, our modeling discovered several gene networks involved in mitochondrial respiratory chain and Complex I function associated with 10398A>G. The variant was also found to be an expression quantitative trait locus (eQTL) for the gene MT-ND3. We further characterized the effect of 10398A>G by phenotyping a population of lymphoblastoid cell-lines (LCLs) with and without the variant allele. Examination of RNA sequence data from these LCLs revealed that 10398A>G was an eQTL for MT-ND4. We also observed in LCLs that 10398A>G was significantly associated with overall MT heteroplasmy within the MT control region, confirming the initial findings observed in post-mortem brain tissue. These results provide novel evidence linking MT SNPs with MT heteroplasmy and open novel avenues for the investigation of pathomechanisms that are driven by this pleiotropic disease-associated locus.
-
FACS-Based Sequencing Approach to Evaluate Cell Type to Genotype Associations Using Cerebral Organoids
Recent technological developments have led to widespread applications of large-scale transcriptomics-based sequencing methods to identify genotype-to-cell type associations. Here we describe a fluorescence-activated cell sorting (FACS)-based sequencing method to utilize CRISPR/Cas9 edited mosaic cerebral organoids to identify or validate genotype-to-cell type associations. Our approach is high-throughput and quantitative and uses internal controls to enable comparisons of the results across different antibody markers and experiments.
-
Expression of ALS-PFN1 impairs vesicular degradation in iPSC-derived microglia [preprint]
Microglia play a pivotal role in neurodegenerative disease pathogenesis, but the mechanisms underlying microglia dysfunction and toxicity remain to be fully elucidated. To investigate the effect of neurodegenerative disease-linked genes on the intrinsic properties of microglia, we studied microglia-like cells derived from human induced pluripotent stem cells (iPSCs), termed iMGs, harboring mutations in profilin-1 (PFN1) that are causative for amyotrophic lateral sclerosis (ALS). ALS-PFN1 iMGs exhibited lipid dysmetabolism and deficits in phagocytosis, a critical microglia function. Our cumulative data implicate an effect of ALS-linked PFN1 on the autophagy pathway, including enhanced binding of mutant PFN1 to the autophagy signaling molecule PI3P, as an underlying cause of defective phagocytosis in ALS-PFN1 iMGs. Indeed, phagocytic processing was restored in ALS-PFN1 iMGs with Rapamycin, an inducer of autophagic flux. These outcomes demonstrate the utility of iMGs for neurodegenerative disease research and highlight microglia vesicular degradation pathways as potential therapeutic targets for these disorders.
-
The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity [preprint]
The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multitranscript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
-
Up-regulation of cholesterol synthesis pathways and limited neurodegeneration in a knock-in mutant mouse model of ALS [preprint]
Amyotrophic lateral sclerosis (ALS) is a severe neurodegenerative disorder affecting brain and spinal cord motor neurons. Mutations in the copper/zinc superoxide dismutase gene (SOD1) are associated with ∼20% of inherited and 1-2% of sporadic ALS cases. Much has been learned from mice expressing transgenic copies of mutant SOD1, which typically involve high-level transgene expression, thereby differing from ALS patients expressing one mutant gene copy. To generate a model that more closely represents patient gene expression, we created a knock-in point mutation (G85R, a human ALS-causing mutation) in the endogenous mouse Sod1 gene, leading to mutant SOD1 G85R protein expression. Heterozygous Sod1 G85R mutant mice resemble wild type, whereas homozygous mutants have reduced body weight and lifespan, a mild neurodegenerative phenotype, and express very low mutant SOD1 protein levels with no detectable SOD1 activity. Homozygous mutants exhibit partial neuromuscular junction denervation at 3-4 months of age. Spinal cord motor neuron transcriptome analyses of homozygous Sod1 G85R mice revealed up-regulation of cholesterol synthesis pathway genes compared to wild type. Transcriptome and phenotypic features of these mice are similar to Sod1 knock-out mice, suggesting the Sod1 G85R phenotype is largely driven by loss of SOD1 function. By contrast, cholesterol synthesis genes are down-regulated in severely affected human TgSOD1 G93A transgenic mice at 4 months. Our analyses implicate dysregulation of cholesterol or related lipid pathway genes in ALS pathogenesis. The Sod1 G85R knock-in mouse is a useful ALS model to examine the importance of SOD1 activity in control of cholesterol homeostasis and motor neuron survival.
-
The functional and evolutionary impacts of human-specific deletions in conserved elements
Conserved genomic sequences disrupted in humans may underlie uniquely human phenotypic traits. We identified and characterized 10,032 human-specific conserved deletions (hCONDELs). These short (average 2.56 base pairs) deletions are enriched for human brain functions across genetic, epigenomic, and transcriptomic datasets. Using massively parallel reporter assays in six cell types, we discovered 800 hCONDELs conferring significant differences in regulatory activity, half of which enhance rather than disrupt regulatory function. We highlight several hCONDELs with putative human-specific effects on brain development, including HDAC5, CPEB4, and PPP2CA. Reverting an hCONDEL to the ancestral sequence alters the expression of LOXL2 and developmental genes involved in myelination and synaptic function. Our data provide a rich resource to investigate the evolutionary mechanisms driving new traits in humans and other species.
-
Mammalian evolution of human cis-regulatory elements and transcription factor binding sites
Understanding the regulatory landscape of the human genome is a long-standing objective of modern biology. Using the reference-free alignment across 241 mammalian genomes produced by the Zoonomia Consortium, we charted evolutionary trajectories for 0.92 million human candidate cis-regulatory elements (cCREs) and 15.6 million human transcription factor binding sites (TFBSs). We identified 439,461 cCREs and 2,024,062 TFBSs under evolutionary constraint. Genes near constrained elements perform fundamental cellular processes, whereas genes near primate-specific elements are involved in environmental interaction, including odor perception and immune response. About 20% of TFBSs are transposable element-derived and exhibit intricate patterns of gains and losses during primate evolution, whereas sequence variants associated with complex traits are enriched in constrained TFBSs. Our annotations illuminate the regulatory functions of the human genome.
-
Comparative genomics of Balto, a famous historic dog, captures lost diversity of 1920s sled dogs
We reconstruct the phenotype of Balto, the heroic sled dog renowned for transporting diphtheria antitoxin to Nome, Alaska, in 1925, using evolutionary constraint estimates from the Zoonomia alignment of 240 mammals and 682 genomes from dogs and wolves of the 21st century. Balto shares just part of his diverse ancestry with the eponymous Siberian husky breed. Balto's genotype predicts a combination of coat features atypical for modern sled dog breeds, and a slightly smaller stature. He had enhanced starch digestion compared with Greenland sled dogs and a compendium of derived homozygous coding variants at constrained positions in genes connected to bone and skin development. We propose that Balto's population of origin, which was less inbred and genetically healthier than that of modern breeds, was adapted to the extreme environment of 1920s Alaska.
-
Evolutionary constraint and innovation across hundreds of placental mammals
Zoonomia is the largest comparative genomics resource for mammals produced to date. By aligning genomes for 240 species, we identify bases that, when mutated, are likely to affect fitness and alter disease risk. At least 332 million bases (~10.7%) in the human genome are unusually conserved across species (evolutionarily constrained) relative to neutrally evolving repeats, and 4552 ultraconserved elements are nearly perfectly conserved. Of 101 million significantly constrained single bases, 80% are outside protein-coding exons and half have no functional annotations in the Encyclopedia of DNA Elements (ENCODE) resource. Changes in genes and regulatory elements are associated with exceptional mammalian traits, such as hibernation, that could inform therapeutic development. Earth's vast and imperiled biodiversity offers distinctive power for identifying genetic variants that affect genome function and organismal phenotypes.
-
Leveraging base-pair mammalian constraint to understand genetic variation and human disease
Thousands of genomic regions have been associated with heritable human diseases, but attempts to elucidate biological mechanisms are impeded by an inability to discern which genomic positions are functionally important. Evolutionary constraint is a powerful predictor of function, agnostic to cell type or disease mechanism. Single-base phyloP scores from 240 mammals identified 3.3% of the human genome as significantly constrained and likely functional. We compared phyloP scores to genome annotation, association studies, copy-number variation, clinical genetics findings, and cancer data. Constrained positions are enriched for variants that explain common disease heritability more than other functional annotations. Our results improve variant annotation but also highlight that the regulatory landscape of the human genome still needs to be further explored and linked to disease.
-
Three-dimensional genome rewiring in loci with human accelerated regions
Human accelerated regions (HARs) are conserved genomic loci that evolved at an accelerated rate in the human lineage and may underlie human-specific traits. We generated HARs and chimpanzee accelerated regions with an automated pipeline and an alignment of 241 mammalian genomes. Combining deep learning with chromatin capture experiments in human and chimpanzee neural progenitor cells, we discovered a significant enrichment of HARs in topologically associating domains containing human-specific genomic variants that change three-dimensional (3D) genome organization. Differential gene expression between humans and chimpanzees at these loci suggests rewiring of regulatory interactions between HARs and neurodevelopmental genes. Thus, comparative genomics together with models of 3D genome folding revealed enhancer hijacking as an explanation for the rapid evolution of HARs.
-
Insights into mammalian TE diversity through the curation of 248 genome assemblies
We examined transposable element (TE) content of 248 placental mammal genome assemblies, the largest de novo TE curation effort in eukaryotes to date. We found that although mammals resemble one another in total TE content and diversity, they show substantial differences with regard to recent TE accumulation. This includes multiple recent expansion and quiescence events across the mammalian tree. Young TEs, particularly long interspersed elements, drive increases in genome size, whereas DNA transposons are associated with smaller genomes. Mammals tend to accumulate only a few types of TEs at any given time, with one TE type dominating. We also found an association between dietary habit and the presence of DNA transposon invasions. These detailed annotations will serve as a benchmark for future comparative TE analyses among placental mammals.
-
The contribution of historical processes to contemporary extinction risk in placental mammals
Species persistence can be influenced by the amount, type, and distribution of diversity across the genome, suggesting a potential relationship between historical demography and resilience. In this study, we surveyed genetic variation across single genomes of 240 mammals that compose the Zoonomia alignment to evaluate how historical effective population size (Ne) affects heterozygosity and deleterious genetic load and how these factors may contribute to extinction risk. We find that species with smaller historical Ne carry a proportionally larger burden of deleterious alleles owing to long-term accumulation and fixation of genetic load and have a higher risk of extinction. This suggests that historical demography can inform contemporary resilience. Models that included genomic data were predictive of species' conservation status, suggesting that, in the absence of adequate census or ecological data, genomic information may provide an initial risk assessment.