• Analysis, Visualization, and Machine Learning of Epigenomic Data

      Purcaro, Michael J. (2017-12-12)
      The goal of the Encyclopedia of DNA Elements (ENCODE) project has been to characterize all the functional elements of the human genome. These elements include expressed transcripts and genomic regions that are bound by transcription factors (TFs), occupied by nucleosomes, occupied by nucleosomes with modified histones, or hypersensitive to DNase I cleavage. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an experimental technique for detecting TF binding in living cells, and the genomic regions bound by TFs are called ChIP-seq peaks. ENCODE has performed and compiled results from tens of thousands of experiments, including ChIP-seq, DNase-seq, RNA-seq and Hi-C. These efforts have culminated in two web-based resources from our lab, Factorbook and SCREEN, for the exploration of epigenomic data for both human and mouse. Factorbook is a peak-centric resource presenting data such as motif enrichment and histone modification profiles for transcription factor binding sites computed from ENCODE ChIP-seq data. SCREEN provides an encyclopedia of ~2 million regulatory elements, including promoters and enhancers, identified using ENCODE ChIP-seq and DNase data, with an extensive UI for searching and visualization. While we have successfully used the thousands of available ENCODE ChIP-seq experiments to build the Encyclopedia and visualizers, we have also been limited by the practical and theoretical impossibility of performing every possible assay on every possible biosample under every conceivable biological scenario. We have used machine learning techniques to predict TF binding sites and enhancer locations, and we demonstrate that machine learning is critical to deciphering functional regions of the genome.
    • Defining a Registry of Candidate Regulatory Elements to Interpret Disease Associated Genetic Variation

      Moore, Jill E. (2017-10-10)
      Over the last decade there has been a great effort to annotate noncoding regions of the genome, particularly those that regulate gene expression. These regulatory elements contain binding sites for transcription factors (TFs), which interact with one another and with the transcriptional machinery to initiate, enhance, or repress gene expression. The Encyclopedia of DNA Elements (ENCODE) consortium has generated thousands of epigenomic datasets, such as DNase-seq and ChIP-seq experiments, with the goal of defining such regions. By integrating these assays, we developed the Registry of candidate Regulatory Elements (cREs), a collection of putative regulatory regions across human and mouse. In total, we identified over 1.3M human and 400k mouse cREs, each annotated with cell-type-specific signatures (e.g. promoter-like, enhancer-like) in over 400 human and 100 mouse biosamples. We then demonstrated the biological utility of these regions by analyzing cell type enrichments for genetic variants reported by genome-wide association studies (GWAS). To search and visualize these cREs, we developed the online database SCREEN (search candidate regulatory elements by ENCODE). After defining cREs, we next sought to determine their potential gene targets. To compare target gene prediction methods, we developed a comprehensive benchmark of enhancer-gene links by curating ChIA-PET, Hi-C and eQTL datasets. We then used this benchmark to evaluate unsupervised linking approaches such as the correlation of epigenomic signal. We determined that these methods have low overall performance and do not outperform simply selecting the closest gene. We then developed a supervised Random Forest model, which performed notably better than the unsupervised methods. We demonstrated that this model can be applied across cell types and can be used to predict target genes for GWAS-associated variants. Finally, we used the Registry of cREs to annotate variants associated with psychiatric disorders. We found that these "psych SNPs" are enriched in cREs active in brain tissue and likely target genes involved in neural development pathways. We also demonstrated that psych SNPs overlap binding sites for TFs involved in neural and immune pathways. In addition, by identifying psych SNPs with allelic imbalance in chromatin accessibility, we highlighted specific cases in which psych SNPs alter TF binding motifs and thereby disrupt TF binding. Overall, we demonstrated that our collection of putative regulatory regions, the Registry of cREs, can be used to understand the potential biological function of noncoding variation and to develop hypotheses for future testing.
    • Epigenetic-genetic chromatin footprinting identifies novel and subject-specific genes active in prefrontal cortex neurons

      Gusev, Fedor E.; Grigorenko, Anastasia P.; Filippova, Elena; Weng, Zhiping; Akbarian, Schahram; Rogaev, Evgeny I. (2019-04-10)
      The human prefrontal cortex (PFC) is associated with broad individual variability in functions linked to personality, social behavior, and cognition. Phenotypic variability associated with brain function can be caused by genetic or epigenetic factors. The interactions between these factors in human subjects are, as yet, poorly understood. The heterogeneity of cerebral tissue, which consists of neuronal and nonneuronal cells, complicates the comparative analysis of gene activity in brain specimens. To approach the underlying neurogenomic determinants, we performed a deep analysis of open chromatin-associated histone methylation in PFC neurons sorted from multiple human individuals in conjunction with whole-genome and transcriptome sequencing. Integrative analyses produced novel unannotated neuronal genes and revealed individual-specific chromatin "blueprints" of neurons that, in part, relate to genetic background. Surprisingly, we observed gender-dependent epigenetic signals, implying that gender may contribute to chromatin variability in neurons. Finally, we found epigenetic, allele-specific activation of the testis-specific gene nucleoporin 210 like (NUP210L) in the brain of some individuals, which we link to a genetic variant occurring in <3% of the human population. Recently, the NUP210L locus has been associated with intelligence and mathematics ability. Our findings highlight the significance of epigenetic-genetic footprinting for exploring neurologic function in a subject-specific manner.