Defining a Registry of Candidate Regulatory Elements to Interpret Disease Associated Genetic Variation
AuthorsMoore, Jill E.
Faculty AdvisorZhiping Weng, PhD
Academic ProgramBioinformatics and Computational Biology
UMass Chan AffiliationsProgram in Bioinformatics and Integrative Biology
Document TypeDoctoral Dissertation
major depressive disorder
MetadataShow full item record
AbstractOver the last decade there has been a great effort to annotate noncoding regions of the genome, particularly those that regulate gene expression. These regulatory elements contain binding sites for transcription factors (TF), which interact with one another and transcriptional machinery to initiate, enhance, or repress gene expression. The Encyclopedia of DNA Elements (ENCODE) consortium has generated thousands of epigenomic datasets, such as DNase-seq and ChIP-seq experiments, with the goal of defining such regions. By integrating these assays, we developed the Registry of candidate Regulatory Elements (cREs), a collection of putative regulatory regions across human and mouse. In total, we identified over 1.3M human and 400k mouse cREs each annotated with cell-type specific signatures (e.g. promoter-like, enhancer-like) in over 400 human and 100 mouse biosamples. We then demonstrated the biological utility of these regions by analyzing cell type enrichments for genetic variants reported by genome wide association studies (GWAS). To search and visualize these cREs, we developed the online database SCREEN (search candidate regulatory elements by ENCODE). After defining cREs, we next sought to determine their potential gene targets. To compare target gene prediction methods, we developed a comprehensive benchmark of enhancer-gene links by curating ChIA-PET, Hi-C and eQTL datasets. We then used this benchmark to evaluate unsupervised linking approaches such as the correlation of epigenomic signal. We determined that these methods have low overall performance and do not outperform simply selecting the closest gene. We then developed a supervised Random Forest model which had notably better performance than unsupervised methods. We demonstrated that this model can be applied across cell types and can be used to predict target genes for GWAS associated variants. Finally, we used the registry of cREs to annotate variants associated with psychiatric disorders. We found that these "psych SNPs" are enriched in cREs active in brain tissue and likely target genes involved in neural development pathways. We also demonstrated that psych SNPs overlap binding sites for TFs involved in neural and immune pathways. Finally, by identifying psych SNPs with allele imbalance in chromatin accessibility, we highlighted specific cases of psych SNPs altering TF binding motifs resulting in the disruption of TF binding. Overall, we demonstrated our collection of putative regulatory regions, the Registry of cREs, can be used to understand the potential biological function of noncoding variation and develop hypotheses for future testing.
Permanent Link to this Itemhttp://hdl.handle.net/20.500.14038/32309
RightsLicensed under a Creative Commons license
Except where otherwise noted, this item's license is described as Licensed under a Creative Commons license