A computational pipeline to identify complex disease signals in whole exome sequencing data
Description
Title: A computational pipeline to identify complex disease signals in whole exome sequencing data
Date Created: 2020
Other Date: 2020-05 (degree)
Extent: 1 online resource (viii, 112 pages)
Description: Recent improvements in high-throughput sequencing technologies have led to a sharp decrease in the cost of sequencing and, thus, to the accumulation of large deposits of genome sequence data. Protein-coding DNA accounts for roughly 2% of the human genome and is often assessed using whole exome sequencing (WES). Genetic variants within coding regions may affect protein function and contribute to disease. Unlike Mendelian diseases, which are caused by alterations of a single gene, complex diseases are determined by multiple genetic and environmental factors. Traditional linkage or association analysis may thus fail to capture the biological pathways underlying disease pathogenesis. Here I hypothesize that function-driven, machine learning-based analysis of exome data can help elucidate the pathogenesis pathways underlying complex diseases.
I built a computational pipeline, AVA,Dx (Analysis of Variation for Association with Disease), a machine learning-based method for identifying disease signals. AVA,Dx also estimates predisposition to complex diseases, including Crohn’s disease (CD), venous thromboembolism (VTE), and Tourette’s disorder (TD). AVA,Dx was initially developed using exomes of CD vs. healthy control (HC) individuals and showed excellent performance in discriminating between the two classes in all training and testing sets. Genes identified by AVA,Dx were statistically significantly overrepresented in known CD pathways. Additionally, the method pinpointed several novel, i.e., previously unidentified, CD genes. I further evaluated unsupervised AVA,Dx using VTE exome data, attaining better risk estimates than genome-wide association study (GWAS)-based polygenic risk scoring. Lastly, I built a generalized AVA,Dx pipeline and tested it with the TD exome data, identifying several likely pathogenesis pathways. These pathways overlapped across two independent cohorts and indicated a shared genetic background for TD predisposition. Together, these results confirm my hypothesis, highlighting the relevance of genome-encoded functional changes to disease predisposition.
Note: Ph.D.
Note: Includes bibliographical references
Genre: theses, ETD doctoral
Language: English
Collection: School of Graduate Studies Electronic Theses and Dissertations
Organization Name: Rutgers, The State University of New Jersey
Rights: The author owns the copyright to this work.