Reduced representations for efficient analysis of genomic data: from microarray to high-throughput sequencing

Mahmud, Md Pavel

doi:10.7282/T3VM49ZV

RUcore: Rutgers University Community Repository

Simple citation

Mahmud, Md Pavel. Reduced representations for efficient analysis of genomic data. Retrieved from https://doi.org/doi:10.7282/T3VM49ZV

Description

Title: Reduced representations for efficient analysis of genomic data

Name: Mahmud, Md Pavel (author); Schliep, Alexander (chair); Chen, Kevin (internal member); Farach-Colton, Martin (internal member); Freudenberg, Jan (outside member); Rutgers University; Graduate School - New Brunswick

Date Created: 2014

Other Date: 2014-10 (degree)

Subject: Computer Science, Genomes--Analysis, Markov processes--Mathematical models, Bayesian statistical decision theory

Extent: 1 online resource (xi, 123 p. : ill.)

Description: Since the genomics era began in the 1970s, microarray technologies have been used extensively for biological applications such as gene expression profiling and the detection of copy number variation (CNV) or single nucleotide polymorphisms (SNPs). To analyze microarray data, numerous statistical and algorithmic techniques have been developed over the last two decades; in particular, Hidden Markov Models (HMMs) have been used successfully to detect CNV from array comparative genomic hybridization (arrayCGH) data. Still, for computational reasons, the benefits of Bayesian HMMs have been overlooked, and their use in practice has been minimal at best. The large demand for computational resources has also affected the analysis of high-throughput sequencing (HTS) data, which, over the last few years, has started to revolutionize the field of computational biology. For example, the most sensitive tools for mapping HTS data to reference genomes are generally ignored in favor of fast, less accurate ones. In this dissertation, we strive for reduced representations of biological data that enable efficient computations on large datasets. Since biological datasets often contain repetitive, sometimes redundant, elements, it is a natural idea to identify groups of similar elements and perform computations directly on these groups. Usually, the relevant type of similarity is specific to the type of data and the application at hand. Specifically, we make the following four contributions in this thesis. First, we show that, by exploiting repetition in discrete sequences, Markov Chain Monte Carlo (MCMC) simulations of Bayesian HMMs can be accelerated and then applied to the DNA segmentation problem [1]. Second, for Gaussian observations representing copy number ratio data, we show that, by grouping similar, contiguous observations into precomputed blocks, MCMC for Bayesian HMMs can be well approximated [2]. Third, by representing sequences as multi-dimensional vectors, we introduce a novel nearest-neighbor-based technique for mapping HTS data to a reference genome [3]. Finally, we present a highly efficient clustering approach for HTS data, which allows us to speed up computationally demanding, sensitive tools for mapping HTS data [4].
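To make the third contribution more concrete, the following is a minimal sketch of the general idea of turning sequences into vectors and mapping a read by nearest-neighbor search. It is not the dissertation's method: the k-mer count embedding, the Euclidean distance, the brute-force scan over reference windows, and the names `kmer_vector` and `nearest_window` are all assumptions chosen for illustration.

```python
from itertools import product
import math


def kmer_vector(seq, k=2):
    """Count every DNA k-mer in seq, reported in fixed lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers with non-ACGT characters such as N
            vec[j] += 1
    return vec


def nearest_window(read, reference, k=2):
    """Return (offset, distance) of the reference window closest to the read.

    Brute-force scan for illustration only; a practical mapper would use an
    index instead of comparing the read against every window.
    Returns None if the read is longer than the reference.
    """
    target = kmer_vector(read, k)
    best = None
    for off in range(len(reference) - len(read) + 1):
        window = kmer_vector(reference[off:off + len(read)], k)
        dist = math.dist(target, window)  # Euclidean distance between vectors
        if best is None or dist < best[1]:
            best = (off, dist)
    return best


if __name__ == "__main__":
    print(nearest_window("TTGAC", "ACGTACGTTTGACCA"))
```

Because two different sequences can share the same k-mer counts, a distance of zero in this sketch suggests a candidate location rather than proving an exact match; a real pipeline would verify candidates with an alignment step.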

Note: Ph.D.

Note: Includes bibliographical references

Note: by Md Pavel Mahmud

Genre: theses, ETD doctoral

Persistent URL: https://doi.org/doi:10.7282/T3VM49ZV

Language: eng

Collection: Graduate School - New Brunswick Electronic Theses and Dissertations

Organization Name: Rutgers, The State University of New Jersey

Rights: The author owns the copyright to this work.
