Description
TitleDynamically compressed Bayesian hidden Markov models using Haar wavelets
Date Created2018
Other Date2018-10 (degree)
Extent1 online resource (147 pages : illustrations)
DescriptionHidden Markov Models (HMM) are a powerful and ubiquitous tool for segmentation and labeling in bioinformatics and beyond. Classic techniques to infer suitable model parameters from data, such as the Baum-Welch algorithm, are not guaranteed to be globally optimal, and can converge to the wrong values, which can limit the accuracy of the segmentation. Furthermore, the Viterbi algorithm only yields a single segmentation given a parameter estimate, which precludes alternative interpretations of the data. In contrast, Bayesian inference techniques do not suffer from these drawbacks since they integrate over the entire parameter space, but have gained a reputation of being computationally expensive, especially in big data applications such as whole-genome sequencing. In this thesis, we provide a dynamic compression scheme based on Haar wavelet regression to accelarate Forward-Backward Gibbs sampling for the detection of copy-number variants (CNV) in genomic data.
Using a decision-theoretic framework, we show that for homoscedastic Gaussian emissions the discontinuities in a Haar wavelet regression based on the universal threshold capture strong changes in the posterior state distribution.
This allows for a compression of the data into blocks of sufficient statistics within which the data is likely to be emitted from the same latent state. We derive the bias incurred in the forward-variables by ignoring between-state transitions for general HMM, generalizing earlier results based on the weak path assumption. The tendency of this forward bias to favor the state with the highest marginal likelihood within each block leads us to conjecture that Forward-Backward Gibbs sampling would result in a maximum posterior margins (MPM) segmentation of the data.
For heteroscedastic Gaussian HMM, we develop a Forward-Backward Gibbs sampling scheme based on dynamic Haar wavelet compression, in which the universal threshold is derived from the smallest emission variance sampled in each iteration. We develop the wavelet tree as a data structure to facilitate this sampling, and use extensive simulation studies to demonstrate the performance of our method. We show that compression greatly improves convergence, in terms of the number of sampling iterations, towards the true latent state sequence compared to uncompressed sampling, thus providing experimental support for our MPM segmentation conjecture. This faster convergence is achieved while also improving overall running time by two orders of magnitude. We also demonstrate the performance on biological gold-standard data sets, improving upon the state of the art methods while running substantially faster.
We then show that the wavelet tree yields a suboptimal compression, and derive a new compression scheme. We replace the wavelet tree by more efficient and less memory-consuming data structures called breakpoint array and integral array to facilitate fast dynamic queries of the block sufficient statistics under arbitrary thresholds, without recomputing the Haar transform at every iteration. For an efficient constructor of these data structures, we develop a linear-time, in-place algorithm called maxlet transform to compute the maximum absolute detail coefficients, as well as an equivalent algorithm for multivariate data which incurs only logarithmic overhead in a streaming setting. We also develop a linear-time, in-place algorithm called the Haar boundary transform to compute the maximum of all detail coefficients affecting the discontinuities in the Haar wavelet regression at each position given arbitrary thresholds. Furthermore, we provide an efficient, queue-based data structure to store and update posterior state marginals during Gibbs sampling. We demonstrate that plausible CNV candidates can be found using large-scale whole-genome sequencing data, and derive new scientific hypotheses about the role of CNV in the domestication syndrome.
While we conjecture that HaMMLET achieves an approximation of MPM segmentation based on theoretical considerations and experimental results, the approximation of true posterior state margins remains an open challenge. Corrections for forward bias are not straightforward under constantly changing emission parameters during Gibbs sampling. Such a correction would also lead to the paradoxical situation that compression assumes exchangeability within each block, yet one of the fundamental properties of HMM is the ability to capture non-exchangeability of consecutive observations. We have ignored this question in the scope of this thesis, since real-life applications of CNV detection are often consistent with the weak path assumption.
NotePh.D.
NoteIncludes bibliographical references
Noteby John Wiedenhoeft
Genretheses, ETD doctoral
Languageeng
CollectionSchool of Graduate Studies Electronic Theses and Dissertations
Organization NameRutgers, The State University of New Jersey
RightsThe author owns the copyright to this work.