Staff View
Dynamically compressed Bayesian hidden Markov models using Haar wavelets

Descriptive

TitleInfo
Title
Dynamically compressed Bayesian hidden Markov models using Haar wavelets
Name (type = personal)
NamePart (type = family)
Wiedenhoeft
NamePart (type = given)
John
DisplayForm
John Wiedenhoeft
Role
RoleTerm (authority = RULIB)
author
Name (type = personal)
NamePart (type = family)
Schliep
NamePart (type = given)
Alexander
DisplayForm
Alexander Schliep
Affiliation
Advisory Committee
Role
RoleTerm (authority = RULIB)
chair
Name (type = personal)
NamePart (type = family)
Farach-Colton
NamePart (type = given)
Martin
DisplayForm
Martin Farach-Colton
Affiliation
Advisory Committee
Role
RoleTerm (authority = RULIB)
co-chair
Name (type = personal)
NamePart (type = family)
Borgida
NamePart (type = given)
Alexander
DisplayForm
Alexander Borgida
Affiliation
Advisory Committee
Role
RoleTerm (authority = RULIB)
internal member
Name (type = personal)
NamePart (type = family)
Krogh
NamePart (type = given)
Anders
DisplayForm
Anders Krogh
Affiliation
Advisory Committee
Role
RoleTerm (authority = RULIB)
outside member
Name (type = corporate)
NamePart
Rutgers University
Role
RoleTerm (authority = RULIB)
degree grantor
Name (type = corporate)
NamePart
School of Graduate Studies
Role
RoleTerm (authority = RULIB)
school
TypeOfResource
Text
Genre (authority = marcgt)
theses
OriginInfo
DateCreated (qualifier = exact)
2018
DateOther (qualifier = exact); (type = degree)
2018-10
CopyrightDate (encoding = w3cdtf)
2018
Place
PlaceTerm (type = code)
xx
Language
LanguageTerm (authority = ISO639-2b); (type = code)
eng
Abstract (type = abstract)
Hidden Markov Models (HMM) are a powerful and ubiquitous tool for segmentation and labeling in bioinformatics and beyond. Classic techniques to infer suitable model parameters from data, such as the Baum-Welch algorithm, are not guaranteed to be globally optimal, and can converge to the wrong values, which can limit the accuracy of the segmentation. Furthermore, the Viterbi algorithm only yields a single segmentation given a parameter estimate, which precludes alternative interpretations of the data. In contrast, Bayesian inference techniques do not suffer from these drawbacks since they integrate over the entire parameter space, but have gained a reputation of being computationally expensive, especially in big data applications such as whole-genome sequencing. In this thesis, we provide a dynamic compression scheme based on Haar wavelet regression to accelarate Forward-Backward Gibbs sampling for the detection of copy-number variants (CNV) in genomic data.

Using a decision-theoretic framework, we show that for homoscedastic Gaussian emissions the discontinuities in a Haar wavelet regression based on the universal threshold capture strong changes in the posterior state distribution.
This allows for a compression of the data into blocks of sufficient statistics within which the data is likely to be emitted from the same latent state. We derive the bias incurred in the forward-variables by ignoring between-state transitions for general HMM, generalizing earlier results based on the weak path assumption. The tendency of this forward bias to favor the state with the highest marginal likelihood within each block leads us to conjecture that Forward-Backward Gibbs sampling would result in a maximum posterior margins (MPM) segmentation of the data.

For heteroscedastic Gaussian HMM, we develop a Forward-Backward Gibbs sampling scheme based on dynamic Haar wavelet compression, in which the universal threshold is derived from the smallest emission variance sampled in each iteration. We develop the wavelet tree as a data structure to facilitate this sampling, and use extensive simulation studies to demonstrate the performance of our method. We show that compression greatly improves convergence, in terms of the number of sampling iterations, towards the true latent state sequence compared to uncompressed sampling, thus providing experimental support for our MPM segmentation conjecture. This faster convergence is achieved while also improving overall running time by two orders of magnitude. We also demonstrate the performance on biological gold-standard data sets, improving upon the state of the art methods while running substantially faster.

We then show that the wavelet tree yields a suboptimal compression, and derive a new compression scheme. We replace the wavelet tree by more efficient and less memory-consuming data structures called breakpoint array and integral array to facilitate fast dynamic queries of the block sufficient statistics under arbitrary thresholds, without recomputing the Haar transform at every iteration. For an efficient constructor of these data structures, we develop a linear-time, in-place algorithm called maxlet transform to compute the maximum absolute detail coefficients, as well as an equivalent algorithm for multivariate data which incurs only logarithmic overhead in a streaming setting. We also develop a linear-time, in-place algorithm called the Haar boundary transform to compute the maximum of all detail coefficients affecting the discontinuities in the Haar wavelet regression at each position given arbitrary thresholds. Furthermore, we provide an efficient, queue-based data structure to store and update posterior state marginals during Gibbs sampling. We demonstrate that plausible CNV candidates can be found using large-scale whole-genome sequencing data, and derive new scientific hypotheses about the role of CNV in the domestication syndrome.

While we conjecture that HaMMLET achieves an approximation of MPM segmentation based on theoretical considerations and experimental results, the approximation of true posterior state margins remains an open challenge. Corrections for forward bias are not straightforward under constantly changing emission parameters during Gibbs sampling. Such a correction would also lead to the paradoxical situation that compression assumes exchangeability within each block, yet one of the fundamental properties of HMM is the ability to capture non-exchangeability of consecutive observations. We have ignored this question in the scope of this thesis, since real-life applications of CNV detection are often consistent with the weak path assumption.
Subject (authority = RUETD)
Topic
Computer Science
Subject (authority = ETD-LCSH)
Topic
Bayesian field theory
Subject (authority = ETD-LCSH)
Topic
Hidden Markov models
RelatedItem (type = host)
TitleInfo
Title
Rutgers University Electronic Theses and Dissertations
Identifier (type = RULIB)
ETD
Identifier
ETD_9218
PhysicalDescription
Form (authority = gmd)
electronic resource
InternetMediaType
application/pdf
InternetMediaType
text/xml
Extent
1 online resource (147 pages : illustrations)
Note (type = degree)
Ph.D.
Note (type = bibliography)
Includes bibliographical references
Note (type = statement of responsibility)
by John Wiedenhoeft
RelatedItem (type = host)
TitleInfo
Title
School of Graduate Studies Electronic Theses and Dissertations
Identifier (type = local)
rucore10001600001
Location
PhysicalLocation (authority = marcorg); (displayLabel = Rutgers, The State University of New Jersey)
NjNbRU
Identifier (type = doi)
doi:10.7282/t3-4e1k-ph18
Genre (authority = ExL-Esploro)
ETD doctoral
Back to the top

Rights

RightsDeclaration (ID = rulibRdec0006)
The author owns the copyright to this work.
RightsHolder (type = personal)
Name
FamilyName
Wiedenhoeft
GivenName
John
Role
Copyright Holder
RightsEvent
Type
Permission or license
DateTime (encoding = w3cdtf); (qualifier = exact); (point = start)
2018-09-21 10:30:53
AssociatedEntity
Name
John Wiedenhoeft
Role
Copyright holder
Affiliation
Rutgers University. School of Graduate Studies
AssociatedObject
Type
License
Name
Author Agreement License
Detail
I hereby grant to the Rutgers University Libraries and to my school the non-exclusive right to archive, reproduce and distribute my thesis or dissertation, in whole or in part, and/or my abstract, in whole or in part, in and from an electronic format, subject to the release date subsequently stipulated in this submittal form and approved by my school. I represent and stipulate that the thesis or dissertation and its abstract are my original work, that they do not infringe or violate any rights of others, and that I make these grants as the sole owner of the rights to my thesis or dissertation and its abstract. I represent that I have obtained written permissions, when necessary, from the owner(s) of each third party copyrighted matter to be included in my thesis or dissertation and will supply copies of such upon request by my school. I acknowledge that RU ETD and my school will not distribute my thesis or dissertation or its abstract if, in their reasonable judgment, they believe all such rights have not been secured. I acknowledge that I retain ownership rights to the copyright of my work. I also retain the right to use all or part of this thesis or dissertation in future works, such as articles or books.
Copyright
Status
Copyright protected
Availability
Status
Open
Reason
Permission or license
Back to the top

Technical

RULTechMD (ID = TECHNICAL1)
ContentModel
ETD
OperatingSystem (VERSION = 5.1)
windows xp
CreatingApplication
Version
1.5
ApplicationName
pdfTeX-1.40.18
DateCreated (point = end); (encoding = w3cdtf); (qualifier = exact)
2018-10-01T19:51:36
DateCreated (point = end); (encoding = w3cdtf); (qualifier = exact)
2018-10-01T19:51:36
Back to the top
Version 8.5.5
Rutgers University Libraries - Copyright ©2024