Dynamically compressed Bayesian hidden Markov models using Haar wavelets

RUcore: Rutgers University Community Repository

Search
- All
- Text
- Images
- Audio
- Video
Advanced Search | Help

Search all content in all RUcore collections.
Services
Collections

Help Contact Us My Account

Home

Resource

Staff View

Dynamically compressed Bayesian hidden Markov models using Haar wavelets

Descriptive Metadata

Rights Metadata

Technical Metadata

Descriptive

TitleInfo

Title

Dynamically compressed Bayesian hidden Markov models using Haar wavelets

Name (type = personal)

NamePart (type = family)

Wiedenhoeft

NamePart (type = given)

John

DisplayForm

John Wiedenhoeft

Role

RoleTerm (authority = RULIB)

author

Name (type = personal)

NamePart (type = family)

Schliep

NamePart (type = given)

Alexander

DisplayForm

Alexander Schliep

Affiliation

Advisory Committee

Role

RoleTerm (authority = RULIB)

chair

Name (type = personal)

NamePart (type = family)

Farach-Colton

NamePart (type = given)

Martin

DisplayForm

Martin Farach-Colton

Affiliation

Advisory Committee

Role

RoleTerm (authority = RULIB)

co-chair

Name (type = personal)

NamePart (type = family)

Borgida

NamePart (type = given)

Alexander

DisplayForm

Alexander Borgida

Affiliation

Advisory Committee

Role

RoleTerm (authority = RULIB)

internal member

Name (type = personal)

NamePart (type = family)

Krogh

NamePart (type = given)

Anders

DisplayForm

Anders Krogh

Affiliation

Advisory Committee

Role

RoleTerm (authority = RULIB)

outside member

Name (type = corporate)

NamePart

Rutgers University

Role

RoleTerm (authority = RULIB)

degree grantor

Name (type = corporate)

NamePart

School of Graduate Studies

Role

RoleTerm (authority = RULIB)

school

TypeOfResource

Text

Genre (authority = marcgt)

theses

OriginInfo

DateCreated (qualifier = exact)

2018

DateOther (qualifier = exact); (type = degree)

2018-10

CopyrightDate (encoding = w3cdtf)

2018

Place

PlaceTerm (type = code)

Language

LanguageTerm (authority = ISO639-2b); (type = code)

eng

Abstract (type = abstract)

Hidden Markov Models (HMM) are a powerful and ubiquitous tool for segmentation and labeling in bioinformatics and beyond. Classic techniques to infer suitable model parameters from data, such as the Baum-Welch algorithm, are not guaranteed to be globally optimal, and can converge to the wrong values, which can limit the accuracy of the segmentation. Furthermore, the Viterbi algorithm only yields a single segmentation given a parameter estimate, which precludes alternative interpretations of the data. In contrast, Bayesian inference techniques do not suffer from these drawbacks since they integrate over the entire parameter space, but have gained a reputation of being computationally expensive, especially in big data applications such as whole-genome sequencing. In this thesis, we provide a dynamic compression scheme based on Haar wavelet regression to accelarate Forward-Backward Gibbs sampling for the detection of copy-number variants (CNV) in genomic data.

Using a decision-theoretic framework, we show that for homoscedastic Gaussian emissions the discontinuities in a Haar wavelet regression based on the universal threshold capture strong changes in the posterior state distribution.
This allows for a compression of the data into blocks of sufficient statistics within which the data is likely to be emitted from the same latent state. We derive the bias incurred in the forward-variables by ignoring between-state transitions for general HMM, generalizing earlier results based on the weak path assumption. The tendency of this forward bias to favor the state with the highest marginal likelihood within each block leads us to conjecture that Forward-Backward Gibbs sampling would result in a maximum posterior margins (MPM) segmentation of the data.

For heteroscedastic Gaussian HMM, we develop a Forward-Backward Gibbs sampling scheme based on dynamic Haar wavelet compression, in which the universal threshold is derived from the smallest emission variance sampled in each iteration. We develop the wavelet tree as a data structure to facilitate this sampling, and use extensive simulation studies to demonstrate the performance of our method. We show that compression greatly improves convergence, in terms of the number of sampling iterations, towards the true latent state sequence compared to uncompressed sampling, thus providing experimental support for our MPM segmentation conjecture. This faster convergence is achieved while also improving overall running time by two orders of magnitude. We also demonstrate the performance on biological gold-standard data sets, improving upon the state of the art methods while running substantially faster.

We then show that the wavelet tree yields a suboptimal compression, and derive a new compression scheme. We replace the wavelet tree by more efficient and less memory-consuming data structures called breakpoint array and integral array to facilitate fast dynamic queries of the block sufficient statistics under arbitrary thresholds, without recomputing the Haar transform at every iteration. For an efficient constructor of these data structures, we develop a linear-time, in-place algorithm called maxlet transform to compute the maximum absolute detail coefficients, as well as an equivalent algorithm for multivariate data which incurs only logarithmic overhead in a streaming setting. We also develop a linear-time, in-place algorithm called the Haar boundary transform to compute the maximum of all detail coefficients affecting the discontinuities in the Haar wavelet regression at each position given arbitrary thresholds. Furthermore, we provide an efficient, queue-based data structure to store and update posterior state marginals during Gibbs sampling. We demonstrate that plausible CNV candidates can be found using large-scale whole-genome sequencing data, and derive new scientific hypotheses about the role of CNV in the domestication syndrome.

While we conjecture that HaMMLET achieves an approximation of MPM segmentation based on theoretical considerations and experimental results, the approximation of true posterior state margins remains an open challenge. Corrections for forward bias are not straightforward under constantly changing emission parameters during Gibbs sampling. Such a correction would also lead to the paradoxical situation that compression assumes exchangeability within each block, yet one of the fundamental properties of HMM is the ability to capture non-exchangeability of consecutive observations. We have ignored this question in the scope of this thesis, since real-life applications of CNV detection are often consistent with the weak path assumption.

Subject (authority = RUETD)

Topic

Computer Science

Subject (authority = ETD-LCSH)

Topic

Bayesian field theory

Subject (authority = ETD-LCSH)

Topic

Hidden Markov models

RelatedItem (type = host)

TitleInfo

Title

Rutgers University Electronic Theses and Dissertations

Identifier (type = RULIB)

ETD

Identifier

ETD_9218

PhysicalDescription

Form (authority = gmd)

electronic resource

InternetMediaType

application/pdf

InternetMediaType

text/xml

Extent

1 online resource (147 pages : illustrations)

Note (type = degree)

Ph.D.

Note (type = bibliography)

Includes bibliographical references

Note (type = statement of responsibility)

by John Wiedenhoeft

RelatedItem (type = host)

TitleInfo

Title

School of Graduate Studies Electronic Theses and Dissertations

Identifier (type = local)

rucore10001600001

Location

PhysicalLocation (authority = marcorg); (displayLabel = Rutgers, The State University of New Jersey)

NjNbRU

Identifier (type = doi)

doi:10.7282/t3-4e1k-ph18

Genre (authority = ExL-Esploro)

ETD doctoral

Back to the top

Rights

RightsDeclaration (ID = rulibRdec0006)

The author owns the copyright to this work.

RightsHolder (type = personal)

Name

FamilyName

Wiedenhoeft

GivenName

John

Role

RightsEvent

Type

Permission or license

DateTime (encoding = w3cdtf); (qualifier = exact); (point = start)

2018-09-21 10:30:53

AssociatedEntity

Name

John Wiedenhoeft

Role

Affiliation

Rutgers University. School of Graduate Studies

AssociatedObject

Type

License

Name

Author Agreement License

Detail

I hereby grant to the Rutgers University Libraries and to my school the non-exclusive right to archive, reproduce and distribute my thesis or dissertation, in whole or in part, and/or my abstract, in whole or in part, in and from an electronic format, subject to the release date subsequently stipulated in this submittal form and approved by my school. I represent and stipulate that the thesis or dissertation and its abstract are my original work, that they do not infringe or violate any rights of others, and that I make these grants as the sole owner of the rights to my thesis or dissertation and its abstract. I represent that I have obtained written permissions, when necessary, from the owner(s) of each third party copyrighted matter to be included in my thesis or dissertation and will supply copies of such upon request by my school. I acknowledge that RU ETD and my school will not distribute my thesis or dissertation or its abstract if, in their reasonable judgment, they believe all such rights have not been secured. I acknowledge that I retain ownership rights to the copyright of my work. I also retain the right to use all or part of this thesis or dissertation in future works, such as articles or books.

Status

Availability

Status

Open

Reason

Permission or license

Back to the top

Technical

RULTechMD (ID = TECHNICAL1)

ContentModel

ETD

OperatingSystem (VERSION = 5.1)

windows xp

CreatingApplication

Version

1.5

ApplicationName

pdfTeX-1.40.18

DateCreated (point = end); (encoding = w3cdtf); (qualifier = exact)

2018-10-01T19:51:36

DateCreated (point = end); (encoding = w3cdtf); (qualifier = exact)

2018-10-01T19:51:36

Back to the top

Version 8.5.5