Physically interpretable machine learning methods for transcription factor binding site identification using principled energy thresholds and occupancy
Drawid, Amar Mohan. Physically interpretable machine learning methods for transcription factor binding site identification using principled energy thresholds and occupancy. Retrieved from https://doi.org/doi:10.7282/T3FT8M9X
Title: Physically interpretable machine learning methods for transcription factor binding site identification using principled energy thresholds and occupancy
Description: Regulation of gene expression is pivotal to cell behavior. It is achieved predominantly by transcription factor proteins binding to specific DNA sequences (sites) in gene promoters. Identification of these short, degenerate sites is therefore an important problem in biology. The major drawbacks of the probabilistic machine learning methods in vogue are the use of arbitrary thresholds and the lack of biophysical interpretations of statistical quantities. We have developed two machine learning methods and linked them to the biophysics of transcription factor binding by incorporating simple physical interactions. These methods estimate site binding energy, recognizing that it determines a site's function and evolutionary fitness. They use the occupancy probability of a transcription factor on a DNA sequence as the discriminant function because it has a straightforward physical interpretation, forms a bridge between binding energy and evolutionary fitness, and has a natural threshold for classifying sequences into sites that allows establishing the threshold in a principled manner. Our methods incorporate additional characteristics of sites to enhance their identification. The first method, based on a hidden Markov model (HMM), identifies self-overlapping sites by combining the effects of their alternative binding modes. It learns the threshold by training emission probabilities using unaligned sequences containing known sites and estimating transition probabilities to reflect site density in all promoters in a genome. While identifying sites, it adjusts parameters to model site density changing with the distance from the transcription start site. Moreover, it provides guidance for designing padding sequences in experiments involving self-overlapping sites. Our second method, the Phylogeny-based Quadratic Programming Method of Energy Matrix Estimation (PhyloQPMEME), integrates evolutionary conservation to reduce false positives while identifying sites. It learns the threshold by solving an iterative quadratic programming problem to optimize the distribution of correlated binding energies of neutrally evolving orthologous sequences while restricting the values of binding energies of known sites and their orthologs. We have used the NF-κB transcription factor family as a case study for both methods and gained new insights into its biology.
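
The idea of using occupancy probability as a discriminant with a natural threshold can be illustrated with a minimal sketch. The description does not give the functional form used in the thesis, so this example assumes the standard two-state (Fermi-Dirac) occupancy, P = 1 / (1 + exp(E - mu)), with energies in units of kT, an additive position-specific energy matrix, and a cutoff at P = 0.5 (equivalently E = mu). The matrix values, the chemical potential `mu`, and the function names `binding_energy`, `occupancy`, and `is_site` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical additive energy matrix, one dict per site position.
# Energies are in units of kT; the values are invented for illustration
# and are not taken from the thesis.
ENERGY_MATRIX = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.5, "T": 1.0},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 1.5},
    {"A": 1.5, "C": 1.0, "G": 2.0, "T": 0.0},
]


def binding_energy(seq, matrix=ENERGY_MATRIX):
    """Sum the per-position energy contributions (additive model, an assumption)."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match the energy matrix width")
    return sum(column[base] for column, base in zip(matrix, seq))


def occupancy(energy, mu):
    """Two-state occupancy probability, assuming the Fermi-Dirac form.

    `mu` plays the role of a chemical potential set by the free
    transcription factor concentration (in units of kT).
    """
    return 1.0 / (1.0 + math.exp(energy - mu))


def is_site(seq, mu, matrix=ENERGY_MATRIX):
    """Classify a sequence as a site when occupancy exceeds one half (E < mu)."""
    return occupancy(binding_energy(seq, matrix), mu) > 0.5


if __name__ == "__main__":
    mu = 2.0  # assumed value, for illustration only
    for seq in ("ACGT", "CAAC"):
        e = binding_energy(seq)
        print(seq, e, round(occupancy(e, mu), 3), is_site(seq, mu))
```

Under these assumptions the threshold is not a free score cutoff: a sequence is called a site exactly when its binding energy falls below `mu`, which is the point where the factor is bound half of the time. How the thesis actually learns that threshold (through HMM parameters or quadratic programming) is not reproduced here.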