TY - JOUR
TI - Bioinformatic analysis of polyadenylation site activity in vertebrates
DO - https://doi.org/doi:10.7282/T3611031
PY - 2010
AB - Most eukaryotic protein-coding precursor messenger RNAs (pre-mRNAs) undergo polyadenylation after transcription. Polyadenylation is a two-step enzymatic reaction in which the emerging pre-mRNA is cleaved from the transcription complex and adenosine nucleotides are then polymerized onto the cleaved 3′ end to form the poly(A) tail. Biologically, the poly(A) tail increases mRNA stability, translatability, and nuclear export. Surprisingly, a large number of protein factors have been found to be involved in these apparently simple cleavage and polymerization steps, suggesting that polyadenylation is under complex regulation. In this study, I therefore investigate the regulatory elements of eukaryotic polyadenylation. The proposed close-species comparison approach revealed asymmetric selection pressure around the polyadenylation cleavage site (PAS). The region from the PAS to approximately 200 nucleotides (nts) upstream was found to be much more highly conserved than the downstream region and other parts of the 3′ UTR. Furthermore, over 2,000 long (>30 nts) conserved fragments at or just upstream of the PAS were identified through remote-species comparison. A substantial portion of them are longer than 100 nts, which is much longer than any known RNA-protein recognition site. A PAS classifier was built using logistic regression in order to study the characteristics of PAS. Not only does it improve the computational recognition of mammalian PAS over existing methods, it also helps identify a small number of genes that lack typical PAS characteristics such as the poly(A) signal and/or the U/GU-rich region. These findings provide useful experimental candidates for the study of the still unclear polyadenylation compensatory and/or regulatory elements.
At present, no sequence consensus has been identified for the downstream U/GU-enriched region. I have therefore designed a novel rule-based nucleotide sequence motif-finding algorithm, called iTriplet, to target long and degenerate motifs, with special attention to the PAS downstream sequence. iTriplet has been demonstrated to handle motifs longer than 20 nts, which remain a challenge for existing methods. The utility of iTriplet was confirmed with a dual-luciferase reporter assay, which showed that it accurately predicts PAS downstream elements.
KW - Biochemistry
KW - RNA-protein interactions
KW - Bioinformatics
KW - Logistic regression analysis
LA - eng
ER -