Abstract
RNA structure plays a key role in regulating many mechanisms crucial for correct cellular functioning, such as RNA stability, transcription, and mRNA translation rates. To identify RNA structural regulatory elements, chemical and enzymatic structure probing is routinely used to interrogate RNA structure both in vivo and in vitro [1]. In these structure probing experiments, a chemical agent reacts with the RNA molecule in a structure-dependent way, cleaving or otherwise modifying its flexible parts. The modified positions can then be detected by primer extension analyses, providing valuable structural information that can be used to constrain energy-based RNA structure prediction software and significantly improve prediction accuracy [2, 3].
Coupled with high-throughput sequencing, structure probing allows interrogation of thousands of molecules in a single reaction, holding the potential to revolutionise our understanding of the role of RNA structure in regulation of gene expression. However, despite major technological advances, intrinsic noise and high coverage requirements greatly limit the applicability of these techniques. Existing methods [4, 5, 6] do not provide strategies for correcting biases of the technology and are not sufficiently informed by inter-replicate variability in order to perform justifiable statistical assessments.
We developed a probabilistic modelling pipeline which specifically accounts for biological variability and provides automated empirical strategies to correct coverage- and sequence-dependent biases in the data. Our model supports multiple experimental replicates in both control and treatment conditions and computes an empirical p-value for each nucleotide by comparing a measure of variability between conditions. These p-values are then used as observations in a Beta-Uniform mixture hidden Markov model, which outputs posterior probabilities of modification transcriptome-wide. This obviates the need for setting arbitrary thresholds and other post-processing.
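The two steps described above (empirical p-values, then a Beta-Uniform mixture hidden Markov model) can be illustrated with a minimal Python sketch. It is not the pipeline's implementation. The function names, the add-one smoothing, the Beta shape parameters `a` and `b`, and the transition and initial probabilities are all assumptions chosen for illustration; the abstract does not specify the variability measure or how the parameters are estimated.

```python
import math
import numpy as np


def empirical_p_values(null_values, observed):
    """Fraction of null (e.g. control-vs-control) values >= each observed value.

    Add-one smoothing (an assumption) keeps every p-value strictly positive.
    """
    null_sorted = np.sort(np.asarray(null_values, dtype=float))
    observed = np.asarray(observed, dtype=float)
    # Count of null values strictly below each observed value.
    n_below = np.searchsorted(null_sorted, observed, side="left")
    n_at_least = null_sorted.size - n_below
    return (n_at_least + 1.0) / (null_sorted.size + 1.0)


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(log_norm + (a - 1.0) * np.log(x) + (b - 1.0) * np.log1p(-x))


def posterior_modified(p_values, trans, init, a=0.5, b=5.0):
    """Posterior probability of the 'modified' state at each position.

    State 0 (unmodified) emits p-values from Uniform(0, 1); state 1
    (modified) emits from Beta(a, b), which favours small p-values here.
    Posteriors come from the scaled forward-backward algorithm.
    """
    p = np.clip(np.asarray(p_values, dtype=float), 1e-6, 1.0 - 1e-6)
    trans = np.asarray(trans, dtype=float)
    init = np.asarray(init, dtype=float)
    emis = np.column_stack([np.ones_like(p), beta_pdf(p, a, b)])
    n = p.size

    alpha = np.zeros((n, 2))
    scale = np.zeros(n)
    alpha[0] = init * emis[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emis[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    beta = np.ones((n, 2))
    for t in range(n - 2, -1, -1):
        beta[t] = (trans @ (emis[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma[:, 1]


# Hypothetical inputs: five nucleotides, two of them with very small p-values.
example_p = [0.9, 0.8, 0.001, 0.002, 0.7]
example_trans = [[0.95, 0.05], [0.20, 0.80]]  # rows: from state, cols: to state
example_init = [0.9, 0.1]
example_post = posterior_modified(example_p, example_trans, example_init)
```

In this sketch the posterior for each nucleotide depends on its neighbours through the transition matrix, so a run of small p-values reinforces itself. This is one way to obtain per-position probabilities without a hard threshold on the p-values.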
We demonstrate on two yeast data sets that our method has greatly increased sensitivity, enabling the identification of modified regions on many more transcripts compared with existing pipelines. Our method also provides accurate and confident predictions at much lower coverage levels than those recommended in recent studies [6, 7], which are normally only met for a handful of transcripts in transcriptome-wide experiments. Our results show that statistical modelling greatly extends the scope and potential of transcriptome-wide structure probing experiments.
[1] Kubota et al. "Progress and challenges for chemical probing of RNA structure inside living cells." Nature Chemical Biology (2015).
[2] Wu et al. "Improved prediction of RNA secondary structure by integrating the free energy model with restraints derived from experimental probing data." Nucleic Acids Research (2015).
[3] Ouyang et al. "SeqFold: genome-scale reconstruction of RNA secondary structure integrating high-throughput sequencing data." Genome Research (2013).
[4] Ding et al. "In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features." Nature (2014).
[5] Kielpinski et al. "Chapter Six - Reproducible Analysis of Sequencing-Based RNA Structure Probing Data with User-Friendly Tools." Methods in Enzymology (2015).
[6] Talkish et al. "Mod-seq: high-throughput sequencing for chemical probing of RNA structure." RNA (2014).
[7] Siegfried et al. "RNA motif discovery by SHAPE and mutational profiling (SHAPE-MaP)." Nature Methods (2014).
Original language | English |
---|---|
Publication status | Published - 5 Dec 2016 |
Event | 11th Women in Machine Learning Workshop, Centre de Convencions Internacional Barcelona, Barcelona, Spain. Duration: 5 Dec 2016 → 6 Dec 2016. http://wimlworkshop.org/2016/ |
Workshop
Workshop | 11th Women in Machine Learning Workshop |
---|---|
Abbreviated title | WiML 2016 |
Country/Territory | Spain |
City | Barcelona |
Period | 5/12/16 → 6/12/16 |
Internet address | http://wimlworkshop.org/2016/ |