Abstract
The growing demand for biological products drives many efforts to maximize expression of heterologous proteins. Advances in high-throughput sequencing can produce data suitable for building sequence-to-expression models with machine learning. The most accurate models have been trained on one-hot encodings, a mechanism-agnostic representation of nucleotide sequences. Moreover, studies have consistently shown that training on mechanistic sequence features leads to much poorer predictions, even with features that are known to correlate with expression, such as DNA sequence motifs, codon usage, or properties of mRNA secondary structures. However, despite their excellent local accuracy, current sequence-to-expression models can fail to generalize to sequences far from the training data. Through a comparative study across datasets in Escherichia coli and Saccharomyces cerevisiae, here we show that mechanistic sequence features can provide gains in model generalization, and thus improve the models' utility for predictive sequence design. We explore several strategies to integrate one-hot encodings and mechanistic features into a single predictive model, including feature stacking, ensemble model stacking, and geometric stacking, a novel architecture based on graph convolutional neural networks. Our work casts new light on mechanistic sequence features, underscoring the importance of domain knowledge and feature engineering for accurate prediction of protein expression levels.
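As a rough illustration of the two sequence representations named in the abstract, the sketch below one-hot encodes a nucleotide sequence and concatenates the result with a single mechanistic feature. It is not the authors' code. The function names are hypothetical, GC content is used only as a simple stand-in for the mechanistic features the abstract lists (motifs, codon usage, mRNA secondary structure properties), and reading "feature stacking" as plain concatenation of the two feature vectors is an assumption.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # column order of the one-hot encoding


def one_hot(seq):
    """Flattened one-hot encoding: 4 entries per sequence position."""
    index = {nt: i for i, nt in enumerate(NUCLEOTIDES)}
    encoding = np.zeros((len(seq), len(NUCLEOTIDES)))
    for pos, nt in enumerate(seq.upper()):
        encoding[pos, index[nt]] = 1.0
    return encoding.ravel()


def gc_content(seq):
    """Fraction of G and C bases (illustrative stand-in feature)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def stack_features(seq):
    """Assumed form of feature stacking: concatenate both representations."""
    return np.concatenate([one_hot(seq), [gc_content(seq)]])


features = stack_features("ATGC")
```

A vector built this way could be passed to any regression model. The one-hot part carries no biological assumptions, while the appended entry encodes prior knowledge about the sequence.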
Original language | English |
---|---|
Article number | gkaf020 |
Pages (from-to) | 1-14 |
Number of pages | 14 |
Journal | Nucleic Acids Research |
Volume | 53 |
Issue number | 3 |
DOIs | |
Publication status | Published - 28 Jan 2025 |
Keywords / Materials (for Non-textual outputs)
- bioinformatics and computational biology
- genomics and metagenomics
Datasets
- Code and data for "Improving the generalization of protein expression models with mechanistic sequence information"
  Shen, Y. (Creator), Kudla, G. (Creator) & Oyarzún, D. (Creator), Zenodo, 6 Jan 2025
  Dataset