Description
Code and data for the paper “Improving the generalization of protein expression models with mechanistic sequence information” by Shen, Kudla and Oyarzún, Nucleic Acids Research, 2025.

- data.zip: all datasets in csv format.
- code.zip: Python code for reproducing the results of the paper.

1. Code overview

The code comprises five Jupyter notebooks, two environment files and three sub-folders.

- graphgym.yml: the environment for GNN.ipynb
- machine.yml: the environment for the other Jupyter notebooks

The 5'-CDS training data (Ecoli_data.csv) come from Cambray's dataset [1]. They are preprocessed by Part 1-3 of nondeep_model_training.ipynb and then used for the unsupervised and supervised learning tasks. Once the model results are available, run performance_plotting.ipynb to produce the plots in the paper. GenSLM on 5'-CDS and the analyses of the toehold and promoter datasets are in separate folders.

Please run preprocessing.ipynb first to convert the sequence data in the csv file into the different encodings (OH = one-hot; BP = biophysical properties, called "mechanistic features" in the paper; mixed = a mix of the former two encodings), saved as *.npy files.

2. Genotype analysis of 5'-CDS data (Figure 1 in the paper)

- genotype_analysis.ipynb: runs the UMAP, K-means clustering and PCA analyses.

3. Non-deep machine learning models of 5'-CDS data (models for Figures 2 and 4 in the paper)

- nondeep_model_training.ipynb (Part 1): trains 3 non-deep machine learning models on mechanistic features or one-hot encoding (Figure 2), or on stacked features (Figure 4).

4. Ensemble stacking of 5'-CDS data (models for Figure 4 in the paper)

- nondeep_model_training.ipynb (Part 2-3): trains 4 ensemble models based on mechanistic features or stacked features.

5. CNN and GNN models of 5'-CDS data (models for Figure 4 in the paper)

- CNN.ipynb: trains the CNN model on stacked features.
- GNN.ipynb: trains the GNN model on stacked features.

6. GenSLM embedding

- embedding.ipynb: computes language model embeddings of the 5'-CDS data with the GenSLM model (https://github.com/ramanathanlab/genslm).
- UMAP_GenSLM.ipynb: visualizes the embeddings with a 2D UMAP.
- nondeep_model_slm.ipynb: trains the non-deep ML models on the embeddings.

7. Toehold dataset

- toehold_splitting.ipynb: selects two groups of ~20000 sequences each from the original toehold dataset (2019-07-08_toehold_dataset_proc_with_params_QC1.1.csv) of Angenent-Mari et al. [2] by bounding their regions in the UMAP plot (Figure 3B). After the irrelevant columns are deleted, the dataset is ready for model training (toehold_2groups.csv).
- toehold_hamming.ipynb: runs the detailed Hamming distance analysis and the phenotype plotting of the two groups. These are the results in Figure 3B of the paper.
- Toehold_OH.ipynb and Toehold_MF.ipynb: train the RF and MLP models on the two groups with one-hot and mechanistic feature encodings, respectively. These are the results in Figure 3C-D of the paper.

8. Promoter dataset

The promoter dataset (yeast_with_MF.csv) is from Vaishnav et al. [3] and was preprocessed by Nikolados et al. [4], with 244 mechanistic features filled in for each promoter sequence.

- MF_filling.ipynb: calculates the mechanistic features (binding probability) for each promoter sequence.
- yeast_model_MF.ipynb and yeast_model_OH.ipynb: train the RF model on the two encodings and test the local and generalization performance. These are the results in Figure 3G of the paper.

9. Non-deep machine learning models on multiple mutational series of 5'-CDS data (models for Figure 2E in the paper)

- multiple_series_repeats.ipynb: trains RF and MLP models on 1/2/4/8/16/32 mutational series.

References

[1] Cambray, G., Guimaraes, J. C. & Arkin, A. P. Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli. Nature Biotechnology 36, 1005–1015 (2018).
[2] Angenent-Mari, N. M., Garruss, A. S., Soenksen, L. R., Church, G. & Collins, J. J. A deep learning approach to programmable RNA switches. Nature Communications 11, 5057 (2020).
[3] Vaishnav, E. D. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022).
[4] Nikolados, E.-M., Wongprommoon, A., Mac Aodha, O., Cambray, G. & Oyarzún, D. A. Accuracy and data efficiency in deep learning models of protein expression. Nature Communications 13, 7755 (2022).
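
Two operations named in the description above, the one-hot (OH) encoding produced by preprocessing.ipynb and the Hamming distance used in toehold_hamming.ipynb, can be illustrated with a minimal sketch in plain Python. It is not taken from the notebooks: the function names, the A/C/G/T column order and the all-zero handling of unknown symbols are assumptions made for illustration, and the notebooks themselves store the encodings as *.npy arrays rather than nested lists.

```python
from typing import List

# Assumed column order for the one-hot encoding; the notebooks may differ.
NUCLEOTIDES = "ACGT"


def one_hot_encode(seq: str) -> List[List[int]]:
    """Encode a nucleotide sequence as one row of four 0/1 values per position."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:  # assumption: unknown symbols (e.g. N) stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    toy_sequences = ["ATGGCA", "ATGACA"]  # toy data, not from the datasets
    for seq in toy_sequences:
        print(seq, one_hot_encode(seq))
    print("Hamming distance:", hamming_distance(*toy_sequences))
```

For a sequence of length n the encoding has n rows and 4 columns. Stacking the encodings of equal-length sequences gives the kind of array that would typically be saved to a *.npy file.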
Data Citation
Shen, Y., Kudla, G., & Oyarzún, D. (2025). Code and data for "Improving the generalization of protein expression models with mechanistic sequence information". Zenodo. https://doi.org/10.5281/zenodo.14604309
| Date made available | 6 Jan 2025 |
|---|---|
| Publisher | Zenodo |
Research output
- 1 Article
- Improving the generalization of protein expression models with mechanistic sequence information
  Shen, Y., Kudla, G. & Oyarzún, D. A., 28 Jan 2025, In: Nucleic Acids Research. 53, 3, p. 1-14 (14 p.), gkaf020.
  Research output: Contribution to journal › Article › peer-review
  Open Access