
Code and data for "Improving the generalization of protein expression models with mechanistic sequence information"

Dataset

Description

Code and data for the paper "Improving the generalization of protein expression models with mechanistic sequence information" by Shen, Kudla and Oyarzún, Nucleic Acids Research, 2025.

data.zip: all datasets in csv format.
code.zip: Python code for reproducing the results of the paper.

1. Code overview

code.zip contains five Jupyter notebook files, two environment files and three sub-folders.

graphgym.yml: the environment for GNN.ipynb
machine.yml: the environment for the other Jupyter notebooks

The 5'CDS training data (Ecoli_data.csv) is from Cambray's dataset [1]. It is preprocessed by Parts 1-3 of nondeep_model_training.ipynb and then used for the unsupervised and supervised learning tasks. After obtaining the model results, run performance_plotting.ipynb to generate the plots in the paper. The GenSLM analysis of the 5'CDS data and the analyses of the toehold and promoter datasets are in separate folders.

Please run preprocessing.ipynb first to convert the sequence data in the csv file into the different encodings (OH = one-hot, BP = biophysical properties = "mechanistic features" in the paper, mixed = a mix of the former two encodings), saved as *.npy files.

2. Genotype analysis of 5'CDS data - Figure 1 in the paper

genotype_analysis.ipynb: performs UMAP, k-means clustering and PCA analysis.

3. Non-deep machine learning models of 5'CDS data - train models for Figure 2 & 4 in the paper

nondeep_model_training.ipynb (Part 1): trains 3 non-deep machine learning models on mechanistic features, one-hot encoding (Figure 2) or feature stacking (Figure 3).

4. Ensemble stacking of 5'CDS data - train models for Figure 4 in the paper

nondeep_model_training.ipynb (Parts 2-3): trains 4 ensemble models based on mechanistic features or stacked features.

5. CNN and GNN models of 5'CDS data - train models for Figure 4 in the paper

CNN.ipynb: trains a CNN model on stacked features.
GNN.ipynb: trains a GNN model on stacked features.

6. GenSLM embedding

embedding.ipynb: provides language model embeddings of the 5'CDS data based on the GenSLM model (https://github.com/ramanathanlab/genslm).

The embedding is first visualized as a 2D UMAP with UMAP_GenSLM.ipynb and then passed to nondeep_model_slm.ipynb for non-deep ML model training.

7. Toehold dataset

toehold_splitting.ipynb: extracts two groups of ~20000 sequences each from the original toehold dataset (2019-07-08_toehold_dataset_proc_with_params_QC1.1.csv) of Angenent-Mari et al. [2], by bounding their regions in the UMAP plot (Figure 3B). After the irrelevant columns are deleted, the dataset is ready for model training (toehold_2groups.csv).

toehold_hamming.ipynb: conducts the detailed Hamming distance analysis and the phenotype plotting of the two groups. These are the results in Figure 3B of the paper.

Toehold_OH.ipynb & Toehold_MF.ipynb: train the RF and MLP models on the two groups with one-hot and mechanistic feature encodings, respectively. These are the results in Figure 3C-D of the paper.

8. Promoter dataset

The promoter dataset (yeast_with_MF.csv) is from Vaishnav et al. [3] and was preprocessed by Nikolados et al. [4], with 244 mechanistic features filled in for each promoter sequence.

MF_filling.ipynb: calculates the mechanistic features (binding probability) for each promoter sequence.

yeast_model_MF.ipynb & yeast_model_OH.ipynb: train the RF model for the two encodings and test the local and generalization performance. These are the results in Figure 3G of the paper.

9. Non-deep machine learning models on multiple mutational series of 5'CDS data - train models for Figure 2E in the paper

multiple_series_repeats.ipynb: trains RF and MLP models on 1/2/4/8/16/32 mutational series.

References

[1] Cambray, G., Guimaraes, J. C. & Arkin, A. P. Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli. Nature Biotechnology 36, 1005–1015 (2018).
[2] Angenent-Mari, N. M., Garruss, A. S., Soenksen, L. R., Church, G. & Collins, J. J. A deep learning approach to programmable RNA switches. Nature Communications 11, 5057 (2020).
[3] Vaishnav, E. D. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022).
[4] Nikolados, E.-M., Wongprommoon, A., Aodha, O. M., Cambray, G. & Oyarzún, D. A. Accuracy and data efficiency in deep learning models of protein expression. Nature Communications 13, 7755 (2022).
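
Illustrative sketch

The description above mentions a one-hot (OH) encoding of sequences and a Hamming distance analysis between sequences. The following is a minimal sketch of both operations, not the authors' implementation: the function names, the alphabet order "ACGT", the handling of unknown characters and the toy sequences are all assumptions made for illustration.

```python
# Minimal sketch, NOT the code shipped in code.zip.
# Assumptions: alphabet order "ACGT", unknown characters encoded as all zeros,
# and hypothetical function names.

ALPHABET = "ACGT"  # assumed nucleotide order; the repository may use another


def one_hot(seq):
    """Encode a nucleotide sequence as a list of 4-element 0/1 rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        col = index.get(base)
        if col is not None:  # unknown characters (e.g. N) stay all-zero
            row[col] = 1
        rows.append(row)
    return rows


def hamming(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))


# Toy sequences invented for this example; they are not taken from the datasets.
toy_a = "ATGC"
toy_b = "ATGG"
encoded = one_hot(toy_a)
distance = hamming(toy_a, toy_b)
```

To store such an encoding in the *.npy format mentioned above, the nested list could be converted with numpy.array and written with numpy.save; whether the notebooks do exactly this is not stated in the description.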

Data Citation

Shen, Y., Kudla, G., & Oyarzún, D. (2025). Code and data for "Improving the generalization of protein expression models with mechanistic sequence information". Zenodo. https://doi.org/10.5281/zenodo.14604309
Date made available: 6 Jan 2025
Publisher: Zenodo
