Investigating the impact of database choice on the accuracy of metagenomic read classification for the rumen microbiome

Becky Smith*, Laura Glendinning, Alan W. Walker, Mick Watson

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Microbiome analysis is quickly moving towards high-throughput methods such as metagenomic sequencing. Accurate taxonomic classification of metagenomic data relies on reference sequence databases and their associated taxonomy. However, for understudied environments such as the rumen microbiome, many sequences will be derived from novel or uncultured microbes that are not present in reference databases. As a result, taxonomic classification of metagenomic data from understudied environments may be inaccurate. To assess the accuracy of taxonomic read classification, this study classified metagenomic data that had been simulated from cultured rumen microbial genomes from the Hungate collection. To assess the impact of reference databases on the accuracy of taxonomic classification, the data were classified with Kraken 2 using several reference databases. We found that the choice and composition of the reference database significantly affected taxonomic classification results and accuracy. In particular, NCBI RefSeq proved to be a poor choice of database. Our results indicate that inaccurate read classification is likely to be a significant problem, affecting all studies that use insufficient reference databases. We observed that adding cultured reference genomes from the rumen to the reference database greatly improved classification rate and accuracy. We also demonstrated that metagenome-assembled genomes (MAGs) have the potential to further enhance classification accuracy by representing uncultivated microbes, sequences of which would otherwise be unclassified or incorrectly classified. However, classification accuracy was strongly dependent on the taxonomic labels assigned to these MAGs.
We therefore highlight the importance of accurate reference taxonomic information and suggest that, with formal taxonomic lineages, MAGs have the potential to improve classification rate and accuracy, particularly in environments such as the rumen that are understudied or contain many novel genomes.
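To make the two quantities in the abstract concrete, the following is a minimal sketch of how classification rate and accuracy could be computed for simulated reads whose source taxon is known. It is not the authors' pipeline. It assumes Kraken 2's default per-read output (tab-separated, with a C/U status, the read ID and the assigned taxonomy ID in the first three columns) and scores a read as correct only when the assigned taxonomy ID exactly matches the known one, which is a simplification. The function name `summarise`, the `Summary` class, and the read IDs and taxonomy IDs in the example are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    total: int       # all reads seen
    classified: int  # reads Kraken 2 marked as "C"
    correct: int     # classified reads whose taxid matches the truth

    @property
    def classification_rate(self) -> float:
        # Fraction of all reads that received any label.
        return self.classified / self.total if self.total else 0.0

    @property
    def accuracy(self) -> float:
        # Fraction of classified reads that received the expected label.
        return self.correct / self.classified if self.classified else 0.0


def summarise(kraken_lines, truth):
    """Tally per-read results against a {read_id: taxid} truth mapping.

    Assumes default Kraken 2 per-read output: status, read ID, taxid, ...
    Exact taxid matching is an assumption of this sketch.
    """
    total = classified = correct = 0
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        total += 1
        if status != "C":
            continue  # unclassified read
        classified += 1
        if truth.get(read_id) == taxid:
            correct += 1
    return Summary(total, classified, correct)


# Toy data: four simulated reads with made-up IDs and taxids.
example_output = [
    "C\tread1\t816\t150\t816:116",
    "C\tread2\t1263\t150\t1263:116",
    "U\tread3\t0\t150\t0:116",
    "C\tread4\t816\t150\t816:116",
]
example_truth = {"read1": "816", "read2": "838", "read3": "838", "read4": "816"}

result = summarise(example_output, example_truth)
print(f"classification rate: {result.classification_rate:.2f}")
print(f"accuracy: {result.accuracy:.2f}")
```

Running the same tally on the output obtained with each reference database would give a simple side-by-side comparison. A fuller evaluation would probably compare assignments at a chosen taxonomic rank rather than by exact taxonomy ID.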

Original language: English
Article number: 57
Pages (from-to): 1-13
Number of pages: 13
Journal: Animal Microbiome
Issue number: 57
Early online date: 18 Nov 2022
Publication status: Published - 18 Nov 2022


  • Metagenome-assembled genomes
  • Metagenome
  • Rumen
  • Microbiome
  • Reference databases
  • Read classification
  • Taxonomy

