Comparing de novo assemblers for 454 transcriptome data

Sujai Kumar, Mark L Blaxter

Research output: Contribution to journal; Article; peer-review

Abstract

Background
Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, it is important to correctly assemble these to estimate the sequence of all the transcripts. Most transcriptome assembly projects use only one program for assembling 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis.

Results
Although no single assembler performed best on all our criteria, Newbler 2.5 gave longer contigs and better alignments to some reference sequences, and was fast and easy to use. SeqMan assemblies performed best on the criterion of recapitulating known transcripts, and had more novel sequence than the other assemblers, but generated an excess of small, redundant contigs. The remaining assemblers all performed almost as well, with the exception of Newbler 2.3 (the version currently used by most assembly projects), which generated assemblies that had significantly lower total length. As different assemblers use different underlying algorithms to generate contigs, we also explored merging assemblies and found that the merged datasets not only aligned better to reference sequences than individual assemblies, but were also more consistent in the number and size of contigs.

Conclusions
Transcriptome assemblies are smaller than genome assemblies and thus should be more computationally tractable, but are often harder because individual contigs can have highly variable read coverage. Comparing single assemblers, Newbler 2.5 performed best on our trial data set, but other assemblers were closely comparable. Combining differently optimal assemblies from different programs, however, gave a more credible final product, and this strategy is recommended.
Original language: English
Article number: 571
Number of pages: 12
Journal: BMC Genomics
Volume: 11
Publication status: Published - Oct 2010

Keywords

  • Algorithms
  • Animals
  • Base Sequence
  • Contig Mapping
  • Databases, Genetic
  • Expressed Sequence Tags
  • Female
  • Filarioidea
  • Gene Expression Profiling
  • Gene Expression Regulation
  • Male
  • Reference Standards
  • Reproducibility of Results
  • Sequence Alignment
  • Sequence Analysis, DNA
  • Temperature
