NextGenPartiGene: next generation transcriptome assembly annotation and exploitation toolkit

  • Blaxter, Mark (Principal Investigator)

Project Details

Description

The next generation of DNA sequence acquisition platforms is transforming the way research programs can collect data
on new species of interest. However, the tools for effective analysis and exploitation of these data are only now being
produced. One lacuna in the toolkit for next generation genomics data analysis is the area of de novo transcriptome
assembly and annotation. While mature tools for Sanger dideoxy transcriptomics (such as PartiGene) have wide take-up,
there is no equivalent tool for transcriptome data from next generation platforms, in particular the Roche 454 Titanium platform.
We propose the complete reengineering of the well-tried PartiGene suite of transcriptome tools to deliver
NextGenPartiGene, wherein we will embed best practice in data filtering, assembly, annotation and interpretation tools.
We will write the tool using Grails, a modern web framework that provides a robust platform for building rich web
applications. Using Grails will let us take advantage of existing libraries for common web application tasks, allowing us to
focus our development effort on the specific transcriptome analysis and visualization tools. The NextGenPartiGene user
interface will have two modes: assembling/annotating (for use by the individual building the dataset) and browsing/datamining
(for use by workers querying the data with specific research questions). Using the Grails server-client architecture
we can facilitate different levels of access to the data.
For the assembler-annotator client
(a) we will devise an open database schema to hold the new data types required for next generation sequence and its
annotation.
(b) we will build a workflow that takes raw Roche 454 sequence, trims it for adapters and quality, assembles it using best-practice
routines, and generates assembled contigs and mapped reads.
(c) we will build routines that use BLAST, InterProScan, annot8r, Psort and other annotation tools to decorate these
contigs with functional information.
(d) we will store these data in a relational database, and provide tools for data summary.
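
As an illustration of the trimming stage in step (b), the following is a minimal Python sketch and not the project's implementation (the proposal names Grails as its framework). It assumes a read is held as a base string plus a list of integer quality scores; the function names, the exact-match adapter search, and the default thresholds (min_q=20, min_len=50) are hypothetical choices made for the example.

```python
# Illustrative sketch only: adapter and quality trimming of a single read.
# Function names, the exact-match adapter search and the default thresholds
# are assumptions for this example, not details taken from the project.

def trim_adapter(seq, quals, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, quals
    return seq[:pos], quals[:pos]


def trim_low_quality_tail(seq, quals, min_q):
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def clean_read(seq, quals, adapter, min_q=20, min_len=50):
    """Return the trimmed (seq, quals), or None if the read ends up too short."""
    seq, quals = trim_adapter(seq, quals, adapter)
    seq, quals = trim_low_quality_tail(seq, quals, min_q)
    if len(seq) < min_len:
        return None
    return seq, quals
```

Reads for which clean_read returns None would be dropped before assembly. A production workflow would normally also tolerate mismatches in the adapter and might trim on a sliding quality window, both of which this sketch leaves out.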
For the browser-miner client
(e) we will provide intuitive interfaces to the data, from raw reads to predicted functions.
(f) we will provide visualisation tools for data summaries to aid searching and browsing.
(g) we will provide data download options to allow researchers to analyse selected data in external programmes.
The software will be published in open-access journals and made available under open source licensing. We will run an
instance of the software from a dedicated computer in Edinburgh to demonstrate its features.

Layman's description

Biologists have access to ever improving toolkits with which to ask probing questions of the natural world. One
revolutionary development that has taken place over the last forty years is the advent of DNA sequencing. We now have
the ability to decipher the genome sequence (or 'genetic blueprint') of any organism, and from this work out how it ticks.
About five years ago, this genomics revolution stepped up a gear, with the introduction of DNA sequencing technologies
that increased the rate of genome sequencing, and reduced the cost, many, many fold. These 'next generation'
technologies have suddenly made it possible for many researchers to start using genome sequencing in their work.
However, as with any new technology, new solutions bring new problems. In the case of genome sequencing it is a 'rich
person's' problem: researchers now can generate hundreds to thousands of times as much data as they used to, in a
small fraction of the time, but they do not have the computer tools to process and understand it. The reduced cost of
sequencing also means that many researchers who now can afford to use this technology do not have the long training
required in computing to successfully analyse the floods of data.
We propose to develop a set of easy-to-use tools, which we call NextGenPartiGene, using 'next generation' computing
frameworks, that will alleviate this problem. We are focussing on the problem of working out what genes an organism is
using (or 'expressing'), and what it is that these genes are likely to be doing. By sampling only the expressed genes of an
organism (or a part of an organism, such as a leaf or a particular tissue type) it is possible to build up a detailed picture of
the kinds of biochemical pathways the organism is running (what it can eat and what wastes it produces), and how
experimental interventions change these pathways.
We will build the NextGenPartiGene toolkit using an emerging model for such projects: the idea that much of the hard
work is done by a server computer, running clever programmes behind the scenes, and that this server is driven by a
client, accessed through a standard web browser. By building this client-server toolkit, we will be able to guide
researchers with vast amounts of next-generation sequencing data down the best-practice, tried-and-tested paths to full
and fruitful analysis. This means they will be able to extract maximum information from their data, and maximum value
from their research funding.
We will release the NextGenPartiGene tools as open-access software, so that others are free both to use them and to
modify and improve them to fit their needs.

Key findings

The AfterParty web application suite is available for beta testing at the AfterParty web site. The tool incorporates all of the planned core functionality, and additionally has new visualisation and data integration tools that make the platform very adaptable and useful. The tool is already in use by a number of research groups across the UK.
Acronym: AfterParty
Status: Finished
Effective start/end date: 1/08/11 to 31/01/13

Funding

  • BBSRC: £154,936.00

Research output

  • AfterParty

    Blaxter, M. (Photographer) & Jones, M. (Photographer), 1 Jan 2013

    Research output: Non-textual form › Web publication/site