Pre-processing metagenomic data

A vital part of metagenome analysis is first ensuring that you are working with good-quality data. During my PhD I have used the tools PRINSEQ and DeconSeq (Schmieder and Edwards, 2011) to process and filter metagenomic data sequenced using 454 pyrosequencing.

PRINSEQ generates summary statistics for a dataset, which give a first indication of data quality. Data can then be filtered on a number of criteria, including the removal of duplicate sequences, of sequences containing more than a chosen percentage of ambiguous (i.e. “N”) bases, and of sequences with a mean quality score below a chosen threshold. Sequences can be trimmed to a defined length, to exclude poor-quality bases, or to remove poly-A/T tails. Finally, the output data can be converted between FASTA+QUAL and FASTQ formats and between DNA and RNA sequences, and sequence IDs can be changed.
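To make the filtering criteria concrete, here is a minimal Python sketch of three of them: duplicate removal, an ambiguous-base limit, and a mean-quality cut-off. It is not PRINSEQ's own code or interface. The function names, the in-memory read format, and the default thresholds are all assumptions chosen for illustration.

```python
from statistics import mean


def passes_filters(seq, quals, max_n_percent=1.0, min_mean_quality=20):
    """Return True if a read passes the ambiguous-base and mean-quality filters.

    The default thresholds are illustrative assumptions, not PRINSEQ defaults.
    """
    if not seq:
        return False
    # Percentage of ambiguous bases in the read.
    n_percent = 100.0 * seq.upper().count("N") / len(seq)
    if n_percent > max_n_percent:
        return False
    # Mean of the per-base quality scores.
    return mean(quals) >= min_mean_quality


def filter_reads(reads, **thresholds):
    """Drop exact duplicate sequences, then apply the per-read filters.

    `reads` is an iterable of (read_id, sequence, list_of_quality_scores),
    a hypothetical in-memory format used only for this sketch.
    """
    seen = set()
    kept = []
    for read_id, seq, quals in reads:
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of an earlier read
        seen.add(key)
        if passes_filters(seq, quals, **thresholds):
            kept.append((read_id, seq, quals))
    return kept


# Toy data: five short reads with per-base quality scores.
reads = [
    ("r1", "ACGTACGTAC", [30] * 10),
    ("r2", "ACGTACGTAC", [30] * 10),  # exact duplicate of r1
    ("r3", "ACGNNNGTAC", [30] * 10),  # 30% ambiguous bases
    ("r4", "TTGCATGCAA", [10] * 10),  # low mean quality
    ("r5", "GGCATTACGA", [25] * 10),
]
kept_ids = [read_id for read_id, _, _ in filter_reads(reads)]
print(kept_ids)
```

With these toy reads, only r1 and r5 survive. Note that this sketch only catches exact duplicates.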

DeconSeq is typically run after PRINSEQ and removes ‘contaminant’ sequences, i.e. sequences that belong to species you do not wish to analyse. Reads are aligned, using a modified version of the BWA-SW aligner, against a reference database of the chosen contaminant species, and any matching sequences are removed.
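The decision step of contaminant removal can be sketched as follows. This assumes the alignment against the contaminant database has already been run and summarised as one (read ID, percent identity, percent of the read covered) tuple per hit. That hit format, the function name, and the threshold values are hypothetical and do not reflect DeconSeq's actual input or options.

```python
def split_by_contamination(read_ids, hits, min_identity=94.0, min_coverage=90.0):
    """Split reads into (clean, flagged) lists based on alignment hits.

    `hits` is an iterable of (read_id, percent_identity, percent_coverage)
    tuples, a hypothetical summary of alignments against the contaminant
    reference. The thresholds are illustrative assumptions.
    """
    contaminated = {
        read_id
        for read_id, identity, coverage in hits
        if identity >= min_identity and coverage >= min_coverage
    }
    clean = [r for r in read_ids if r not in contaminated]
    flagged = [r for r in read_ids if r in contaminated]
    return clean, flagged


read_ids = ["r1", "r2", "r3", "r4"]
hits = [
    ("r2", 99.1, 100.0),  # strong match: treated as contaminant
    ("r3", 80.0, 95.0),   # identity too low to flag
    ("r4", 97.0, 40.0),   # only a short part of the read matches
]
clean, flagged = split_by_contamination(read_ids, hits)
print(clean, flagged)
```

Requiring both identity and coverage to pass keeps reads that only share a short or weak match with the contaminant reference.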

Both tools are free to use and are available either as a web version or as a standalone version.

Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]