Perl for Bioinformatics: Day 2 – querying a database (DBI)

So in Day 1, we learned how to use Perl to parse a file. Today we are going to learn how to extract information from a database.

A database is an organised collection of data. Since lots of Bioinformatics resources store their data in a database, it’s pretty useful to find out early on how to go about using them.

There are lots of different types of databases (e.g. MySQL, PostgreSQL, Oracle) and each of them has slight differences in the way that you interact with them. To make life easier, the good people of Perl have written a library called DBI that provides a common way of accessing them (feel free to have a good look around the DBI documentation on CPAN and come back when you’re ready).

Accessing a database with DBI

The following script provides a very simple example of how you might go about using the DBI library to extract data from your database. We are extracting OMIM data from one of our local Oracle databases, but you should be able to see how it can be extended to your own situation.

Note: you’ll need to ask your database administrator for suitable values to replace the ‘?????’ placeholders

#!/usr/bin/env perl

use strict;
use warnings;

use DBI;

# information that we need to specify to connect to the database
my $dsn         = "dbi:Oracle:host=?????;sid=?????";  # what type of database (Oracle) and where to find it (host and SID)
my $db_username = "?????";                            # we connect as a particular user
my $db_password = "?????";                            # with a password

# connect to the database
my $gene3d_dbh = DBI->connect( $dsn, $db_username, $db_password )
	or die "! Error: failed to connect to database";

# this is the query that will get us the data
my $omim_sql = <<"_SQL_";
SELECT
	OMIM_ID, UNIPROT_ACC, RESIDUE_POSITION, NATIVE_AA, MUTANT_AA, VALID, DESCRIPTION, NATIVE_AA_SHORT
FROM
	gene3d_12.omim
WHERE
	valid = 't'
_SQL_

# prepare the SQL (returns a "statement handle")
my $omim_sth = $gene3d_dbh->prepare( $omim_sql )
	or die "! Error: encountered an error when preparing SQL statement:\n"
		. "ERROR: " . $gene3d_dbh->errstr . "\n"
		. "SQL:   " . $omim_sql . "\n";

# execute the SQL
$omim_sth->execute
	or die "! Error: encountered an error when executing SQL statement:\n"
		. "ERROR: " . $omim_sth->errstr . "\n"
		. "SQL:   " . $omim_sql . "\n";

# go through each row
while ( my $omim_row = $omim_sth->fetchrow_hashref ) {
	printf "%-10s %-10s %-10s %-10s %-10s %s\n",
		$omim_row->{OMIM_ID},
		$omim_row->{UNIPROT_ACC},
		$omim_row->{RESIDUE_POSITION},
		$omim_row->{MUTANT_AA},
		$omim_row->{NATIVE_AA},
		$omim_row->{DESCRIPTION}
		;
}

This prints out:

100650     P05091     504        LYS        GLU        ALCOHOL SENSITIVITY - ACUTE ALCOHOL DEPENDENCE - PROTECTION AGAINST - INCLUDED;; HANGOVER - SUSCEPTIBILITY TO - INCLUDED;; SUBLINGUAL NITROGLYCERIN - SUSCEPTIBILITY TO POOR RESPONSE TO - INCLUDED;; ESOPHAGEAL CANCER - ALCOHOL-RELATED - SUSCEPTIBILITY TO - INCLUDED ALDH2 - GLU504LYS (dbSNP rs671)
100690     P02708     262        LYS        ASN        MYASTHENIC SYNDROME - CONGENITAL - SLOW-CHANNEL CHRNA1 - ASN217LYS
100690     P02708     201        MET        VAL        MYASTHENIC SYNDROME - CONGENITAL - SLOW-CHANNEL CHRNA1 - VAL156MET
...

Improvements

The first thing to notice is that this involved quite a lot of typing: writing out the SQL, setting up database handles and statement handles, checking return values, printing out decent error messages, and so on. Lots of typing means lots of code to maintain and far more chance of repeating yourself (which you really shouldn’t be doing).
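
One quick win, even with plain DBI, is to let the database handle do more of the work. As a rough sketch (untested, but using only standard DBI calls and assuming the same $gene3d_dbh and $omim_sql as above), selectall_arrayref with the Slice attribute will prepare, execute and fetch every row as a hash reference in a single call:

# fetch all rows in one go; each row comes back as a HASH reference
# (passing { RaiseError => 1 } to DBI->connect would also remove the
# need for the "or die ..." checks)
my $omim_rows = $gene3d_dbh->selectall_arrayref( $omim_sql, { Slice => {} } );

foreach my $omim_row ( @$omim_rows ) {
	printf "%-10s %-10s\n", $omim_row->{OMIM_ID}, $omim_row->{UNIPROT_ACC};
}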

When faced with the prospect of lots of typing, any decent (i.e. lazy) programmer will instantly be thinking about how they can avoid it: what shortcuts they can make, what libraries they can reuse. As luck would have it, the good people of Perl have already thought of this and come up with DBIx::Class, which will be the basis of a future post.

Discussion

There is a lot of value in understanding how raw DBI works. However, when you start writing and maintaining your own code, there is a huge amount of value in using a library (such as DBIx::Class) that builds on DBI and keeps you away from interacting with DBI directly.
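
To give a flavour of what that looks like, here is a very rough DBIx::Class sketch of the same query. The schema class (My::Schema) and result source name ('Omim') are made up for illustration; in practice you would generate them from the existing database with a tool like dbicdump (from DBIx::Class::Schema::Loader):

use My::Schema;   # hypothetical schema class generated from the database

my $schema = My::Schema->connect( $dsn, $db_username, $db_password );

# roughly equivalent to the raw SQL above
my $omim_rs = $schema->resultset('Omim')->search( { valid => 't' } );

while ( my $omim_row = $omim_rs->next ) {
	# accessor names here assume the loader lower-cased the column names
	printf "%-10s %-10s\n", $omim_row->omim_id, $omim_row->uniprot_acc;
}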

Perl for Bioinformatics: Day 1 – parsing a file

You don’t have to hang around too long in a Bioinformatics lab before someone asks you to parse data from a <insert your favourite data format here> file. Since we’ve just had some people join the lab who are new to coding, parsing a file seemed a good place to start.

The following is intended as a “Day 1” introduction to a typical Bioinformatics task in Perl.

Caveats

Some things to take into account before we start:

  1. It’s very likely that somebody, somewhere has already written a parser for your favourite data format. It’s also likely that they’ve already gone through the pain of dealing with edge cases that you aren’t aware of. You should really consider using their code or at least looking at how it works. If you’re writing in Perl (and in this case, we are) then you should have a rummage around CPAN (http://www.cpan.org) and BioPerl (http://www.bioperl.org).
  2. The following script is not intended as an example of “best practice” code – the intention here is to keep things simple and readable.

Getting the data

Okay so it’s our first day and we’ve just been asked to do the following:

Parse “genemap” data from OMIM

Err.. genemap? OMIM? If in doubt, the answer is nearly always the same: Google Is Your Friend.

Googling “download OMIM” gets us what we want. Now we just have to read the instructions, follow them, fill in the forms, point our web browser at the link that gets sent in an email and download the data.

If you get stuck, don’t be afraid to ask – either the person sitting next to you or by emailing the “contact” section of the website you’re using. However, also remember that you are here to do research – and a lot of that comes down to rummaging around, trying to figure stuff out for yourself.

It’s really useful to keep things tidy, so we’re going to create a local directory for this project by typing the following into a terminal (note: lines that start with ‘#’ are comments; anything after the ‘>’ prompt is a Linux command).

# go to my home directory
> cd ~/

# create a directory that we're going to work from
> mkdir omim_project

# move into to this new directory
> cd omim_project

# create a directory for the data 
# note: the date we downloaded the data will definitely be useful to know
> mkdir omim_data.2014_09_16

# look for the files we've just downloaded
> ls -rt ~/Downloads

# copy the ones we want into our data directory
> cp ~/Downloads/genemap ./omim_data.2014_09_16

Step 1. Setting up the script

Now we can write our first Perl script, which is going to parse this file: extract the data from the text file, organise it into a meaningful structure and output the information we need.

There are loads of different text editors you can use – I’m assuming you have access to ‘kate’.

# open up 'kate' with a new file for our script called 'parse_genemap.pl'
> kate parse_genemap.pl

Here’s the first bit of code – we’ll go through it line by line.

#!/usr/bin/env perl

use strict;
use warnings;

use File::Basename qw/ basename /;

# 'basename' is imported from File::Basename
my $PROGNAME = basename( $0 );

my $USAGE =<<"_USAGE";
usage: $PROGNAME <genemap_file>

Parses OMIM "genemap" file

_USAGE

my $genemap_filename = shift @ARGV or die "$USAGE";

Line 1 (called the ‘shebang’ line) tells the system that we want this file to be run as a Perl script.

#!/usr/bin/env perl

The next commands make sure that we find out straight away if we’ve made any mistakes in our code. It’s generally a good thing for our programs to “die early and loudly” as soon as a problem happens. This makes debugging much easier when things get more complicated.

use strict;
use warnings;

The following command imports a function ‘basename’ that we’ll use to get the name of the current script.

use File::Basename qw/ basename /;

Note: you can find out lots more about what a module does by entering the following into a terminal:

perldoc File::Basename

Perl puts lots of useful information into special variables. To get the name (and path) of the script we are currently running, we can use ‘$0’.

This is what Perl’s documentation pages have to say about it:

$0
Contains the name of the program being executed.

Feeding this into ‘basename’ will take the directory path off the script and just leave us with the script name (i.e. ‘parse_genemap.pl’). This is handy when we want to provide a simple note on how this script should be run.

# 'basename' is imported from File::Basename
my $PROGNAME = basename( $0 );

my $USAGE =<<"_USAGE";
usage: $PROGNAME <genemap_file>

Parses OMIM "genemap" file

_USAGE
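
If you want to convince yourself what ‘basename’ actually does, you can try it straight from a terminal with a throwaway one-liner (the path here is just an example):

> perl -MFile::Basename -e 'print basename("/home/user/omim_project/parse_genemap.pl"), "\n"'
parse_genemap.pl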

Step 2. Gather data from the command line

We’ve set this program up to take a single argument on the command line which will be the location of the ‘genemap’ file to parse. This gives us some flexibility if we want to parse different genemap files, or if the genemap files are likely to move around in the file system.

The arguments on the command line are stored in another special variable called ‘@ARGV’. The ‘@’ symbol means this is an array (or set of values) rather than a single value. We’ll use the built-in function ‘shift’ to get the first command line argument from that list.

my $genemap_filename = shift @ARGV or die "$USAGE";

If the list is empty then it means we’ve run the script without any arguments. If this happens, we want to end the program with a useful message about what the script does and how it should be run.
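
Putting that together: running the script with no arguments should just print the usage message and stop, while running it with a filename carries on to the parsing below (the paths here are just examples):

# no arguments: print the usage message and stop
> perl parse_genemap.pl
usage: parse_genemap.pl <genemap_file>

Parses OMIM "genemap" file

# with a filename: carry on and parse it
> perl parse_genemap.pl omim_data.2014_09_16/genemap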

Step 3. Reading the data

The following creates a “file handle” that can be used for reading and writing to a file. There are lots of ways of creating file handles in Perl (I suggest looking at ‘Path::Class’).

# create a file handle that we can use to input the contents of
# the genemap file
# (and complain if there's a problem)
# note: '<' means we are opening the file for reading
# (like input redirection in the shell)

open( my $genemap_fh, '<', $genemap_filename )
or die "! Error: failed to open file $genemap_filename: $!";

Again, if there’s a problem (e.g. the file we are given doesn’t exist) then we want to know about it straight away with a sensible error message.
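
As an aside, Path::Class can tidy this up a bit. Here’s a minimal sketch, assuming Path::Class is installed from CPAN; we’ll stick with plain open() for the rest of this post:

use Path::Class qw/ file /;

# openr() opens the file for reading and croaks with a sensible
# error message if anything goes wrong
my $genemap_fh = file( $genemap_filename )->openr;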

Now we are going to read the file line-by-line and create a data structure for each row. Most of the following code is just made up of comments.


# create an array that will contain our genemap entries
my @genemap_entries;

# go through the file line by line
while( my $line = $genemap_fh->getline ) {

  # an example line from file 'genemap' looks like:
  # 1.1|5|13|13|1pter-p36.13|CTRCT8, CCV|P|Cataract, congenital, Volkmann type||115665|Fd|linked to Rh in Scottish family||Cataract 8, multiple types (2)| | ||

  # the keys for each column are specified in 'genemap.key':
  # 1  - Numbering system, in the format  Chromosome.Map_Entry_Number
  # 2  - Month entered
  # 3  - Day     "
  # 4  - Year    "
  # 5  - Cytogenetic location
  # 6  - Gene Symbol(s)
  # 7  - Gene Status (see below for codes)
  # 8  - Title
  # 9  - Title, cont.
  # 10 - MIM Number
  # 11 - Method (see below for codes)
  # 12 - Comments
  # 13 - Comments, cont.
  # 14 - Disorders (each disorder is followed by its MIM number, if
  #      different from that of the locus, and phenotype mapping method (see
  #      below).  Allelic disorders are separated by a semi-colon.
  # 15 - Disorders, cont.
  # 16 - Disorders, cont.
  # 17 - Mouse correlate
  # 18 - Reference

  # split up the line based on the '|' character
  # note: we use '\|' since writing '|' on its own has a special meaning
  my @cols = split /\|/, $line;

  # create a HASH / associative array to provide labels for these values
  # note: arrays start from '0' so we take one away from the columns mentioned above
  my %genemap_entry = (
    id                 => $cols[0],
    month_entered      => $cols[1],
    day_entered        => $cols[2],
    year_entered       => $cols[3],
    date_entered       => "$cols[2]-$cols[1]-$cols[3]",   # "Day-Month-Year"
    cytogenic_location => $cols[4],
    gene_symbol        => $cols[5],
    # add more labels for the rest of the columns
  );

  # put a *reference* to this HASH onto our growing array of entries
  push @genemap_entries, \%genemap_entry;
}

It’s really important to add useful comments into your code. Not just what you are doing, but why you are doing it. In a few months time, you won’t remember any of this and if you don’t put these comments in, you’ll need to figure it out all over again.

Step 4. Process the data

Usually we would want to do something interesting with the data – such as filter out certain rows, sort these entries, etc. This would be a good place to do it, but we’ll save that for a different day.
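
Just to give a flavour of what that might look like, here’s a small sketch (not part of the script listing below) that filters and sorts the entries we’ve just built, using the keys we defined above:

# keep only the entries on chromosome 1 (the id column looks like "1.123")
my @chr1_entries = grep { $_->{id} =~ /^1\./ } @genemap_entries;

# sort those entries by gene symbol
my @sorted_entries = sort { $a->{gene_symbol} cmp $b->{gene_symbol} } @chr1_entries;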

Step 5. Output the data

We’re going to check that everything has done okay by simply printing out the entries that we’ve parsed from the file. Again, the code has lots of comments so I won’t go through it line by line.

# note: the following section is going to print out the following:
#
#   1.1    13-5-13          CTRCT8, CCV
#   1.2    25-9-01      ENO1, PPH, MPB1
#   1.3   22-12-87          ERPL1, HLM2
#   ...        ...                  ...
# 24.51    25-8-98        GCY, TSY, STA
# 24.52    20-3-08                DFNY1
# 24.53     8-2-01                  RPY
#
# Number of Genemap Entries: 15037
#

# go through these entries one by one...
foreach my $gm_entry ( @genemap_entries ) {
  # we can use the keys that we defined when creating the HASH
  # to access the values for each entry in a meaningful way
  # note: $gm_entry is a HASH *reference*
  #       to access the data in the HASH: $gm_entry->{ key }
  printf "%5s %10s %20s\n", $gm_entry->{ id }, $gm_entry->{ date_entered }, $gm_entry->{ gene_symbol };
}

print "\n"; # new line
print "Number of Genemap Entries: ", scalar( @genemap_entries ), "\n";
print "\n";

All done.

Here’s the listing of the program in full:


#!/usr/bin/env perl

use strict;
use warnings;

use File::Basename qw/ basename /;

# 'basename' is imported from File::Basename
my $PROGNAME = basename( $0 );

my $USAGE =<<"_USAGE";
usage: $PROGNAME <genemap_file>

Parses OMIM "genemap" file

_USAGE

my $genemap_filename = shift @ARGV or die "$USAGE";

# create a file handle that we can use to input the contents of
# the genemap file
# (and complain if there's a problem)
# note: '<' means we are opening the file for reading
# (like input redirection in the shell)

open( my $genemap_fh, '<', $genemap_filename )
or die "! Error: failed to open file $genemap_filename: $!";

# create an array that will contain our genemap entries
my @genemap_entries;

# go through the file line by line
while( my $line = $genemap_fh->getline ) {

  # an example line from file 'genemap' looks like:
  # 1.1|5|13|13|1pter-p36.13|CTRCT8, CCV|P|Cataract, congenital, Volkmann type||115665|Fd|linked to Rh in Scottish family||Cataract 8, multiple types (2)| | ||

  # the keys for each column are specified in 'genemap.key':
  # 1  - Numbering system, in the format  Chromosome.Map_Entry_Number
  # 2  - Month entered
  # 3  - Day     "
  # 4  - Year    "
  # 5  - Cytogenetic location
  # 6  - Gene Symbol(s)
  # 7  - Gene Status (see below for codes)
  # 8  - Title
  # 9  - Title, cont.
  # 10 - MIM Number
  # 11 - Method (see below for codes)
  # 12 - Comments
  # 13 - Comments, cont.
  # 14 - Disorders (each disorder is followed by its MIM number, if
  #      different from that of the locus, and phenotype mapping method (see
  #      below).  Allelic disorders are separated by a semi-colon.
  # 15 - Disorders, cont.
  # 16 - Disorders, cont.
  # 17 - Mouse correlate
  # 18 - Reference

  # split up the line based on the '|' character
  # note: we use '\|' since writing '|' on its own has a special meaning
  my @cols = split /\|/, $line;

  # create a HASH / associative array to provide labels for these values
  # note: arrays start from '0' so we take one away from the columns mentioned above
  my %genemap_entry = (
    id                 => $cols[0],
    month_entered      => $cols[1],
    day_entered        => $cols[2],
    year_entered       => $cols[3],
    date_entered       => "$cols[2]-$cols[1]-$cols[3]",   # "Day-Month-Year"
    cytogenic_location => $cols[4],
    gene_symbol        => $cols[5],
    # add more labels for the rest of the columns
  );

  # put a *reference* to this HASH onto our growing array of entries
  push @genemap_entries, \%genemap_entry;
}

# note: the following section is going to print out the following:
#
#   1.1    13-5-13          CTRCT8, CCV
#   1.2    25-9-01      ENO1, PPH, MPB1
#   1.3   22-12-87          ERPL1, HLM2
#   ...        ...                  ...
# 24.51    25-8-98        GCY, TSY, STA
# 24.52    20-3-08                DFNY1
# 24.53     8-2-01                  RPY
#
# Number of Genemap Entries: 15037
#

# go through these entries one by one...
foreach my $gm_entry ( @genemap_entries ) {
  # we can use the keys that we defined when creating the HASH
  # to access the values for each entry in a meaningful way
  # note: $gm_entry is a HASH *reference*
  #       to access the data in the HASH: $gm_entry->{ key }
  printf "%5s %10s %20s\n", $gm_entry->{ id }, $gm_entry->{ date_entered }, $gm_entry->{ gene_symbol };
}

# let people know how many entries we've processed
print "\n"; # new line
print "Number of Genemap Entries: ", scalar( @genemap_entries ), "\n";
print "\n";

Prof Christine Orengo elected as member of EMBO

We are very pleased to announce that Prof Christine Orengo has been elected as a member of the European Molecular Biology Organisation (EMBO). EMBO is an organisation that promotes excellence across all aspects of the life sciences through courses, workshops, conferences and publications.

Prof. Orengo was one of 106 “outstanding researchers in the life sciences” who were elected as EMBO members in 2014.

EMBO Director, Maria Leptin, spoke about the strategic decision to expand the scope of the membership and encourage collaborations across traditional scientific divides, “Great leaps in scientific progress often arise when fundamental approaches like molecular biology are applied to previously unconsidered or emerging disciplines. Looking forward, we want to ensure that all communities of the life sciences benefit from this type of cross-pollination.”

Amazon cloud

There’s a nice new tool for using the Amazon Cloud:

http://star.mit.edu/cluster/index.html


The same group also have a PDB viewer:

http://star.mit.edu/biochem/index.html

Making Sense of Genomic Data: Where are we in the function annotation race?


Due to the increasing number of genome-sequencing initiatives worldwide and the falling costs of sequencing, a huge amount of genomic data is now accumulating. In contrast, the functions of only around 1% of these sequences are currently known from experimental studies. The gap between unannotated sequences (those whose function is not known) and annotated sequences will continue to widen, since the experimental functional characterisation of such large amounts of genomic data is not feasible. In order to bridge this widening gap, computational function prediction approaches will be essential.

The information encoded in the genome is translated into proteins, which carry out the biological functions required for the proper functioning of a cell. Proteins are made up of a linear chain of amino acids (determined by the nucleotide sequence of genes) linked together by peptide bonds, and they can be as diverse as the functions they serve. Depending on their amino-acid composition and sequence, proteins fold into their native three-dimensional conformation, which allows them to interact with other proteins or molecules and perform their function. Proteins are often considered the ‘workhorse’ molecules of the cell and they perform diverse functions: as biological catalysts, structural elements or carrier molecules, or in cell signalling and cellular metabolism, amongst others. As a result, in order to better understand the cell at the molecular level and ‘decode’ the available genomic data, it is essential to characterize protein functions.

The functional role of a protein can be studied or described in many different ways: by the molecular function or biochemical activity of the protein, its role in a biological process, or its relatedness to a disease. Hence, the term ‘protein function’ can be very ambiguous unless the context in which the function is described is stated clearly. Free-text protein function descriptions in the literature have been found to be too vague and unspecific to describe protein function accurately; this led to the development of a common, organized protein annotation vocabulary – the Gene Ontology. This is the largest and most widely used resource of protein function annotations and can be used to assign functions to proteins in different contexts, irrespective of the source organism.

The conventional method used to predict protein function is a protein sequence (or structure) homology search to identify similar sequences in a protein sequence (or structure) database, followed by extrapolating from the known functions of the most similar sequence (or structure). This is based on the principle that evolutionarily related proteins with high sequence (or structure) similarity have similar, if not identical, functions. However, this approach is error-prone when the protein in question is substantially different from any protein of known function in the database, when proteins do not follow a simple linear relationship between similarity and function, or when proteins ‘moonlight’.

Typically, function prediction methods are based on the assumption that the more similar the proteins (based on sequence or structure), the more alike their function. However, in reality, in some cases minor variations in protein sequences or structures can lead to a substantial change in molecular function and sometimes very different proteins can perform similar or identical functions.

Moonlighting proteins pose additional challenges for function prediction methods since, without any significant change in their sequence or structure, they are capable of carrying out more than one distinct function depending on where they are localized or on their concentration. Phosphoglucose isomerase is one such ‘moonlighting’ protein: it functions as a cell-metabolism enzyme inside the cell and as a nerve growth factor outside the cell.

In order to predict the functions of uncharacterised genomic sequences, most recent function prediction methods combine several sequence-based and structure-based approaches using machine learning, since most sequences are hard to characterise using any single method. Many methods are available today that provide computational function predictions exploiting different approaches. However, it is essential for experimental biologists to understand whether this vast number of function prediction methods is of any value to them and whether the predictions they make can be relied upon. The Critical Assessment of Function Annotation (CAFA) experiment is one such major bioinformatics initiative; it aims to provide an unbiased, large-scale assessment of protein function prediction methods: to decide which methods perform best and to understand the ability of the whole field to provide function predictions for the colossal amounts of genomic data currently available.

The first CAFA experiment in 2013 was successful in providing an understanding of the performance of existing function prediction methods. At the same time, it also highlighted the major challenges and limitations of automated function prediction for computational biologists, database curators and experimental biologists. One of the main challenges is the importance of having accurate, experimentally determined functions for characterized proteins in databases, since all methods can only make predictions based on the available protein function data. However, significant biases have recently been identified in these databases, due to the increased use of high-throughput experiments, which contribute only very general functions to experimental protein annotations. As a result, almost 25% of the characterized proteins in public databases currently have ‘binding’ listed as their function. Moreover, the experimentally known functions of most proteins are incomplete, as experiments are biased by experimenter choice and annotations are limited by the scope of the experiments. These biases not only affect our understanding of protein function in general but also limit the function predictions that automated methods are able to make from these data. Hence, it is essential for both the developers of automated function prediction methods and the experimental biologists who use computational annotations to guide their experiments to be aware of such biases.

Oracle SQL Developer v 4.0.0 – new GUI for multi-db management

Oracle have released a new version of their GUI database development tool: Oracle SQL Developer v 4.0.0.
This tool is provided free (it’s the various database engine flavours for which Oracle charge) and provides a wealth of features, such as browsing/manipulating database objects, writing/debugging/running SQL queries, managing security, comparing databases and running a large range of reports.

Particularly helpful is the ability to connect to a number of different database engines, in addition to Oracle; so far I have successfully connected to MySQL and PostgreSQL, but any database with compliant JDBC drivers (Java Database Connectivity) should work.

As SQL Developer is a pretty hefty tool, I’ll mention two topics here, which are helpful to have sorted early on:

  1. How to connect to a PostgreSQL database (or other non-Oracle db)
  2. How to avoid JavaVM errors due to heap space

Connecting to a PostgreSQL database

  1. Download the JDBC driver from: http://jdbc.postgresql.org/download.html
    Use this version: JDBC41 Postgresql Driver, Version 9.3-1100, as it is compatible with JVM 1.7 (used by SQL Developer 4).
  2. In SQL Developer 4, link to the JDBC driver via ‘Tools’, ‘Preferences’, ‘Add Entry’ [screenshot: Oracle_connect_1].
  3. Now create a new connection (via ‘File’ / ‘New’ or the green cross in the ‘Connections’ pane). If you have successfully linked the JDBC library, you will have a new tab ‘PostgreSQL’ on the ‘New Connection’ dialogue. Give the connection a name, set your database username & password and you should be ready to go [screenshot: Oracle_connect_2].

Increasing memory available to JavaVM

You need to edit the file ide.conf, which will be under the installation folder:

../../sqldeveloper/ide/bin/ide.conf

Find the following lines (probably at the bottom) and increase the memory available, for example:

# The maximum memory the JVM can take up.
AddVMOption -Xmx2048M

# The initial memory available on startup.
AddVMOption -Xms512M

***

In a future post I hope to look at database migration – there is a comprehensive wizard to allow migration of existing MySQL or PostgreSQL databases to Oracle. When I understand the options and get it working fully, I’ll let you know!

Please note this is an early version (4.0.0.13) of the SQL Developer tool and there may well be odd bugs and quirks until it reaches a more mature release…


migrating home directory files using dolphin

This is one convenient way you can start to move your old home directory files over to new home directories:

1) If you haven’t already got Ian to activate your UCL account, do so now.
2) Log in to your machine as your UCL user.
3) Start dolphin from a terminal:
   dolphin &
4) Right-click on the ‘Places’ column of the dolphin window and click ‘Add Entry’.
5) In the resulting pop-up window, put in the remote location of your old home directory, e.g.
   Label: old_home
   Location: sftp://username@bsmlx**/home/bsm3/lees
   (substitute username and ** as appropriate)
6) Click ‘Split’ at the top of the dolphin window to give you a split screen, then click on your new home directory in the ‘Places’ column and start copying things over.
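
If you’d rather do the copying from the command line instead of dolphin, the same sort of thing can be done with rsync over ssh (substitute username and ** as above; ‘~/old_home_copy’ is just an example target directory):

rsync -av username@bsmlx**:/home/bsm3/lees/ ~/old_home_copy/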

Pre-processing metagenomic data

A vital part of metagenome analysis is first ensuring that you are working with good quality data. During my PhD I have used the tools PRINSEQ and DeconSeq (Schmieder and Edwards, 2011) to process and filter metagenomic data sequenced using the 454 pyrosequencing technology.

PRINSEQ generates general statistics, which can also indicate data quality. Data can be filtered based upon a number of criteria, including: the removal of duplicate sequences, the removal of sequences containing more than a given percentage of ambiguous (i.e. “N”) bases, and the removal of sequences whose mean quality score falls below a given threshold. Sequences can be trimmed to a defined length, to exclude poor-quality bases, and to remove poly-A/T tails. Finally, your output data can be re-formatted between FASTA+QUAL and FASTQ formats, converted between DNA and RNA sequences, and sequence IDs can be changed.

DeconSeq is typically used after PRINSEQ and involves the removal of ‘contaminant’ sequences, i.e. sequences that belong to species you do not wish to analyse. The data are scanned against a reference database of the contaminant species of choice (using BWA-SW alignments) and any matching sequences are removed.

Both of these tools are free to use and are available either online or through a standalone version.
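
As a very rough illustration, a first pass over a FASTQ file might look something like the commands below. The option names are from memory of the PRINSEQ-lite and DeconSeq standalone scripts and the thresholds are arbitrary examples, so check each tool’s help output before running anything:

# PRINSEQ: remove exact duplicates, reads with >2% Ns and reads with low mean quality
perl prinseq-lite.pl -fastq reads.fastq -derep 1 -ns_max_p 2 -min_qual_mean 20 -out_good reads_filtered

# DeconSeq: screen the filtered reads against a contaminant database
# ("hsref" stands in for whatever reference database you have configured)
perl deconseq.pl -f reads_filtered.fastq -dbs hsref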

-

Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]

Coping with millions of small files: appending to a tar archive

Most file systems will struggle when you read and write millions of tiny files. Exactly how much a particular file system will struggle depends on a bunch of factors: the type of file system (ext3, GPFS, etc.), hardware configuration (RAM / networking bottlenecks) and so on. However, the take-home message is that storing many millions of tiny files on standard file systems (especially network file systems) is going to cause performance problems.

We recently came across this when performing millions of structural comparisons (with SSAP). Each comparison results in a small file, so running 25 million of these comparisons on the HPC caused a number of problems with storage.

The solution? Well, as always, there are lots.

You could store the file contents in a database (rather than a file system). The downside is that this brings the overhead of running a database: dependencies, concurrency issues when accessing it from many places at the same time, and generally more tools in the chain to go wrong.

Since we’re running this in an HPC environment, we’ll want a solution that is simple, scales well, and requires few dependencies. A possible alternative would be to store all these small files in a single large archive.

Looking at the tar man page, we can create a new tar archive with the ‘-c’ flag:

$ tar -cvf my_archive.tar existing_file

and we can append extra files to that archive with the ‘-r’ flag:

$ tar -rvf my_archive.tar extra_file1
$ tar -rvf my_archive.tar extra_file2

You can then list the contents of that archive with the ‘-t’ flag (note we still need ‘-f’ to name the archive file):

$ tar -tf my_archive.tar
existing_file
extra_file1
extra_file2

So, looking at a real example…

Let’s say we split our job of 25 million pairwise comparisons into 2500 lists, each containing 10000 pairs:

$ split --lines 10000 -a 4 -d ssap_pairs.all ssap_pairs.

That will result in 2500 files, each containing 10000 lines (called ssap_pairs.0000, ssap_pairs.0001, …).

We can then submit a job array to the scheduler so that a single script can process each of these input files.

$ qsub -t 1-2500 ssap.submit.pbs

Rather than post the full contents of the script ‘ssap.submit.pbs’ here, I’ll just focus on the loop that creates these 10000 small alignment files:

# name the archive file (gzip will add a '.gz' suffix at the end)
ALN_TAR_FILE=ssap_alignments.`printf "%04d" $SGE_TASK_ID`.tar

# create the archive (start the archive off with an existing file)
tar -cvf $ALN_TAR_FILE ssap_pairs.all

while read pairs; do
    # name of the alignment file that SSAP will produce
    # (e.g. 1abcA001defA00.list)
    aln_file=`echo "$pairs" | tr -d " "`.list

    # run SSAP
    $BIN_DIR/SSAP $pairs >> ssap_results.out

    # append alignment file to archive
    tar -rvf $ALN_TAR_FILE $aln_file

    # remove alignment file
    \rm $aln_file

done < $PAIRS_FILE

# compress then copy the archive back to the original working directory
gzip $ALN_TAR_FILE 
cp $ALN_TAR_FILE.gz ${SGE_O_WORKDIR}

Job done.
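
If you later need one particular alignment back, you can pull it straight out of the compressed archive without unpacking the whole thing; for example, using the hypothetical alignment filename from the comments above:

$ tar -xzf ssap_alignments.0001.tar.gz 1abcA001defA00.list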