Perl for Bioinformatics: Day 2 – querying a database (DBI)

So in Day 1, we learned how to use Perl to parse a file. Today we are going to learn how to extract information from a database.

A database is an organised collection of data. Since lots of Bioinformatics resources store their data in a database, it’s pretty useful to find out early on how to go about using them.

There are lots of different types of databases (e.g. MySQL, PostgreSQL, Oracle) and each of them has slight differences in the way that you interect with them. To make life easier, the good people of Perl have written a library called DBI that provides a common way of accessing them (feel free to have a good look around the DBI documentation on CPAN and come back when you’re ready).

Accessing a database with DBI

The following script provides a very simple example of how you might go about using DBI libary to extract data from your database. We are extracting OMIM data from one of our local Oracle databases, but you should be able to see how it can be extended to your own situation.

Note: you’ll need to ask your database administrator for suitable values to replace ‘??????’

#!/usr/bin/env perl

use strict;
use warnings;

use DBI;

# information that we need to specify to connect to the database
my $dsn         = "dbi:Oracle:host=?????;sid=?????";  # what type of database (Oracle) and where to find it (sinatra)
my $db_username = "?????";                            # we connect as a particular user
my $db_password = "?????";                            # with a password

# connect to the database
my $gene3d_dbh = DBI->connect( $dsn, $db_username, $db_password )
	or die "! Error: failed to connect to database";

# this is the query that will get us the data
my $omim_sql = <<"_SQL_";
SELECT
	OMIM_ID, UNIPROT_ACC, RESIDUE_POSITION, NATIVE_AA, MUTANT_AA, VALID, DESCRIPTION, NATIVE_AA_SHORT
FROM
	gene3d_12.omim
WHERE
	valid = 't'
_SQL_

# prepare the SQL (returns a "statement handle")
my $omim_sth = $gene3d_dbh->prepare( $omim_sql )
	or die "! Error: encountered an error when preparing SQL statement:\n"
		. "ERROR: " . $gene3d_dbh->errstr . "\n"
		. "SQL:   " . $omim_sql . "\n";

# execute the SQL
$omim_sth->execute
	or die "! Error: encountered an error when executing SQL statement:\n"
		. "ERROR: " . $omim_sth->errstr . "\n"
		. "SQL:   " . $omim_sql . "\n";

# go through each row
while ( my $omim_row = $omim_sth->fetchrow_hashref ) {
	printf "%-10s %-10s %-10s %-10s %-10s %s\n",
		$omim_row->{OMIM_ID},
		$omim_row->{UNIPROT_ACC},
		$omim_row->{RESIDUE_POSITION},
		$omim_row->{MUTANT_AA},
		$omim_row->{NATIVE_AA},
		$omim_row->{DESCRIPTION}
		;
}

This prints out:

100650     P05091     504        LYS        GLU        ALCOHOL SENSITIVITY - ACUTE ALCOHOL DEPENDENCE - PROTECTION AGAINST - INCLUDED;; HANGOVER - SUSCEPTIBILITY TO - INCLUDED;; SUBLINGUAL NITROGLYCERIN - SUSCEPTIBILITY TO POOR RESPONSE TO - INCLUDED;; ESOPHAGEAL CANCER - ALCOHOL-RELATED - SUSCEPTIBILITY TO - INCLUDED ALDH2 - GLU504LYS (dbSNP rs671)
100690     P02708     262        LYS        ASN        MYASTHENIC SYNDROME - CONGENITAL - SLOW-CHANNEL CHRNA1 - ASN217LYS
100690     P02708     201        MET        VAL        MYASTHENIC SYNDROME - CONGENITAL - SLOW-CHANNEL CHRNA1 - VAL156MET
...

Improvements

The first thing to notice was that this was quite a lot of typing: writing out the SQL, setting up database handles/statement handles, checking return values, printing out decent error messages, etc. Lots of typing means lots of code to maintain and far more chance of repeating yourself (which you really shouldn’t be doing).

When faced with the prospect of lots of typing, any decent (i.e. lazy) programmer will be instantly thinking about how they can avoid it: what shortcuts they can make, what libraries they can reuse. As luck would have it the good people of Perl have already thought of this and come up with DBIx::Class which will be the basis of a future post.

Discussion

There is a lot of value in understanding how raw DBI works. However, when you start writing and maintaining your own code, there is a huge amount of value in using a library (such as DBIx::Class) that builds on DBI and helps to keep you away from intereacting with DBI directly.