Perl for Bioinformatics: Day 2 – querying a database (DBI)

So in Day 1, we learned how to use Perl to parse a file. Today we are going to learn how to extract information from a database.

A database is an organised collection of data. Since lots of Bioinformatics resources store their data in a database, it’s pretty useful to find out early on how to go about using them.

There are lots of different types of databases (e.g. MySQL, PostgreSQL, Oracle) and each of them has slight differences in the way that you interect with them. To make life easier, the good people of Perl have written a library called DBI that provides a common way of accessing them (feel free to have a good look around the DBI documentation on CPAN and come back when you’re ready).

Accessing a database with DBI

The following script provides a very simple example of how you might go about using DBI libary to extract data from your database. We are extracting OMIM data from one of our local Oracle databases, but you should be able to see how it can be extended to your own situation.

Note: you’ll need to ask your database administrator for suitable values to replace ‘??????’

#!/usr/bin/env perl

use strict;
use warnings;

use DBI;

# information that we need to specify to connect to the database
my $dsn         = "dbi:Oracle:host=?????;sid=?????";  # what type of database (Oracle) and where to find it (sinatra)
my $db_username = "?????";                            # we connect as a particular user
my $db_password = "?????";                            # with a password

# connect to the database
my $gene3d_dbh = DBI->connect( $dsn, $db_username, $db_password )
	or die "! Error: failed to connect to database";

# this is the query that will get us the data
my $omim_sql = <<"_SQL_";
SELECT
	OMIM_ID, UNIPROT_ACC, RESIDUE_POSITION, NATIVE_AA, MUTANT_AA, VALID, DESCRIPTION, NATIVE_AA_SHORT
FROM
	gene3d_12.omim
WHERE
	valid = 't'
_SQL_

# prepare the SQL (returns a "statement handle")
my $omim_sth = $gene3d_dbh->prepare( $omim_sql )
	or die "! Error: encountered an error when preparing SQL statement:\n"
		. "ERROR: " . $gene3d_dbh->errstr . "\n"
		. "SQL:   " . $omim_sql . "\n";

# execute the SQL
$omim_sth->execute
	or die "! Error: encountered an error when executing SQL statement:\n"
		. "ERROR: " . $omim_sth->errstr . "\n"
		. "SQL:   " . $omim_sql . "\n";

# go through each row
while ( my $omim_row = $omim_sth->fetchrow_hashref ) {
	printf "%-10s %-10s %-10s %-10s %-10s %s\n",
		$omim_row->{OMIM_ID},
		$omim_row->{UNIPROT_ACC},
		$omim_row->{RESIDUE_POSITION},
		$omim_row->{MUTANT_AA},
		$omim_row->{NATIVE_AA},
		$omim_row->{DESCRIPTION}
		;
}

This prints out:

100650     P05091     504        LYS        GLU        ALCOHOL SENSITIVITY - ACUTE ALCOHOL DEPENDENCE - PROTECTION AGAINST - INCLUDED;; HANGOVER - SUSCEPTIBILITY TO - INCLUDED;; SUBLINGUAL NITROGLYCERIN - SUSCEPTIBILITY TO POOR RESPONSE TO - INCLUDED;; ESOPHAGEAL CANCER - ALCOHOL-RELATED - SUSCEPTIBILITY TO - INCLUDED ALDH2 - GLU504LYS (dbSNP rs671)
100690     P02708     262        LYS        ASN        MYASTHENIC SYNDROME - CONGENITAL - SLOW-CHANNEL CHRNA1 - ASN217LYS
100690     P02708     201        MET        VAL        MYASTHENIC SYNDROME - CONGENITAL - SLOW-CHANNEL CHRNA1 - VAL156MET
...

Improvements

The first thing to notice was that this was quite a lot of typing: writing out the SQL, setting up database handles/statement handles, checking return values, printing out decent error messages, etc. Lots of typing means lots of code to maintain and far more chance of repeating yourself (which you really shouldn’t be doing).

When faced with the prospect of lots of typing, any decent (i.e. lazy) programmer will be instantly thinking about how they can avoid it: what shortcuts they can make, what libraries they can reuse. As luck would have it the good people of Perl have already thought of this and come up with DBIx::Class which will be the basis of a future post.

Discussion

There is a lot of value in understanding how raw DBI works. However, when you start writing and maintaining your own code, there is a huge amount of value in using a library (such as DBIx::Class) that builds on DBI and helps to keep you away from intereacting with DBI directly.

Oracle SQL Developer v 4.0.0 – new GUI for multi-db management

Oracle have released a new version of their GUI database development tool: Oracle SQL Developer v 4.0.0.
This tool is provided free (it’s the various database engine flavours for which Oracle charge) and provides a wealth of features, such as browsing/manipulating database objects, writing/debugging/running SQL queries, managing security, comparing databases and running a large range of reports.

Particularly helpful is the ability to connect to a number of different database engines, in addition to Oracle; so far I have successfully connected to MySQL and PostgreSQL, but any database with compliant JDBC drivers (Java Database Connectivity) should work.

As SQL Developer is a pretty hefty tool, I’ll mention two topics here, which are helpful to have sorted early on:

  1. How to connect to a PostgreSQL database (or other non-Oracle db)
  2. How to avoid JavaVM errors due to heap space

Connecting to a PostgreSQL database

  1. Download the JDBC driver from: http://jdbc.postgresql.org/download.html
    Use this version: JDBC41 Postgresql Driver, Version 9.3-1100 as it is compatible with JVM 1.7, used in SQL Developer 4.
  2. In SQL Developer 4, link to the JDBC driver via ‘Tools’, ‘Preferences’, ‘Add Entry’ as below:Oracle_connect_1
  3.  Now create a new connection (via ‘File’ / ‘New’ or the green cross in the ‘Connections’ pane).  If you have successfully linked  the JDBC library, you will have a new tab ‘PostgreSQL’ on the ‘New Connection’ dialogue.  Give the connection a name, set your database username & password and you should be ready to go…Oracle_connect_2

Increasing memory available to JavaVM

You need to edit the file ide.conf – this will be under the installation folder:

../../sqldeveloper/ide/bin/ide.conf

Find the following lines (probably at the bottom) and increase the memory available, for example:

// The maximum memory the JVM can take up.
AddVMOption -Xmx2048M  

// The initial memory available on startup.
AddVMOption -Xms512M 

***

In a future post I hope to look at database migration – there is a comprehensive wizard to allow migration of existing MySQL or PostgreSQL databases to Oracle. When I understand the options and get it working fully, I’ll let you know!

Please note this is an early version (4.0.0.13) of the SQL Developer tool and there may well be odd bugs and quirks until it reaches a more mature release…