## Oracle SQL Developer v 4.0.0 – new GUI for multi-db management

Oracle have released a new version of their GUI database development tool: Oracle SQL Developer v 4.0.0.
This tool is provided free (it’s the various database engine flavours for which Oracle charge) and provides a wealth of features, such as browsing/manipulating database objects, writing/debugging/running SQL queries, managing security, comparing databases and running a large range of reports.

Particularly helpful is the ability to connect to a number of different database engines in addition to Oracle; so far I have successfully connected to MySQL and PostgreSQL, but any database with a compliant JDBC (Java Database Connectivity) driver should work.

As SQL Developer is a pretty hefty tool, I’ll mention just two topics here, both of which are helpful to have sorted out early on:

1. How to connect to a PostgreSQL database (or other non-Oracle db)
2. How to avoid JavaVM errors due to heap space

### Connecting to a PostgreSQL database

1. Download the PostgreSQL JDBC driver. Use the JDBC41 driver, version 9.3-1100, as it is compatible with JVM 1.7, which SQL Developer 4 uses.
2. In SQL Developer 4, register the JDBC driver via ‘Tools’ → ‘Preferences’ → ‘Database’ → ‘Third Party JDBC Drivers’ → ‘Add Entry’, as below:
3. Now create a new connection (via ‘File’ → ‘New’, or the green cross in the ‘Connections’ pane). If you have successfully linked the JDBC library, the ‘New Connection’ dialogue will have a new ‘PostgreSQL’ tab. Give the connection a name, set your database username and password, and you should be ready to go…
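Under the hood, the connection SQL Developer builds is a standard PostgreSQL JDBC URL. For reference, it has the following shape (the hostname and database name here are placeholders; 5432 is PostgreSQL’s default port):

```
jdbc:postgresql://dbhost.example.com:5432/mydatabase
```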

### Increasing memory available to the JavaVM

You need to edit the file ide.conf, which lives under the installation folder:

../../sqldeveloper/ide/bin/ide.conf

Find the following lines (probably at the bottom) and increase the memory available. The values below are examples; pick sizes to suit your machine:

```
# The maximum memory the JVM can take up.
AddVMOption -Xmx2048M

# The initial memory available on startup.
AddVMOption -Xms256M
```

***

In a future post I hope to look at database migration – there is a comprehensive wizard to allow migration of existing MySQL or PostgreSQL databases to Oracle. When I understand the options and get it working fully, I’ll let you know!

Please note this is an early version (4.0.0.13) of the SQL Developer tool and there may well be odd bugs and quirks until it reaches a more mature release…

## Migrating home directory files using Dolphin

This is one convenient way to start moving your old home directory files over to your new home directory:

1) If you haven’t yet asked Ian to activate your UCL account, do so.
2) Start Dolphin:
   dolphin &
3) Right-click on the ‘Places’ column of the Dolphin window.
4) In the resulting pop-up window, enter the remote location of your home directory, e.g.
   label: old_home
   (substitute your username and ** as appropriate)
5) Click ‘Split’ at the top of the Dolphin window to give you a split screen, then click on your new home directory in the ‘Places’ column and start copying things over.
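The ‘remote location’ entered above is a standard KDE (KIO) URL; Dolphin understands sftp:// (and fish://) URLs of this shape, where the hostname and paths below are placeholders for your own:

```
sftp://username@old_server.example.ac.uk/home/username
```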

## Pre-processing metagenomic data

A vital part of metagenome analysis is first ensuring that you are working with good quality data. During my PhD I have used the tools PRINSEQ and DeconSeq (Schmieder and Edwards, 2011) to process and filter metagenomic data sequenced using the 454 pyrosequencing technology.

PRINSEQ generates general statistics, which can also indicate data quality. Data can be filtered on a number of criteria, including: removal of duplicate sequences, removal of sequences containing more than x% ambiguous (i.e. “N”) bases, and removal of sequences with a mean quality score below x. Sequences can be trimmed to a defined length, trimmed to exclude poor-quality bases, and trimmed of poly-A/T tails. Finally, output data can be re-formatted between FASTA+QUAL and FASTQ formats and between DNA and RNA sequences, and sequence IDs can be changed.

DeconSeq is typically used after PRINSEQ and removes ‘contaminant’ sequences, i.e. sequences belonging to species you do not wish to analyse. Data are aligned (using BWA-SW) against reference databases of the chosen contaminant species, and any matching sequences are removed.

Both of these tools are free to use and are available either through a web interface or as standalone versions.

Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]

## Coping with millions of small files: appending to a tar archive

Most file systems will struggle when you read and write millions of tiny files. Exactly how much a particular file system struggles depends on a number of factors: the file system format (ext3, GPFS, etc.), the hardware configuration (RAM, networking bottlenecks) and so on. The take-home message, though, is that storing many millions of tiny files on standard file systems (especially network file systems) is going to cause performance problems.

We recently came across this when performing millions of structural comparisons (with SSAP). Each comparison results in a small file, so running 25 million of these comparisons on the HPC caused a number of problems with storage.

The solution? Well, as always, there are lots.

You could store the file contents in a database (rather than a file system). The downside is that this brings the overhead of running the database: extra dependencies, concurrency issues when accessing it from many places at the same time, and generally more tools in the chain to go wrong.

Since we’re running this in an HPC environment, we’ll want a solution that is simple, scales well, and requires few dependencies. A possible alternative would be to store all these small files in a single large archive.

Looking at the tar man page, we can create a new tar archive with the ‘-c’ flag:

```
$ tar -cvf my_archive.tar existing_file
```

and we can append extra files to that archive with the ‘-r’ flag:

```
$ tar -rvf my_archive.tar extra_file1
$ tar -rvf my_archive.tar extra_file2
```

You can then list the contents of that archive with the ‘-t’ flag (which, like the others, needs ‘-f’ to name the archive file):

```
$ tar -tf my_archive.tar
existing_file
extra_file1
extra_file2
```

So, looking at a real example…

Let’s say we split our job of 25 million pairwise comparisons into 2500 lists, each containing 10000 pairs:

```
$ split --lines 10000 -a 4 -d ssap_pairs.all ssap_pairs.
```

That will result in 2500 files, each containing 10000 lines (called ssap_pairs.0000, ssap_pairs.0001, …). We can then submit a job array to the scheduler so that a single script can process each of these input files:

```
$ qsub -t 1-2500 ssap.submit.pbs
```
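As a sanity check, here is a toy version of the same split that you can run anywhere (all file names are placeholders). One caveat worth checking in your own setup: with ‘-d’, the numeric suffixes start at 0000, while ‘qsub -t’ task IDs start at 1, so a submit script that maps $SGE_TASK_ID straight to a four-digit suffix may skip the first chunk.

```shell
# toy version of the split above: 10 lines into chunks of 3
cd "$(mktemp -d)"
seq 1 10 > pairs.all
split --lines 3 -a 4 -d pairs.all pairs.
ls pairs.0*    # chunks pairs.0000 .. pairs.0003 (3 + 3 + 3 + 1 lines)
```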

Rather than post the full contents of the script ‘ssap.submit.pbs’ here, I’ll just focus on the loop that creates these 10000 small alignment files:

```
# name the archive file
ALN_TAR_FILE=ssap_alignments.$(printf "%04d" $SGE_TASK_ID).tar

# create the archive (start the archive off with an existing file)
tar -cvf $ALN_TAR_FILE ssap_pairs.all

while read pairs; do

    # name of the alignment file that SSAP will produce
    # (e.g. 1abcA001defA00.list)
    aln_file=$(echo "$pairs" | tr -d " ").list

    # run SSAP
    $BIN_DIR/SSAP $pairs >> ssap_results.out

    # append the alignment file to the archive
    tar -rvf $ALN_TAR_FILE $aln_file

    # remove the alignment file
    \rm $aln_file

done < $PAIRS_FILE

# compress, then copy the archive back to the original working directory
gzip $ALN_TAR_FILE
cp $ALN_TAR_FILE.gz ${SGE_O_WORKDIR}
```


Job done.
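To make the create/append/compress cycle concrete, here is a self-contained sketch with throwaway files (the file names are placeholders, not real SSAP output). Note that tar can only append (‘-r’) to an uncompressed archive, which is why the gzip happens once, at the end:

```shell
# throwaway demo of create -> append -> compress -> selective extract
cd "$(mktemp -d)"
echo "alignment one" > 1abcA00.list
echo "alignment two" > 1defA00.list

tar -cf alignments.tar 1abcA00.list   # create the archive
tar -rf alignments.tar 1defA00.list   # append another member
gzip alignments.tar                   # compress once, at the end

# later, pull a single member straight to stdout without unpacking the rest
tar -xOzf alignments.tar.gz 1defA00.list
```

A nice side effect of this layout is that you never need to unpack the whole archive: individual alignments can be extracted on demand, as in the last line above.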

## Finding GO semantic similarity

One common problem in bioinformatics is how to compare the similarity of the GO (Gene Ontology) term assignments of two or more proteins.

There are quite a few packages out there for this. I recently came across this one:

https://sites.google.com/site/fastsemsim/

It’s being actively developed and has a particularly large selection of different semantic similarity measures.