Varnishing all my troubles away

TL;DR

  • Setting up a varnish server on a brand new machine
  • Web page shows: “Error 503 Backend fetch failed”
  • How to investigate the problem (traced back to SELinux)
  • How to actually fix specific SELinux issue (rather than just turn SELinux off)

Varnish?

For our research work at UCL, we host a bunch of different web sites, web services and applications that run on a bunch of different ports on a bunch of different backend machines (and virtual machines). All external web requests arrive on a single IP, and we use varnish to sit on the frontline (port 80) and marshal all the incoming traffic to and from the appropriate backend server.

Varnish is a web accelerator – it sits in front of whatever is actually generating the content for your web pages and caches whatever content it deems safe to cache. The next time someone requests that same page, the content is served from the cache (fast) rather than going off and generating content from the backend (slow). So it’s often used to speed up web pages and generally reduce load on your backend databases and applications.

This is all great, but varnish also provides a really simple and flexible tool for routing HTTP traffic to different backends (which is actually the point of this post).

What’s the problem?

I eventually got round to moving our frontline varnish server from a decaying machine running CentOS 4(!) to a brand new VM running CentOS 7. This allowed varnish to be upgraded from v2.0 to v4.1.

All good.

I did get stuck with one app that wasn’t working though – the following bit of varnish config was meant to direct traffic through to a backend application, running on a backend server, listening on port 5001.

vcl 4.0;

backend my_app_live {
  .host = "xxx.xxx.xxx.xxx";
  .port = "5001";
}

sub vcl_recv {
  if ( req.http.host == "myapp.domain.com" ) {
    set req.backend_hint = my_app_live;
    return (pass);
  }
}

It was working fine on the old server, but directing my browser to the web address “myapp.domain.com” just gave me the standard Varnish error:

Error 503 Backend fetch failed

Unsurprisingly (given the wording), this error usually means that varnish has sent a web request through to a backend but hasn’t had any response back from the server. What to do next?

Well let’s start from the backend server and work our way back to varnish…

Can I get the expected web page by sitting on the backend server and contacting the application via the local port?

$ ssh backendserver
$ curl -I http://localhost:5001/ 
HTTP/1.1 200 OK
Date: Fri, 03 Nov 2016 19:42:07 GMT
Server: Apache/2.2.3 (CentOS)
Content-Length: 5504
Connection: close
Content-Type: text/html; charset=utf-8

Yes.

Can I get the expected web page by sitting on the varnish server and contacting the application via a remote port?

$ ssh varnishserver
$ curl -I http://backendserver:5001/ 
HTTP/1.1 200 OK
Date: Fri, 03 Nov 2016 19:45:02 GMT
Server: Apache/2.2.3 (CentOS)
Content-Length: 5504
Connection: close
Content-Type: text/html; charset=utf-8

Yes.

So, what do we know so far…

  • The application seemed to be running fine on the backend server
  • I could retrieve the content directly from the application port (curl)
  • I couldn’t retrieve this content through varnish
  • I could also see that lots of other varnish rerouting was working fine
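
It’s also worth asking varnish itself what it thinks is going on. Something like the following should dump the full transaction log for any request that ends up as a 503 (the varnishlog query syntax arrived with version 4, so treat this as a sketch and check the man page for your version):

$ ssh varnishserver
$ sudo varnishlog -g request -q 'RespStatus == 503'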

GIYF

Googling around for issues associated with “varnish” and “503” suggested that the problem might be down to security settings in SELinux, which took me to a nice blog post about how to get varnish to play nicely with SELinux.

I should be honest here – for a very long time I considered any problems associated with “SELinux” to have an incredibly strong SEP field. On encountering these problems, my general practice had been to put SELinux into permissive mode and rely on our main firewall to deal with security issues (i.e. SEP). As it turns out this practice wasn’t as terrible as it sounds (I checked this with our IT team and they were okay with it). However, turning off security on a brand new externally-facing server left a hacky taste in the mouth.

I figured I should actually do the right thing and learn how to play nicely with SELinux. Turns out it really wasn’t that hard.

Q: Is the problem I’m experiencing related to SELinux?

Good question. It turns out a pretty simple way to find out is to look for a sensible term (e.g. “varnish”) in the audit log:

$ ssh varnishserver
$ sudo grep varnish /var/log/audit/audit.log

This turned up a bunch of lines:

type=AVC msg=audit(1478175339.950:37802): avc: denied { name_connect } for pid=9111 comm="varnishd" dest=5001 scontext=system_u:system_r:varnishd_t:s0 tcontext=system_u:object_r:commplex_link_port_t:s0 tclass=tcp_socket

So, yes – the words “varnish”, “denied” and “dest=5001” definitely did suggest my problem was related to SELinux permissions.
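
To be a bit more confident before changing anything, a couple of standard SELinux commands will confirm the mode the machine is running in and show how the destination port has been labelled (note: semanage comes with the policycoreutils-python package that we install in the next section):

# is SELinux actually enforcing its policy on this box?
$ getenforce
Enforcing

# how has SELinux labelled port 5001?
# (the type should match the 'tcontext' in the audit message above)
$ sudo semanage port -l | grep -w 5001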

Q: How do I fix my SELinux problem (without just turning the whole thing off)?

Turns out the clever people on the interwebz have written a tool, audit2allow, to help troubleshoot exactly this kind of thing. On CentOS 7 it comes as part of the policycoreutils-python package:

$ sudo yum install policycoreutils-python

This tool can be used to translate the output of the audit log to a more useful message:

$ sudo grep varnishd /var/log/audit/audit.log | audit2allow -w

Which provides messages like:

type=AVC msg=audit(1478177584.127:38275): avc: denied { name_connect } for pid=9118 comm="varnishd" dest=5001 scontext=system_u:system_r:varnishd_t:s0 tcontext=system_u:object_r:commplex_link_port_t:s0 tclass=tcp_socket
 Was caused by:
 The boolean varnishd_connect_any was set incorrectly. 
 Description:
 Allow varnishd to connect any

Allow access by executing:
 # setsebool -P varnishd_connect_any 1

So, this was not only telling me in plain text what caused this error, but also how to “fix” it (tell SELinux that this behaviour was fine).
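
As an aside – if there hadn’t been a handy boolean to flip, audit2allow can also generate a custom policy module from those same audit lines. A quick sketch (the module name ‘varnishlocal’ is made up for this example):

# generate a local policy module from the denial messages...
$ sudo grep varnishd /var/log/audit/audit.log | audit2allow -M varnishlocal

# ...and install it
$ sudo semodule -i varnishlocal.pp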

Now of course I read up on exactly what this command was going to do before executing it (no, really I did).

$ sudo setsebool -P varnishd_connect_any 1
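
The ‘-P’ flag makes the change persistent across reboots, and getsebool confirms that it has stuck:

$ getsebool varnishd_connect_any
varnishd_connect_any --> on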

So we should just need to restart the varnish server…

$ sudo systemctl restart varnish

check the page again…

$ curl -I http://myapp.domain.com/
HTTP/1.1 200 OK

Sorted.

Now I just need to add all this to the puppet configuration…

Perl for Bioinformatics: Day 1 – parsing a file

You don’t have to hang around too long in a Bioinformatics lab before someone asks you to parse data from a <insert your favourite data format here> file. Since we’ve just had some people join the lab who are new to coding, parsing a file seemed a good place to start.

The following is intended as a “Day 1” introduction to a typical Bioinformatics task in Perl.

Caveats

Some things to take into account before we start:

  1. It’s very likely that somebody, somewhere has already written a parser for your favourite data format. It’s also likely that they’ve already gone through the pain of dealing with edge cases that you aren’t aware of. You should really consider using their code or at least looking at how it works. If you’re writing in Perl (and in this case, we are) then you should have a rummage around CPAN (http://www.cpan.org) and BioPerl (http://www.bioperl.org).
  2. The following script is not intended as an example of “best practice” code – the intention here is to keep things simple and readable.

Getting the data

Okay so it’s our first day and we’ve just been asked to do the following:

Parse “genemap” data from OMIM

Err.. genemap? OMIM? If in doubt, the answer is nearly always the same: Google Is Your Friend.

Googling “download OMIM” gets us what we want. Now we just have to read the instructions, follow the instructions, fill in the forms, point a web browser at the link that gets sent in an email, and download the data.

If you get stuck, don’t be afraid to ask – either the person sitting next to you or by emailing the “contact” section of the website you’re using. However, also remember that you are here to do research – and a lot of that comes down to rummaging around, trying to figure stuff out for yourself.

It’s really useful to keep things tidy, so we’re going to create a local directory for this project by typing the following into a terminal (note: lines that start with ‘#’ are comments, and anything after the ‘>’ prompt is a Linux command).

# go to my home directory
> cd ~/

# create a directory that we're going to work from
> mkdir omim_project

# move into to this new directory
> cd omim_project

# create a directory for the data 
# note: the date we downloaded the data will definitely be useful to know
> mkdir omim_data.2014_09_16

# look for the files we've just downloaded
> ls -rt ~/Downloads

# copy the ones we want into our data directory
> cp ~/Downloads/genemap ./omim_data.2014_09_16

Step 1. Setting up the script

Now we can write our first Perl script, which is going to parse this file – i.e. extract the data from the text file, organise it into a meaningful structure, and output the information we need.

There are loads of different text editors you can use – I’m assuming you have access to ‘kate’.

# open up 'kate' with a new file for our script called 'parse_genemap.pl'
> kate parse_genemap.pl

Here’s the first bit of code – we’ll go through it line by line.

#!/usr/bin/env perl

use strict;
use warnings;

use File::Basename qw/ basename /;

# 'basename' is imported from File::Basename
my $PROGNAME = basename( $0 );

my $USAGE =<<"_USAGE";
usage: $PROGNAME <genemap_file>

Parses OMIM "genemap" file

_USAGE

my $genemap_filename = shift @ARGV or die "$USAGE";

Line 1 (called the ‘shebang’, or sometimes ‘hashbang’) tells the operating system that we want this file to be run as a Perl script.

#!/usr/bin/env perl

The next commands make sure that we find out straight away if we’ve made any mistakes in our code. It’s generally a good thing for our programs to “die early and loudly” as soon as a problem happens. This makes debugging much easier when things get more complicated.

use strict;
use warnings;

The following command imports a function ‘basename’ that we’ll use to get the name of the current script.

use File::Basename qw/ basename /;

Note: you can find out lots more about what a module does by entering the following into a terminal:

perldoc File::Basename

Perl puts lots of useful information into special variables. To get the full path of the script we are currently running, we can use ‘$0’.

This is what Perl’s documentation pages have to say about it:

$0
Contains the name of the program being executed.

Feeding this into ‘basename’ will take the directory path off the script and just leave us with the script name (i.e. ‘parse_genemap.pl’). This is handy when we want to provide a simple note on how this script should be run.

# 'basename' is imported from File::Basename
my $PROGNAME = basename( $0 );

my $USAGE =<<"_USAGE";
usage: $PROGNAME <genemap_file>

Parses OMIM "genemap" file

_USAGE

Step 2. Gather data from the command line

We’ve set this program up to take a single argument on the command line which will be the location of the ‘genemap’ file to parse. This gives us some flexibility if we want to parse different genemap files, or if the genemap files are likely to move around in the file system.

The arguments on the command line are stored in another special variable called ‘@ARGV’. The ‘@’ symbol means this is an array (or set of values) rather than a single value. We’ll use the built-in function ‘shift’ to get the first command line argument from that list.

my $genemap_filename = shift @ARGV or die "$USAGE";

If the list is empty then it means we’ve run the script without any arguments. If this happens, we want to end the program with a useful message about what the script does and how it should be run.
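
So running the script with no arguments should end with the usage message from above printed to the screen:

> perl parse_genemap.pl
usage: parse_genemap.pl <genemap_file>

Parses OMIM "genemap" file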

Step 3. Reading the data

The following creates a “file handle” that can be used for reading from (or writing to) a file. There are lots of ways of creating file handles in Perl (I suggest looking at ‘Path::Class’).

# create a file handle that we can use to read in the contents of
# the genemap file
# (and complain if there's a problem)
# note: '<' means "open this file for reading" - much like input
# redirection in a linux shell

open( my $genemap_fh, '<', $genemap_filename )
  or die "! Error: failed to open file $genemap_filename: $!";

Again, if there’s a problem (e.g. the file we are given doesn’t exist) then we want to know about it straight away with a sensible error message.

Now we are going to read the file line-by-line and create a data structure for each row. Most of the following code is just made up of comments.


# create an array that will contain our genemap entries
my @genemap_entries;

# go through the file line by line
while( my $line = $genemap_fh->getline ) {

  # remove the trailing newline character before splitting
  chomp $line;

  # an example line from file 'genemap' looks like:
  # 1.1|5|13|13|1pter-p36.13|CTRCT8, CCV|P|Cataract, congenital, Volkmann type||115665|Fd|linked to Rh in Scottish family||Cataract 8, multiple types (2)| | ||

  # the keys for each column are specified in 'genemap.key':
  # 1  - Numbering system, in the format  Chromosome.Map_Entry_Number
  # 2  - Month entered
  # 3  - Day     "
  # 4  - Year    "
  # 5  - Cytogenetic location
  # 6  - Gene Symbol(s)
  # 7  - Gene Status (see below for codes)
  # 8  - Title
  # 9  - Title, cont.
  # 10 - MIM Number
  # 11 - Method (see below for codes)
  # 12 - Comments
  # 13 - Comments, cont.
  # 14 - Disorders (each disorder is followed by its MIM number, if
  #      different from that of the locus, and phenotype mapping method (see
  #      below).  Allelic disorders are separated by a semi-colon.
  # 15 - Disorders, cont.
  # 16 - Disorders, cont.
  # 17 - Mouse correlate
  # 18 - Reference

  # split up the line based on the '|' character
  # note: we use '\|' since writing '|' on its own has a special meaning
  my @cols = split /\|/, $line;

  # create a HASH / associative array to provide labels for these values
  # note: arrays start from '0' so we take one away from the column
  # numbers mentioned above (e.g. column 5 becomes $cols[4])
  my %genemap_entry = (
    id                   => $cols[0],
    month_entered        => $cols[1],
    day_entered          => $cols[2],
    year_entered         => $cols[3],
    date_entered         => "$cols[2]-$cols[1]-$cols[3]",   # "Day-Month-Year"
    cytogenetic_location => $cols[4],
    gene_symbol          => $cols[5],
    # add more labels for the rest of the columns
  );

  # put a *reference* to this HASH onto our growing array of entries
  push @genemap_entries, \%genemap_entry;
}

It’s really important to add useful comments into your code. Not just what you are doing, but why you are doing it. In a few months time, you won’t remember any of this and if you don’t put these comments in, you’ll need to figure it out all over again.

Step 4. Process the data

Usually we would want to do something interesting with the data – such as filter out certain rows, sort these entries, etc. This would be a good place to do it, but we’ll save that for a different day.

Step 5. Output the data

We’re going to check that everything has gone okay by simply printing out the entries that we’ve parsed from the file. Again, the code has lots of comments so I won’t go through it line by line.

# note: the following section is going to print out the following:
#
#   1.1    13-5-13          CTRCT8, CCV
#   1.2    25-9-01      ENO1, PPH, MPB1
#   1.3   22-12-87          ERPL1, HLM2
#   ...        ...                  ...
# 24.51    25-8-98        GCY, TSY, STA
# 24.52    20-3-08                DFNY1
# 24.53     8-2-01                  RPY
#
# Number of Genemap Entries: 15037
#

# go through these entries one by one...
foreach my $gm_entry ( @genemap_entries ) {

  # we can use the keys that we defined when creating the HASH
  # to access the values for each entry in a meaningful way
  # note: $gm_entry is a HASH *reference*
  #       to access the data in the HASH: $gm_entry->{ key }
  printf "%5s %10s %20s\n", $gm_entry->{ id }, $gm_entry->{ date_entered }, $gm_entry->{ gene_symbol };
}

print "\n"; # new line
print "Number of Genemap Entries: ", scalar( @genemap_entries ), "\n";
print "\n";

All done.

Here’s the listing of the program in full:

#!/usr/bin/env perl

use strict;
use warnings;

use File::Basename qw/ basename /;

# 'basename' is imported from File::Basename
my $PROGNAME = basename( $0 );

my $USAGE =<<"_USAGE";
usage: $PROGNAME <genemap_file>

Parses OMIM "genemap" file

_USAGE

my $genemap_filename = shift @ARGV or die "$USAGE";

# create a file handle that we can use to read in the contents of
# the genemap file
# (and complain if there's a problem)
# note: '<' means "open this file for reading" - much like input
# redirection in a linux shell

open( my $genemap_fh, '<', $genemap_filename )
  or die "! Error: failed to open file $genemap_filename: $!";

# create an array that will contain our genemap entries
my @genemap_entries;

# go through the file line by line
while( my $line = $genemap_fh->getline ) {

  # remove the trailing newline character before splitting
  chomp $line;

  # an example line from file 'genemap' looks like:
  # 1.1|5|13|13|1pter-p36.13|CTRCT8, CCV|P|Cataract, congenital, Volkmann type||115665|Fd|linked to Rh in Scottish family||Cataract 8, multiple types (2)| | ||

  # the keys for each column are specified in 'genemap.key':
  # 1  - Numbering system, in the format  Chromosome.Map_Entry_Number
  # 2  - Month entered
  # 3  - Day     "
  # 4  - Year    "
  # 5  - Cytogenetic location
  # 6  - Gene Symbol(s)
  # 7  - Gene Status (see below for codes)
  # 8  - Title
  # 9  - Title, cont.
  # 10 - MIM Number
  # 11 - Method (see below for codes)
  # 12 - Comments
  # 13 - Comments, cont.
  # 14 - Disorders (each disorder is followed by its MIM number, if
  #      different from that of the locus, and phenotype mapping method (see
  #      below).  Allelic disorders are separated by a semi-colon.
  # 15 - Disorders, cont.
  # 16 - Disorders, cont.
  # 17 - Mouse correlate
  # 18 - Reference

  # split up the line based on the '|' character
  # note: we use '\|' since writing '|' on its own has a special meaning
  my @cols = split /\|/, $line;

  # create a HASH / associative array to provide labels for these values
  # note: arrays start from '0' so we take one away from the column
  # numbers mentioned above (e.g. column 5 becomes $cols[4])
  my %genemap_entry = (
    id                   => $cols[0],
    month_entered        => $cols[1],
    day_entered          => $cols[2],
    year_entered         => $cols[3],
    date_entered         => "$cols[2]-$cols[1]-$cols[3]",   # "Day-Month-Year"
    cytogenetic_location => $cols[4],
    gene_symbol          => $cols[5],
    # add more labels for the rest of the columns
  );

  # put a *reference* to this HASH onto our growing array of entries
  push @genemap_entries, \%genemap_entry;
}

# note: the following section is going to print out the following:
#
#   1.1    13-5-13          CTRCT8, CCV
#   1.2    25-9-01      ENO1, PPH, MPB1
#   1.3   22-12-87          ERPL1, HLM2
#   ...        ...                  ...
# 24.51    25-8-98        GCY, TSY, STA
# 24.52    20-3-08                DFNY1
# 24.53     8-2-01                  RPY
#
# Number of Genemap Entries: 15037
#

# go through these entries one by one...
foreach my $gm_entry ( @genemap_entries ) {

  # we can use the keys that we defined when creating the HASH
  # to access the values for each entry in a meaningful way
  # note: $gm_entry is a HASH *reference*
  #       to access the data in the HASH: $gm_entry->{ key }
  printf "%5s %10s %20s\n", $gm_entry->{ id }, $gm_entry->{ date_entered }, $gm_entry->{ gene_symbol };
}

# let people know how many entries we've processed
print "\n"; # new line
print "Number of Genemap Entries: ", scalar( @genemap_entries ), "\n";
print "\n";
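
And to run the finished script against the data we downloaded earlier:

# run the parser on our local copy of the genemap file
> cd ~/omim_project
> perl parse_genemap.pl omim_data.2014_09_16/genemap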

Pre-processing metagenomic data

A vital part of metagenome analysis is first ensuring that you are working with good quality data. During my PhD I have used the tools PRINSEQ and DeconSeq (Schmieder and Edwards, 2011) to process and filter metagenomic data sequenced using the 454 pyrosequencing technology.

PRINSEQ generates summary statistics, which can also give an indication of data quality. Data can be filtered on a number of criteria, including: the removal of duplicate sequences, the removal of sequences containing more than x% ambiguous (i.e. “N”) bases, and the removal of sequences with a mean quality score below x. Sequences can be trimmed to a defined length, to exclude poor quality bases, and to remove poly-A/T tails. Finally, the output can be re-formatted between FASTA+QUAL and FASTQ formats, converted between DNA and RNA alphabets, and sequence IDs can be renamed.
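
To give a flavour of what this looks like in practice, here’s a sketch of a prinseq-lite command line – the flags and thresholds below are illustrative rather than a recommendation, so check ‘perl prinseq-lite.pl -h’ for the options available in your version:

# remove exact duplicates, drop reads with >5% Ns or a mean quality
# below 20, trim poly-A/T tails, and write the survivors as FASTQ
$ perl prinseq-lite.pl -fastq reads.fastq \
    -derep 1 -ns_max_p 5 -min_qual_mean 20 \
    -trim_tail_right 5 -trim_tail_left 5 \
    -out_format 3 -out_good reads_filtered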

DeconSeq is typically used after PRINSEQ and removes ‘contaminant’ sequences, i.e. sequences that belong to species you do not wish to analyse. Data is screened (DeconSeq uses BWA-SW alignments) against a reference database of the contaminant species of choice, and any matching sequences are removed.

Both of these tools are free to use and are available either as a web service or as standalone versions.

Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]

Coping with millions of small files: appending to a tar archive

Most file systems will struggle when you read and write millions of tiny files. Exactly how much a particular file system will struggle will depend on a bunch of factors: the type of file system (ext3, GPFS, etc.), the hardware configuration (RAM / networking bottlenecks) and so on. However, the take-home message is that storing many millions of tiny files on standard file systems (especially network file systems) is going to cause performance problems.

We recently came across this when performing millions of structural comparisons (with SSAP). Each comparison results in a small file, so running 25 million of these comparisons on the HPC caused a number of problems with storage.

The solution? Well, as always, there are lots.

You could store the file contents in a database (rather than a file system). The downside is that this brings the overhead of running a database: extra dependencies, concurrency issues when accessing it from many places at the same time, and generally more tools in the chain to go wrong.

Since we’re running this in an HPC environment, we’ll want a solution that is simple, scales well, and requires few dependencies. A possible alternative would be to store all these small files in a single large archive.

Looking at the tar man page, we can create a new tar archive with the ‘-c’ flag:

$ tar -cvf my_archive.tar existing_file

and we can append extra files to that archive with the ‘-r’ flag:

$ tar -rvf my_archive.tar extra_file1
$ tar -rvf my_archive.tar extra_file2

You can then list the contents of that archive with the ‘-t’ flag:

$ tar -tf my_archive.tar
existing_file
extra_file1
extra_file2
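
And, should you need a single file back out again, the ‘-x’ flag extracts it by name:

$ tar -xvf my_archive.tar extra_file1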

So, looking at a real example…

Let’s say we split our job of 25 million pairwise comparisons into 2500 lists, each containing 10000 pairs:

$ split --lines 10000 -a 4 -d ssap_pairs.all ssap_pairs.

That will result in 2500 files, each containing 10000 lines (called ssap_pairs.0000, ssap_pairs.0001, …).

We can then submit a job array to the scheduler so that a single script can process each of these input files.

$ qsub -t 1-2500 ssap.submit.pbs
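
Each task in the array needs to work out which pairs file it owns. One hypothetical way to do that inside the script (note that the task IDs run from 1 while split numbered the files from 0000, hence the arithmetic):

# map the task id (1..2500) onto the zero-padded file suffix (0000..2499)
PAIRS_FILE=ssap_pairs.`printf "%04d" $(( SGE_TASK_ID - 1 ))`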

Rather than post the full contents of the script ‘ssap.submit.pbs’ here – I’ll just focus on the loop that creates these 10000 small alignment files:

# name the archive file
# (plain .tar for now - it gets gzipped at the end of the job)
ALN_TAR_FILE=ssap_alignments.`printf "%04d" $SGE_TASK_ID`.tar

# create the archive (start the archive off with an existing file)
tar -cvf $ALN_TAR_FILE ssap_pairs.all

while read pairs; do
    # name of the alignment file that SSAP will produce
    # (e.g. 1abcA001defA00.list)
    aln_file=`echo "$pairs" | tr -d " "`.list

    # run SSAP
    $BIN_DIR/SSAP $pairs >> ssap_results.out

    # append alignment file to archive
    tar -rvf $ALN_TAR_FILE $aln_file

    # remove alignment file
    \rm $aln_file

done < $PAIRS_FILE

# compress then copy the archive back to the original working directory
gzip $ALN_TAR_FILE 
cp $ALN_TAR_FILE.gz ${SGE_O_WORKDIR}
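
Getting at the alignments later is straightforward, since GNU tar will decompress on the fly with the ‘-z’ flag:

# list the contents of one of the compressed archives
$ tar -tzf ssap_alignments.0001.tar.gz

# ...or pull out a single alignment file by name
$ tar -xzf ssap_alignments.0001.tar.gz 1abcA001defA00.list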

Job done.