Coping with millions of small files: appending to a tar archive

Most file systems struggle when you read and write millions of tiny files. Exactly how badly a particular file system struggles depends on a number of factors: the file system itself (ext3, GPFS, etc.), the hardware configuration (RAM, networking bottlenecks) and so on. The take-home message, though, is that storing many millions of tiny files on standard file systems (especially network file systems) is going to cause performance problems.

We recently came across this when performing millions of structural comparisons (with SSAP). Each comparison results in a small file, so running 25 million of these comparisons on the HPC caused a number of problems with storage.

The solution? Well, as always, there are lots.

You could store the file contents in a database rather than on the file system. The downside is the overhead of running the database: extra dependencies, concurrency issues when writing from many jobs at the same time, and generally more tools in the chain that can go wrong.
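(If you did go down that route, something serverless like SQLite would at least keep the overhead down. A rough sketch, with made-up database, table and file names; readfile() and writefile() are helpers built into the sqlite3 command-line shell, not the library:)

$ sqlite3 files.db "CREATE TABLE files (name TEXT PRIMARY KEY, content BLOB)"
$ sqlite3 files.db "INSERT INTO files VALUES ('small_file_1', readfile('small_file_1'))"
$ sqlite3 files.db "SELECT writefile(name, content) FROM files WHERE name = 'small_file_1'"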

Since we’re running this in an HPC environment, we’ll want a solution that is simple, scales well, and requires few dependencies. A possible alternative would be to store all these small files in a single large archive.

Looking at the tar man page, we can create a new tar archive with the ‘-c’ flag:

$ tar -cvf my_archive.tar existing_file

and we can append extra files to that archive with the ‘-r’ flag:

$ tar -rvf my_archive.tar extra_file1
$ tar -rvf my_archive.tar extra_file2

You can then list the contents of that archive with the ‘-t’ flag:

$ tar -tf my_archive.tar
existing_file
extra_file1
extra_file2
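And we can pull individual files back out again with the ‘-x’ flag (adding ‘-O’ writes the file to stdout rather than to disk):

$ tar -xvf my_archive.tar extra_file1
$ tar -xOf my_archive.tar extra_file2

One caveat worth knowing: tar can only append to an uncompressed archive, so if you want compression it has to wait until the archive is finished (which is exactly what the script below does).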

So, looking at a real example…

Let’s say we split our job of 25 million pairwise comparisons into 2500 lists, each containing 10000 pairs:

$ split --lines 10000 -a 4 --numeric-suffixes=1 ssap_pairs.all ssap_pairs.

That will result in 2500 files, each containing 10000 lines (ssap_pairs.0001 through ssap_pairs.2500; starting the numeric suffixes at 1 keeps them in step with the scheduler’s task IDs below).
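A quick sanity check that the split produced what we expect (25 million pairs / 10000 lines per file = 2500 files):

$ ls ssap_pairs.[0-9]* | wc -l
2500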

We can then submit a job array to the scheduler, so that a single script processes each of these input files:

$ qsub -t 1-2500 ssap.submit.pbs

Rather than post the full contents of the script ‘ssap.submit.pbs’ here, I’ll just focus on the loop that creates, and immediately archives, these 10000 small alignment files:

# the pairs file this task should process (assuming the
# ssap_pairs.NNNN naming produced by split above)
PAIRS_FILE=ssap_pairs.`printf "%04d" $SGE_TASK_ID`

# name the archive file (plain .tar for now; we compress at the end)
ALN_TAR_FILE=ssap_alignments.`printf "%04d" $SGE_TASK_ID`.tar

# create the archive (tar refuses to create an empty archive,
# so start it off with this task's pairs file)
tar -cvf $ALN_TAR_FILE $PAIRS_FILE

while read pairs; do
    # name of the alignment file that SSAP will produce
    # (e.g. 1abcA001defA00.list)
    aln_file=`echo "$pairs" | tr -d " "`.list

    # run SSAP
    $BIN_DIR/SSAP $pairs >> ssap_results.out

    # append alignment file to archive
    tar -rvf $ALN_TAR_FILE $aln_file

    # remove alignment file
    \rm $aln_file

done < $PAIRS_FILE

# compress the finished archive (remember: we can't append once
# it's compressed), then copy it back to the original working directory
gzip $ALN_TAR_FILE
cp $ALN_TAR_FILE.gz ${SGE_O_WORKDIR}
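Once all the jobs have finished, GNU tar can also concatenate archives with the ‘-A’ flag, should you want everything in one big archive afterwards. A rough sketch (file names as above; note that ‘-A’ only works on uncompressed archives):

# create an empty archive to accumulate into (reading the file
# list from /dev/null stops tar refusing to create an empty archive)
tar -cf ssap_alignments.all.tar -T /dev/null

for f in ssap_alignments.*.tar.gz; do
    gunzip $f
    tar -Af ssap_alignments.all.tar ${f%.gz}
done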

Job done.