Some tools (bowtie2, usearch, etc.) do not accept input data or reference databases above a certain size. What is an efficient strategy for dealing with such cases, other than manual splitting (and later manual assembly of the result files)? The more automated and reproducible, the better.
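For illustration, the splitting step itself is easy to automate deterministically; here is a minimal sketch (the chunk size and `chunk_*.fasta` naming scheme are arbitrary choices, not anything a particular tool requires):

```python
#!/usr/bin/env python3
"""Deterministically split a FASTA file into fixed-size chunks."""
import itertools
import sys

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def split_fasta(path, records_per_chunk=10000, prefix="chunk"):
    """Write consecutive records into numbered chunk files; return their names."""
    names = []
    records = read_fasta(path)
    for i in itertools.count():
        batch = list(itertools.islice(records, records_per_chunk))
        if not batch:
            break
        name = f"{prefix}_{i:04d}.fasta"
        with open(name, "w") as out:
            for header, seq in batch:
                out.write(f"{header}\n{seq}\n")
        names.append(name)
    return names

if __name__ == "__main__":
    print("\n".join(split_fasta(sys.argv[1])))
```

Because the records are taken in file order rather than at random, rerunning this on the same input always produces the same chunks, which helps with reproducibility.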
EDIT: More specifically, I'm looking for comments on:
- random splitting (given that some tools use heuristics, I'm not sure I'd get the same results each time if I cut the data into halves in a different way)
- existing tools that are capable of splitting the data into smaller chunks (the most obvious example is formatdb or makeblastdb from NCBI, which produce a formatted database in chunks of about 1 GB)
- merging results from different chunks, whether the input or the reference was split (see the sketch after this list)
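On the merging side: if the *queries* were split, the per-chunk outputs can usually just be concatenated. Splitting the *reference* is trickier, because each query gets hits in every chunk and the best hit has to be re-selected afterwards (and note that database-size-dependent statistics such as BLAST E-values will differ between a chunk and the combined database). A minimal sketch for BLAST-style tabular output (`-outfmt 6`), keeping the single highest-bitscore hit per query; the `results_chunk_*.tsv` naming is hypothetical:

```python
#!/usr/bin/env python3
"""Merge per-chunk BLAST tabular results, keeping the best hit per query."""
import csv
import glob
import sys

best = {}  # query id -> best-scoring row seen so far
for path in sorted(glob.glob("results_chunk_*.tsv")):  # hypothetical naming
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            # In -outfmt 6, column 12 (index 11) is the bit score.
            query, bitscore = row[0], float(row[11])
            if query not in best or bitscore > float(best[query][11]):
                best[query] = row

writer = csv.writer(sys.stdout, delimiter="\t")
for row in best.values():
    writer.writerow(row)
```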
This is probably not helpful, but I generally try to find better software instead. Splitting up may have made sense when BLAST was written in the previous century; it doesn't make much sense now.