Strategy For Dealing With Reference Files (Genomes) That Are Larger Than The Software Limits
11.9 years ago

Some tools (bowtie2, usearch, etc.) do not accept input data or reference databases above a certain size. What is an efficient strategy for dealing with such cases, other than splitting the data manually (and later assembling the result files by hand)? The more automated and reproducible, the better.

EDIT: More specifically, I'm looking for comments on:

  • random splitting (some tools use heuristics, so I'm not sure I'd get the same results each time if I cut the data into halves in a different way)
  • existing tools that are capable of splitting the data into smaller chunks (the most obvious example is formatdb or makeblastdb from NCBI, which produce a formatted db in chunks of about 1GB)
  • merging results from different chunks (input or reference)
fasta analysis • 2.5k views

This is probably not helpful, but I generally try to find better software. Splitting up may have made sense when BLAST was written in the previous century, but it doesn't make much sense now.

11.9 years ago

For automation and reproducibility, use a makefile in conjunction with Sun Grid Engine (SGE) qmake, or shell scripts coordinated with SGE qsub -hold_jid or qsub-ed job arrays (or similar options with other cluster job schedulers).

One nice thing about a makefile-based workflow is that it can alleviate the need to regenerate intermediate results and thus save time during development. Once you think about your workflow as a chain of parent and child targets, it is very easy to set up the different hierarchies you need for various analyses that are similar, but need minor tweaks.

For example, a target A may rely on parent targets B and C. But you might slightly modify the A workflow, copying it to A-mod and adding another parent target D:

A: B C
    ...
A-mod: B C D
    ...

(The ... signifies commands that run once A and A-mod dependencies are completed and available.)

In turn, you can have downstream targets that rely on the successful completion of A and A-mod:

compare_A_and_A-mod: A A-mod
    ...

And so on.

Also, you can mix Perl, shell, and other scripting languages very freely in a makefile. It's quite a powerful and expressive way to script, and it is well-suited to the hierarchical workflows found in bioinformatics settings.
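
To make the split and merge steps from the question concrete, here is a minimal Python sketch of the kind of script that could sit behind a "split" target and a "merge" target in such a makefile. It is an illustration under stated assumptions, not a tested pipeline: the function names (read_fasta, split_fasta, merge_best_hits) and the chunk file naming are made up for this example, and the merge step assumes each per-chunk result file is tab-separated with query, target and score in the first three columns, which will not match every tool's output. Splitting is done in input order with a fixed size cap, so the same input always yields the same chunks.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(path, out_dir, max_bases):
    """Write records, in input order, into numbered chunk files.

    Each chunk holds at most max_bases bases, except that a single record
    longer than max_bases gets a chunk of its own (records are never cut).
    Returns the list of chunk paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks, current, size = [], [], 0

    def flush():
        nonlocal current, size
        if current:
            # Hypothetical naming scheme: chunk_0000.fa, chunk_0001.fa, ...
            chunk_path = out_dir / f"chunk_{len(chunks):04d}.fa"
            with open(chunk_path, "w") as out:
                for header, seq in current:
                    out.write(f">{header}\n{seq}\n")
            chunks.append(chunk_path)
        current, size = [], 0

    for header, seq in read_fasta(path):
        if current and size + len(seq) > max_bases:
            flush()
        current.append((header, seq))
        size += len(seq)
    flush()
    return chunks


def merge_best_hits(result_paths):
    """Keep the highest-scoring hit per query across per-chunk result files.

    Assumed format (not any specific tool's): query<TAB>target<TAB>score.
    Ties are broken by target name so the outcome does not depend on the
    order in which chunks were processed.
    """
    best = {}
    for path in sorted(str(p) for p in result_paths):
        with open(path) as handle:
            for line in handle:
                if not line.strip():
                    continue
                query, target, score = line.rstrip("\n").split("\t")[:3]
                key = (-float(score), target)
                if query not in best or key < best[query]:
                    best[query] = key
    return {query: (target, -neg) for query, (neg, target) in best.items()}
```

One caveat on the merge: this only makes sense for scores that are comparable across chunks. Statistics that depend on the size or content of the whole database (BLAST E-values, for instance) would need to be recomputed or corrected rather than compared directly.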


Alex, I appreciate your answer, but I'm looking for solutions that are much more specific. I've edited the question to provide more details.


You might want to provide more specifics on the tools, inputs and use cases that are causing you problems.
