Running BWA parallel on 1000 samples
2
3
Entering edit mode
9.1 years ago
ebrown1955 ▴ 320

So I have 1000 samples (about 2000 fastq files) that I need to do "bwa mem" on. I'm trying to do this is the most time-efficient way possible. Is there a way to parallelize this command?

I figure that I can use GNU/Parallel, however what do I do with the header definition (i.e. -R)

For example, my current command reads:

parallell bwa mem -R "@RG\tID:{}\tPL:ILLUMINA\tLB:lib1" $REF {}_1.fastq {}_2.fastq > {}.sam

Does this make sense?

alignment BWA • 4.8k views
ADD COMMENT
2
Entering edit mode

What arguments are you supplying to parallel? Sample names/IDs, full files, what? Parallel should be able to add the value that you want to the RG header. What is the exact issue you're having?

Karl is correct, there is a trade off between parallelization and IO. Run too many at once and you'll clog up the IO and the overall runtime will be equal to that of a smaller number of simultaneous jobs.

ADD REPLY
0
Entering edit mode

Equal is not the lower bound. It can get much slower than sequential operation if the operating system tries to run them all together and has to swap out memory, or load/unload/reload data as different subtasks take over the limited cpu resources.

ADD REPLY
0
Entering edit mode

Yep, you can slow it down to worse than serial.

ADD REPLY
0
Entering edit mode

I was hoping to pass sample IDs into parallel. For example, if an id is XX0001, then the fwd fastq is called XX0001_1.fastq and the rev is XX0001_2.fastq.

I think there would be some variant of parallel bwa mem -R "@RG\tID:{}\tPL:ILLUMINA\tLB:lib1" $REF {}_1.fastq {}_2.fastq > {}.sam

What command would actually accomplish this. Should I create a listfile with the sample IDs?

ADD REPLY
0
Entering edit mode

You should check out the documentation on GNU parallel some more, you have to feed parallel something to use and there are tons of ways you can do that. You should try out some of the examples and play around seeing how it behaves using echo.

You could do something like this (although parsing ls is bad):

Say my files are in the format of (sampleID1_1.fq, sampleID1_2.fq) and so on:

parallel -j 4 bwa mem -t 3 -R "@RG\tID:{}\tPL:ILLUMINA\tLB:lib1" <ref genome> /input/dir/{}_1.fastq /input/dir/{}_2.fastq ">" /output/dir/{}.sam ::: `ls /input/dir/*.fastq | cut -d '_' -f1 | uniq`

Where input and output are wherever your input and outputs are/go. Again, tune -j and -t to whatever your system uses. Unless you're running these on a large number of nodes (which won't work with the above example), I wouldn't run 1000 of them at a time.

ADD REPLY
2
Entering edit mode
9.1 years ago

I use a make or SGE+qmake

see also: https://github.com/lindenb/ngsxml "generating workflows with XML and XSLT"

ADD COMMENT
1
Entering edit mode

Could you share an example qmake script for a simple bwa/samtools alignment?

ADD REPLY
1
Entering edit mode
9.1 years ago

For my computer, the bwa's built in parallelism (-t argument) can use up all the CPU and become bottlenecked by harddrive speed. How fast it can read in and write out files, or it will have used up all the CPU on a single sample.

That means you'll get as good or better speed by doing each sample sequentially.

Just use a bash for loop to run bwa mem with -t set to your cpu count. GNU Parallel can do something like that with the --jobs parameter to limit the number of simultaneous tasks.

If you have multiple compute nodes on a grid, then my answer is wrong and you should distribute the workload.

ADD COMMENT
0
Entering edit mode

-t limits the number of cores each BWA instance can use. -j (parallel) limits the number of simultaneous instances.

So parallel -j 4 bwa mem -t 4 {etc} would be 4 BWA instances each using 4 cores, for a total of 16 cores.

ADD REPLY

Login before adding your answer.

Traffic: 1875 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6