So I have 1000 samples (about 2000 fastq files) that I need to run "bwa mem" on, and I'm trying to do this in the most time-efficient way possible. Is there a way to parallelize this command?
I figure I can use GNU Parallel, but what do I do with the read-group header definition (i.e. -R)?
For example, my current command reads:
parallel bwa mem -R "@RG\tID:{}\tPL:ILLUMINA\tLB:lib1" $REF {}_1.fastq {}_2.fastq > {}.sam
Does this make sense?
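For concreteness, here's a sketch of what I'm imagining, assuming the IDs live in a hypothetical ids.txt (one per line). I've quoted the whole command so the > redirect happens inside each job rather than once for parallel itself:

```shell
# ids.txt (hypothetical) holds one sample ID per line, e.g. XX0001.
# Quoting the whole command keeps the > redirect inside each job;
# --dry-run just prints the commands instead of running them -- drop it to go live.
cat ids.txt | parallel --dry-run "bwa mem -R '@RG\tID:{}\tPL:ILLUMINA\tLB:lib1' $REF {}_1.fastq {}_2.fastq > {}.sam"
```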
What arguments are you supplying to parallel? Sample names/IDs, full filenames, something else? Parallel should be able to substitute the value you want into the RG header. What is the exact issue you're having?
Karl is correct; there is a trade-off between parallelization and I/O. Run too many jobs at once and you'll clog up the I/O, and the overall runtime will be no better than that of a smaller number of simultaneous jobs.
Equal is not even the lower bound: it can get much slower than sequential operation if the operating system tries to run everything together and has to swap out memory, or load/unload/reload data as different subtasks take over the limited CPU resources.
Yep, you can slow it down to worse than serial.
I was hoping to pass sample IDs into parallel. For example, if an ID is XX0001, then the forward fastq is called XX0001_1.fastq and the reverse is XX0001_2.fastq. I think there would be some variant of
parallel bwa mem -R "@RG\tID:{}\tPL:ILLUMINA\tLB:lib1" $REF {}_1.fastq {}_2.fastq > {}.sam
What command would actually accomplish this? Should I create a listfile with the sample IDs?
You should check out the documentation on GNU parallel some more. You have to feed parallel something to iterate over, and there are tons of ways to do that. Try out some of the examples and play around to see how it behaves using echo.
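For example, a quick way to see what parallel would actually run is --dry-run, which prints each constructed command instead of executing it (XX0001/XX0002 and ref.fa are placeholder names here):

```shell
# Print the commands parallel WOULD run, without invoking bwa at all.
# {} is replaced by each sample ID read from stdin.
printf 'XX0001\nXX0002\n' \
  | parallel --dry-run "bwa mem -R '@RG\tID:{}\tPL:ILLUMINA\tLB:lib1' ref.fa {}_1.fastq {}_2.fastq > {}.sam"
```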
You could do something like this (although parsing ls is bad). Say my files are in the format of sampleID1_1.fq, sampleID1_2.fq, and so on, where input and output are wherever your inputs and outputs are/go. Again, tune -j and -t to whatever your system can handle. Unless you're running these on a large number of nodes (which won't work with an ls-based approach), I wouldn't run 1000 of them at a time.