Which command line programs are parallelizable?
9.6 years ago
Chris Rhodes ▴ 50

I'm just curious if anybody can suggest command line programs that are designed to be truly parallelized?

I found a great post on running serial programs in parallel with GNU parallel: Tool: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

However, instead of using tricks to run serial programs in a parallel fashion, is there anything out there that can be run in parallel directly?

I'm mainly interested in programs used for ChIP-seq and RNA-seq workflows. For example, I think the STAR aligner can run in parallel, but I believe common tools like TopHat, MACS, and Cufflinks are all serial at some point.

Thanks in advance for any input.

RNA-Seq ChIP-Seq sequence • 3.7k views

I guess you are referring to true threads vs. forks (i.e., separate processes). This would have to be implemented in the code of the program and may require modifications to the underlying code (may also require additional libraries). What do you think this would gain? What I mean is you could use threads from a script or program to run your pipeline but I'm not sure this would be worth the effort vs. using something already available.
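To make the threads-versus-processes distinction concrete, here is a minimal Python sketch. It is not taken from any tool mentioned in this thread, and the function names are made up for illustration. The first function runs work on threads inside one process; the second launches separate child processes, which is roughly what a wrapper like GNU parallel does from the outside.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# One-line child program: square the number given as its first argument.
CHILD = "import sys; n = int(sys.argv[1]); print(n * n)"


def square(n):
    return n * n


def run_in_threads(values, workers=4):
    """Threads live inside one process and share its memory."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


def run_in_processes(values):
    """Each child is a separate OS process; results come back over stdout."""
    children = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD, str(int(v))],
            stdout=subprocess.PIPE,
            text=True,
        )
        for v in values
    ]
    # All children are already running; collect their output in order.
    return [int(child.communicate()[0]) for child in children]


if __name__ == "__main__":
    print(run_in_threads([1, 2, 3]))
    print(run_in_processes([1, 2, 3]))
```

Note that threading has to be written into the program itself, whereas the process-based version works with any existing command line tool, which is the trade-off described above.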


Since you mentioned them, both TopHat and Cufflinks are already multithreaded. In fact, it'd be hard to find a single-threaded aligner or assembler, since no one would use one.


bwa-aln and TopHat are both parallel, but they have single-threaded components that can bottleneck them on many-core systems. For bwa-aln that's the sampe/samse stage; for TopHat, it's some Perl code (IIRC). I could be wrong about TopHat; I'm just basing that on what I've seen in top.


TopHat does have an optional single-threaded step, that's correct. I'm not sure how many people actually use bwa-aln anymore; bwa mem works better in most cases.

9.6 years ago

Since you asked for tools for the analysis of RNA-seq and ChIP-seq data: there are deepTools, seqminer and spark, all of which let you make use of multiple processors. I have only worked with deepTools, though, and of course I think they're pretty great and useful (I helped develop them).

9.6 years ago

Most of BBTools is completely parallel. This includes BBMap/BBSplit (even the indexing is parallel), BBDuk, BBNorm, Seal, BBMerge, Dedupe, DemuxByName, KmerCountExact, and a few others. The programs that are not fully parallel (reformat, repair, pileup, stats, splitnextera, translate6frames, filterbyname, etc) are typically limited by disk I/O speed anyway, but still use at least one thread per file being read or written.

I'd like to also mention one of my favorite programs, pigz, which does gzip in parallel and is thus very useful in bioinformatics. If pigz is installed, all BBTools will use it automatically when reading or writing gzipped content. This greatly increases the speed of, for example, reformat.sh in=reads.fastq.gz out=reads.fasta.gz.
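For anyone curious why gzip can be parallelized at all, here is a rough Python sketch of the general idea. It is an illustration only and not how pigz itself is implemented: it compresses fixed-size chunks on several threads and concatenates the results, relying on the fact that a series of gzip members back to back is still a valid gzip file. The function name and chunk size are arbitrary choices for the example.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress chunks independently and join them into one multi-member gzip stream."""
    # Keep at least one (possibly empty) chunk so the output is always valid gzip.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the members stay in sequence.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)


if __name__ == "__main__":
    reads = b"ACGTACGTTTGA\n" * 200_000
    packed = parallel_gzip(reads)
    # Any gzip reader should see the original bytes again.
    print(gzip.decompress(packed) == reads)
```

Compressing chunks independently usually costs a little compression ratio, since each chunk starts without the context of the previous one.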


BBTools/BBMap looks like an excellent program. Thank you for pointing me to that - I plan on trying it on an upcoming project!


I would also like to say that I have been hearing good things about BBMap/BBTools lately. We have a decent multiprocessor cluster, and I hate it when a few jobs become bottlenecks because they can't be run in parallel. I am excited to try it on my new genomic sequencing data sets.


Dear Brian, when running BBDuk, it is restricted to only one pair of fastq.gz files. How can this be overcome?


Hi lamteva,

Perhaps you could explain what you are trying to do? When you have multiple pairs of fastq files, it's typical to use one of the following approaches:

1) Concatenate all of the read 1 files together and all of the read 2 files together. Then, process them all at once. This makes sense when you have a single library that happens to be split across multiple lanes, so the many files differ in no meaningful way.

2) Process each pair separately. This is the best approach when you have multiple different libraries that will be used independently. In this case, you would typically run one BBDuk process per file pair in a bash loop, or create a shell script with one BBDuk command per line and execute that. Since BBDuk is already parallel, you don't need to run multiple instances at once.
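Approach 2 could be sketched in Python along these lines. This is only an outline under some assumptions: the `_R1`/`_R2` file naming and the `trimmed` output directory are hypothetical, and the commands carry only the paired input/output arguments, so you would add whatever trimming or filtering options you actually need. The sketch builds one BBDuk command per pair and leaves running them, one after another, to the caller.

```python
import shlex
from pathlib import PurePosixPath


def pair_reads(files):
    """Pair each *_R1* file with its *_R2* partner (hypothetical naming scheme)."""
    names = set(files)
    pairs = []
    for r1 in sorted(names):
        if "_R1" not in PurePosixPath(r1).name:
            continue
        r2 = r1.replace("_R1", "_R2", 1)
        if r2 in names:  # skip reads without a mate file
            pairs.append((r1, r2))
    return pairs


def bbduk_commands(files, out_dir="trimmed"):
    """Build one BBDuk command per pair; extra options are left to the user."""
    commands = []
    for r1, r2 in pair_reads(files):
        out1 = str(PurePosixPath(out_dir) / PurePosixPath(r1).name)
        out2 = str(PurePosixPath(out_dir) / PurePosixPath(r2).name)
        commands.append(
            ["bbduk.sh", f"in1={r1}", f"in2={r2}", f"out1={out1}", f"out2={out2}"]
        )
    return commands


if __name__ == "__main__":
    example = ["libA_R1.fastq.gz", "libA_R2.fastq.gz",
               "libB_R1.fastq.gz", "libB_R2.fastq.gz"]
    for cmd in bbduk_commands(example):
        # To actually run them in sequence: subprocess.run(cmd, check=True)
        print(shlex.join(cmd))
```

Running the commands sequentially is deliberate here, since each BBDuk run already uses the available cores.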

9.6 years ago

BEDOPS supports parallelized workflows. For instance, just split the BED file by chromosome with bedextract (or specify the chromosome name when using unstarch on a Starch archive) and work on the problem piece by piece with your job scheduler of choice. There are no "tricks" with distributing work, just different ways to go about its implementation.
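As a rough illustration of the split-by-chromosome idea, here is a small Python sketch. It does not use bedextract or a job scheduler; plain Python grouping and a thread pool stand in for both, and the per-chromosome task (summing interval lengths) is just a placeholder for whatever analysis you would really run.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_by_chromosome(bed_lines):
    """Group (start, end) intervals by the chromosome in column 1."""
    groups = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        groups[chrom].append((int(start), int(end)))
    return dict(groups)


def total_length(intervals):
    """Placeholder task: sum of interval lengths (overlaps are counted twice)."""
    return sum(end - start for start, end in intervals)


def per_chromosome_lengths(bed_lines, workers=4):
    """Run the placeholder task on every chromosome independently."""
    groups = split_by_chromosome(bed_lines)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        totals = pool.map(total_length, groups.values())
        return dict(zip(groups.keys(), totals))


if __name__ == "__main__":
    bed = ["chr1\t0\t100", "chr2\t10\t20", "chr1\t200\t250"]
    print(per_chromosome_lengths(bed))
```

Because no chromosome depends on another, each piece could just as well be a separate cluster job.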

For instance, another way is via multithreading, where memory is shared between threads. A compression tool like pbzip2 splits the computation work up into separate threads. This can be useful on multiprocessor or multicore workstations, where the OS kernel will assign a thread to a processor.

Because the memory is shared, concurrency is a design issue and can complicate the code. I think pbzip2 uses pthreads, while rMAT uses Grand Central Dispatch (GCD) under OS X. GCD tries to reduce some of that work compared with pthread-based apps, but its use seems confined to OS X.
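A tiny Python sketch of the shared-memory issue, for illustration only (it is unrelated to how pbzip2 or rMAT are written, and the function name is made up): several threads update one shared counter, and a lock makes sure each update happens as a single step.

```python
import threading


def count_with_lock(n_threads=4, increments=10_000):
    """Have several threads increment one shared counter under a lock."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            # Only one thread at a time may perform the read-modify-write.
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(count_with_lock())
```

Having to think about where such locks go is the kind of complication that shared memory brings.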

Tools that enable Open MPI support might also be of interest. One such tool is MEME's meme_p. Open MPI involves passing copies of messages or chunks of data between computation units. This eliminates some of the concurrency problems that come with threading, but the overhead is higher.
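The message-passing pattern can be sketched without MPI at all. In the Python example below, which is an analogy and not Open MPI code, child Python processes stand in for MPI ranks and pipes stand in for messages: the parent scatters chunks of data to the workers, each worker computes a partial sum on its own copy, and the parent gathers the results. All names are made up for the example.

```python
import subprocess
import sys

# Worker program: read whitespace-separated integers from stdin, print their sum.
WORKER = "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"


def scatter_gather_sum(values, n_workers=3):
    """Send a chunk to each worker process and add up the partial sums."""
    chunks = [values[i::n_workers] for i in range(n_workers)]
    workers = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in chunks
    ]
    # Scatter: each worker receives its own copy of one chunk.
    for proc, chunk in zip(workers, chunks):
        proc.stdin.write(" ".join(str(v) for v in chunk))
        proc.stdin.close()
    # Gather: read one partial result back from every worker.
    partials = []
    for proc in workers:
        partials.append(int(proc.stdout.read()))
        proc.stdout.close()
        proc.wait()
    return sum(partials)


if __name__ == "__main__":
    print(scatter_gather_sum(list(range(10))))
```

No worker ever touches another worker's data, so no locks are needed, but every chunk has to be copied and sent, which is where the extra overhead comes from.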

Hopefully this provides some keywords to search for!


I've used BEDOPS in the recent past. It is amazingly fast, but I didn't realize it also supported parallel workflows - something new to try. Also, thanks for all the keywords to look into!
