Question

How To Think In Parallel In Bioinformatics

16

14.1 years ago

Andrea_Bio ★ 2.8k

Hi

I've been a 'standard programmer' for many years but have recently moved into bioinformatics, and I can see that the types of programs I need to write now are fundamentally different from what I used to write: the huge volume of data and the amount of processing performed mean I need to shift my mindset when I design programs from a 'serial' design to a 'parallel' design.

I've had a look for books on parallel computing and they are WAY too technical for what I'm looking for. I'm looking for tutorials/guidelines, not necessarily specific to bioinformatics, to make me start 'thinking in parallel' if that makes sense.

I was also wondering if there was software available where you could emulate a multi-processor environment. I was thinking that perhaps if I started working in a multi-processor environment I would start thinking that way.

As a basic example, I have a Perl script I've been passed to work on which sends 12 million SNPs to one function, waits patiently for this function to return and then shuttles the 12 million SNPs off somewhere else. I'd never heard the term before today, but this is an embarrassingly parallel problem. It's crying out for parallelism (or parallelisation - see, I don't even know the right words!) but I would have implemented the code in exactly the same way myself, as I don't think parallel.

So really, I'm looking for books/tutorials/websites/guides to help me think parallel. I'd also value other people's insights and experiences but I'm aware that this is a vague question and that this forum appreciates questions where you can have a direct answer and not simply discuss things.

Thanks in advance for your help

parallel • 6.7k views

updated 2.8 years ago by Ram 44k • written 14.1 years ago by Andrea_Bio ★ 2.8k

4

Somewhat tongue in cheek but I found the following a great "guide" http://teddziuba.com/2010/10/taco-bell-programming.html

updated 5.2 years ago by Ram 44k • written 14.1 years ago by Istvan Albert 101k

2

Cross-posted to Quora: http://www.quora.com/How-to-think-parallel-in-bioinformatics

updated 5.2 years ago by Ram 44k • written 14.1 years ago by Mndoci ★ 1.2k

1

If you buy a quad-core or 6-core machine to work off of, you won't have to emulate a multi-processor environment in software. That's entry-level hardware these days.

14.1 years ago by David Quigley 11k

0

Thank you, everyone, for your answers. It is much appreciated.

14.1 years ago by Andrea_Bio ★ 2.8k

score 11 · Answer 1 · 2010-11-05

A little searching will get you tons of how-tos talking about the map-reduce paradigm, but I've always found the terminology confusing. It's easier for me to think about it as a three-step process: split, process, and combine.

Split means figure out how to subset your data into chunks.
Process means run some scripts to do the computation on each chunk in parallel.
Combine means take the results and put them back together.

This is going to be the basic framework for pretty much all of your parallel scripts. Wrap your head around this and you're 90% of the way there.

Some other scattered thoughts:

Don't worry. It takes time to train yourself to see problems in this way. Once you get it, though, you start seeing the patterns everywhere.
If data is relatively small, don't waste time parallelizing your code. As a bioinformatician, your goal is usually to get the data munged quickly, rather than producing perfect and optimized code.
If you've got big data that's easy to split (like your 12M SNPs), think about what level is easiest to split at. You can physically split the data into separate files, then launch the same script on subsets of the data. Alternatively, you can parcel out different threads or processes from within your scripts.

I usually use the first approach. Got an 8-core machine (or a cluster)? Use split to chop the data into smaller chunks, then farm out the processes using a for loop or GNU parallel. Cat the results together and you're in business.
Since you're working with genomic data, processing each chromosome independently is often an easy and natural way to split the data.
Systems like Hadoop are great too, but learning a system like that is probably overkill, at least at first.
If you're using R on a single multicore machine, I can recommend the doMC/foreach packages. They're fairly intuitive and work well.
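The split, process, combine steps described above can be sketched in plain shell roughly as follows. This is only an illustration under made-up assumptions: the file names (snps.txt, chunk_*, result.txt) are hypothetical, the toy input has 12 lines instead of 12 million, and tr stands in for whatever script does the real per-chunk work.

```shell
#!/bin/sh
# Split / process / combine on a toy input (all file names are hypothetical).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy input: 12 "SNP" ids, one per line, standing in for the 12M-line file.
i=1
while [ "$i" -le 12 ]; do
    printf 'rs%02d\n' "$i"
    i=$((i + 1))
done > snps.txt

# Split: 3 lines per chunk -> chunk_aa, chunk_ab, chunk_ac, chunk_ad
split -l 3 snps.txt chunk_

# Process: one background job per chunk; each job writes its own output file.
# tr is a stand-in for the real per-chunk script.
for f in chunk_??; do
    tr 'a-z' 'A-Z' < "$f" > "$f.out" &
done
wait   # block until every background job has finished

# Combine: the glob expands in sorted order, so the input order is preserved.
cat chunk_??.out > result.txt
```

Because every job reads its own chunk and writes its own output file, the jobs never need to coordinate with each other, which is what makes this kind of problem easy to parallelize.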

score 11 · Answer 2 · 2010-11-05

11

14.1 years ago

Kraut ▴ 230

An important part of any parallel programming is thoughtful domain decomposition. In biology you should look for natural parallelism in the objects themselves. For next-generation sequencing, that means parallel by sample, flowcell, tile, or lane. For other problems it's by gene, chromosome, domain, or conformation.

Once you've decomposed the problem, you can choose a technology that enables you to express that parallelism by message passing, map reduce, or batch processing.

14.1 years ago by Kraut ▴ 230

0

Couldn't have put it better. Ben Langmead speaks quite eloquently about this in his talks on Myrna. The good thing about Hadoop and similar frameworks is that you don't really have to worry about the non-domain aspects of parallelism.

14.1 years ago by Mndoci ★ 1.2k

0

Nicely put - I'd also add the natural parallelism inherent in statistical approaches (resampling, permutation, parametrization, cross-validation)

14.1 years ago by Hanif Khalak ★ 1.3k

score 5 · Answer 3 · 2010-11-05

5

14.1 years ago

Bio_X2Y ★ 4.4k

If you're just doing once-off jobs for yourself, I'd echo Istvan's sentiments and suggest you make use of simple techniques where feasible (if that's what he's suggesting!)

e.g. in the SNP example, if you have four processors, maybe you can just break the file into 4 smaller files, each with 3 million SNPs, and kick off four instances of the SNP script, each reading from its own input file and writing to its own output file.

I've rarely dabbled in fully-fledged parallel programming, but I've always found it time-consuming and error-prone. I almost always try to avoid it these days, opting instead to invoke smaller jobs in parallel from the command line.
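One way to kick off four instances from the command line is sketched below. It uses xargs -P, which is a swapped-in technique rather than something named in this answer, and it assumes an xargs that supports -P (GNU and BSD versions do). The file names and the inline line-counting command are hypothetical stand-ins for the real input files and SNP script.

```shell
#!/bin/sh
# Four workers, each with its own input and output file (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Four toy input files standing in for the four 3-million-SNP files.
for n in 1 2 3 4; do
    printf 'snp_%s_a\nsnp_%s_b\nsnp_%s_c\n' "$n" "$n" "$n" > "part$n.txt"
done

# xargs -P 4 keeps at most four jobs running; -n 1 passes one file per job.
# The inline sh -c script stands in for the real SNP script.
printf '%s\n' part1.txt part2.txt part3.txt part4.txt |
    xargs -n 1 -P 4 sh -c 'wc -l < "$1" | tr -d " " > "$1.count"' sh

# Each job wrote to its own file, so no two jobs share an output.
cat part?.txt.count > counts.txt
```

Compared with starting every job with & at once, the -P limit matters when there are more input files than processors.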

14.1 years ago by Bio_X2Y ★ 4.4k

0

I agree: when exploiting straightforward data parallelism, it pays to think outside the bun. Not all HPC can be done on one machine and one datastore, though, which might take you to MPI and beyond.

14.1 years ago by Hanif Khalak ★ 1.3k

Ram · Answer 4 · 2010-11-08

I think you'll find that doing things in parallel requires you to learn a lot of technicalities. I'd advise you to concentrate on the low-hanging fruit. To me, that means things like simple shell scripts running sub-processes in parallel:

# process databases in parallel
for d in db1.fasta db2.fasta db3.fasta; do
    blastall -i input.fasta -d "$d" -o "$d.tab" -m 8 &
done
wait
# process output here

Or, if you can structure your pipeline in a makefile, you can use 'make -j' to build your targets in parallel. (A makefile uses a declarative language to express which targets (i.e. files) depend on which others, and will recursively build the dependencies of the target you specify. This lets 'make' parallelize a lot of the task automatically.)
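As a rough sketch of the makefile approach, the script below writes a tiny Makefile and runs it with make -j 2. Everything specific here is made up for illustration: the file names (chr1.txt, chr2.txt, combined.txt), the use of tr as the per-file step, and the choice of two jobs. The %.out pattern rule assumes GNU make.

```shell
#!/bin/sh
# Let make run independent targets in parallel (hypothetical file names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two toy inputs standing in for per-chromosome data files.
printf 'ACGT\n' > chr1.txt
printf 'GGCC\n' > chr2.txt

# Write the Makefile with printf so the tabs that recipes require are explicit.
{
    printf 'all: combined.txt\n'
    # combined.txt depends on both per-chromosome outputs.
    printf 'combined.txt: chr1.out chr2.out\n'
    printf '\tcat chr1.out chr2.out > combined.txt\n'
    # Pattern rule: build any X.out from X.txt (tr is a stand-in step).
    printf '%%.out: %%.txt\n'
    printf '\ttr ACGT TGCA < $< > $@\n'
} > Makefile

# chr1.out and chr2.out do not depend on each other, so with -j 2 make
# may build them at the same time; combined.txt waits for both.
make -j 2 > /dev/null
```

The Makefile only states which file depends on which; make works out from that which steps are independent and can run side by side.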

Writing effective multi-threaded programs is quite difficult, and introduces lots of new ways for your program to fail. For getting good performance, you usually need to spend a lot of time making sure everything is properly balanced.

Ram · Answer 5 · 2014-11-08

The wonderful thing about bioinformatics is that many of the problems are embarrassingly parallelizable. This makes it possible to reuse your sequential programming skills and let a general parallelizer such as GNU Parallel deal with the parallelization. Even if you wrote a specialized parallel tool, you would often not get any noticeable speed improvement over GNU Parallel for that kind of problem.

GNU Parallel has been used by bioinformaticians for years, and several of the options have been developed with bioinformatics in mind. The hard part is to understand how to use it most efficiently. These examples should get you started:

Ram · Answer 6 · 2010-11-07

A good book that covers a little of every aspect of parallel computing is "Scalable Parallel Computing" (Amazon link).

The book introduces concepts such as SIMD (single instruction, multiple data), which sounds appropriate for your situation.

While the 2 reviews are from people who did not like the book, for 3-4 bucks what do you have to lose?! :-)

Ram · Answer 7 · 2010-11-08

0

14.1 years ago

Austinlew ▴ 310

This is a short tutorial showing how to write a multi-threaded Perl program. I hope you will get some ideas from it.

updated 5.2 years ago by Ram 44k • written 14.1 years ago by Austinlew ▴ 310