Question

What Methods Do You Use For Short Read Mapping?

16

14.7 years ago

Biostar User ★ 1.0k

When it comes to short read mapping there is seemingly no shortage of methods or software to choose from. Yet in practice we found that some published methods did not work at all, while others behaved suboptimally.

What short read mappers do you use?
How many reads do you need to align, and what is the size of the genome that you align to?
What are the typical computational resources: parallel processes/CPU/memory required for the completion of the task?
What is your overall assessment of the procedure: easy, tedious, fun?

Note: we're primarily looking to hear of your first hand, personal experiences with any given tool.

short-read-aligner sequence • 21k views

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 14.7 years ago by Biostar User ★ 1.0k

0

Anyone tried MUMmer (or MUMmergpu)?
Bill

ADD REPLY • link updated 5.3 years ago by Ram 44k • written 13.3 years ago by Bill ▴ 20

Ram · Answer 1 · 2010-03-07

12

14.7 years ago

Allen Yu ▴ 200

We map reads from the Illumina and SOLiD platforms using BWA. http://bio-bwa.sourceforge.net/

For bacterial genomes, we choose the Illumina platform. The reads gave about 50x coverage of the ~4 Mb reference genome. Mapping took about 1 hour using 7 CPUs.

For eukaryotic genomes, we choose the SOLiD platform. The reads gave about 25x coverage of the ~500 Mb reference genome. Mapping took about 2 days using 7 CPUs.

In general, read mapping with BWA is reliable.
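
To put the coverage figures above in terms of raw sequence, here is a rough back-of-the-envelope sketch. It is not part of the original answer: it assumes mean coverage = total sequenced bases / genome size, and the function name `total_bases` is made up for illustration.

```python
def total_bases(coverage, genome_size_bp):
    """Total sequenced bases implied by a mean coverage over a genome."""
    return coverage * genome_size_bp


# Figures quoted in the answer above.
bacterial = total_bases(50, 4_000_000)       # ~4 Mb genome at 50x
eukaryotic = total_bases(25, 500_000_000)    # ~500 Mb genome at 25x

print(f"bacterial:  {bacterial / 1e6:.0f} Mb of reads")
print(f"eukaryotic: {eukaryotic / 1e9:.1f} Gb of reads")
print(f"ratio:      {eukaryotic / bacterial:.1f}x more sequence")
```

Under that assumption the eukaryotic data set is several dozen times larger than the bacterial one, which is in line with the jump from about an hour to about two days on the same 7 CPUs.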

ADD COMMENT • link updated 13 months ago by Ram 44k • written 14.7 years ago by Allen Yu ▴ 200

4

I think it's important to note that BWA is one of the few fast mapping algorithms that allows for indels. Tools like Maq and Bowtie will not map reads if there is an insertion or deletion. I have used BWA to map 75bp Illumina reads at 20x coverage to a 30Mb fungal genome with good results.

ADD REPLY • link updated 13 months ago by Ram 44k • written 14.7 years ago by Rob Syme ▴ 540

2

Novoalign and stampy are both good, perhaps better than bwa at gapped alignment. They may be a little slower, though.

ADD REPLY • link 14.1 years ago by lh3 33k

score 7 · Answer 2 · 2010-03-04

I have used tophat (which also calls bowtie). It seemed pretty straightforward, I'm not sure I would call it "fun", but I think tophat does a good job providing useful output formats. Other people around here use Eland.

I was aligning 60-mer reads - 15-20 million per lane?

This was to the mouse genome, so about 2.7 gigabases.

I don't know what computational resources were required, but I was running it on a server with 96 gigs of RAM and 16 cpus. Much more than I needed.

I'm actually not sure how long it took per lane; I just set it up and then left it while I worked on other stuff for a while.

Ram · Answer 3 · 2010-03-04

For mapping reads obtained from the SOLiD platform we use SHRiMP:

47 million 50 bp reads in colorspace
We are aligning against the human genome, ~3 billion bases
A typical runtime is 12 hours for every 1 million reads. We split the 47 million reads into about 25 datasets and run them in parallel. SHRiMP's memory use depends on the number of reads to be aligned: approx. 1.6 GB per 1 million reads.
Overall we process the entire dataset in about a day
We like using the SHRiMP program. It is simple to use, has very clear documentation and no other dependencies. Importantly, it is easy to teach people how to use it. On the other hand, it is probably slower than many other methods.
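
The numbers above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that runtime and memory scale linearly with the number of reads and that all 25 chunks run at the same time, neither of which the answer states explicitly. The variable names are illustrative.

```python
TOTAL_READS_M = 47        # million reads, from the answer
N_CHUNKS = 25             # "about 25 datasets"
HOURS_PER_M_READS = 12    # quoted runtime per 1 million reads
GB_PER_M_READS = 1.6      # quoted memory per 1 million reads

reads_per_chunk_m = TOTAL_READS_M / N_CHUNKS
hours_per_chunk = reads_per_chunk_m * HOURS_PER_M_READS
gb_per_chunk = reads_per_chunk_m * GB_PER_M_READS

print(f"reads per chunk: {reads_per_chunk_m:.2f} million")
print(f"wall time if all chunks run in parallel: {hours_per_chunk:.1f} h")
print(f"memory per chunk: {gb_per_chunk:.1f} GB")
print(f"memory for all chunks at once: {gb_per_chunk * N_CHUNKS:.1f} GB")
```

With these assumptions each chunk takes a little under 23 hours, which matches the "about a day" figure for the whole dataset.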

Ram · Answer 4 · 2010-03-09

5

14.7 years ago

Chris Miller 22k

It's worth noting that a lot of people also use Novoalign

ADD COMMENT • link updated 13 months ago by Ram 44k • written 14.7 years ago by Chris Miller 22k

Ram · Answer 5 · 2010-03-04

4

14.7 years ago

Pierre Lindenbaum 164k

I implemented my own suffix-array algorithm (perfect matches only) because bowtie was too slow for my needs. I wrote about it here: http://plindenbaum.blogspot.com/2010/01/elementary-school-for-bioinformatics.html

It aligned all the 60-mers on each side of each SNP from dbSNP (17E6 * 2 sequences) in about 12 hours.
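
For readers unfamiliar with the idea, exact matching against a suffix array can be sketched as below. This is a toy illustration, not the implementation described in the linked post: the construction is the naive sort-all-suffixes approach (far too slow and memory-hungry for a real genome), and the names `build_suffix_array` and `find_exact` are invented here.

```python
def build_suffix_array(text):
    """Naive suffix array: suffix start positions sorted by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, sa, query):
    """Return sorted start positions where query occurs exactly in text."""
    m = len(query)

    # Lower bound: first suffix whose m-length prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-length prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


genome = "ACGTACGTTACG"
sa = build_suffix_array(genome)
print(find_exact(genome, sa, "ACG"))
print(find_exact(genome, sa, "TTT"))
```

All suffixes that start with the query sit next to each other in the sorted array, so two binary searches are enough to find every exact hit.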

ADD COMMENT • link updated 13 months ago by Ram 44k • written 14.7 years ago by Pierre Lindenbaum 164k

5

Actually, for perfect matching the BWT-based algorithm achieves almost the best balance between speed and memory. It can be extremely fast given huge memory, or still fast given limited memory. A suffix array costs too much memory, and the standard suffix array has worse theoretical time complexity. Nonetheless, as those BWT aligners are not specifically designed for perfect matches, they usually come with a large constant factor. A specialized algorithm can be faster.
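
As a rough illustration of what "BWT-based" exact matching means, here is a minimal backward-search sketch. It is a textbook toy, not the code of any aligner mentioned in this thread: the BWT is built by naive suffix sorting and occurrence counts are computed by scanning the string, whereas real tools use sampled, compressed tables. It assumes the text does not contain the sentinel character `$`; the function names are invented for this example.

```python
from collections import Counter


def build_fm_index(text):
    """Return (bwt, first) for text, using '$' as the end sentinel."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in text strictly smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    return bwt, first


def count_exact(bwt, first, query):
    """Count exact occurrences of query via backward search."""
    lo, hi = 0, len(bwt)
    for ch in reversed(query):
        if ch not in first:
            return 0
        # Occurrences of ch in bwt[:lo] and bwt[:hi], counted naively.
        lo = first[ch] + bwt.count(ch, 0, lo)
        hi = first[ch] + bwt.count(ch, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt, first = build_fm_index("ACGTACGTTACG")
print(count_exact(bwt, first, "ACG"))
print(count_exact(bwt, first, "TTT"))
```

The search needs one step per query character regardless of genome size, which is where the favorable time complexity comes from; the per-step cost of the occurrence lookup is the constant factor mentioned above.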

ADD REPLY • link 14.1 years ago by lh3 33k

1

That's a really nice description of the problem and code, Pierre!

ADD REPLY • link 14.7 years ago by Istvan Albert 102k

0

that's really neat Pierre!

ADD REPLY • link 14.7 years ago by Istvan Albert 102k

0

So you're only interested in exact matches, did I get that right? If not, what is your strategy to find inexact matches?

ADD REPLY • link 14.7 years ago by Konrad ▴ 80

0

yes, only exact matches

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 164k

Ram · Answer 6 · 2010-03-04

3

14.7 years ago

Giovanni M Dall'Olio 28k

A group in my institute has developed a tool called GEM for mapping short reads and, more generally, for working with next-gen sequencing data: mapping cDNAs, finding splicing isoforms, etc. I have never used it directly, but I have attended some talks on it and it seems convincing.

In particular, to map short reads you should use the tool gem_mapper.

ADD COMMENT • link updated 13 months ago by Ram 44k • written 14.7 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 7 · 2010-03-07

I've only used bowtie, but it seems to be extremely fast and makes use of multiple cores with no extra work on my part. It also builds an index for the reference sequence, which can be re-used after the first build.

This is mapping to Arabidopsis thaliana, up to 5 or so Gigs of raw reads, so fastq of 4x that size. Using a pretty standard 8 core machine, it's relatively painless.

UPDATE:

We've also found gsnap to be excellent. It can do fasta/fastq, spliced alignment (RNA-Seq), BS-Seq, and general mapping very quickly. It is a bit slower than bowtie but handles indels much better. Though it can read fastq files, it does not use the quality information so it is best to trim reads before sending through gsnap.

Ram · Answer 8 · 2010-10-31

We develop MOSAIK for next-generation sequencing data, and apply it for a bunch of human genetic analyses. http://code.google.com/p/mosaik-aligner/

10 million 36 bp Illumina reads against the human reference
in an hour
8 processors / 7.2 GB (there is an option to reduce memory to under 3 GB, at a small cost in accuracy)

MOSAIK is still actively developed and maintained. Please try MOSAIK and give us your feedback.

Ram · Answer 9 · 2010-11-01

I think what is missing is a larger-scale evaluation of the merits (efficiency, speed, sensitivity) of the different algorithms.

See Li & Homer for a recent review of the different methods.

Furthermore, bfast is missing from this list.

It is also worth noting that many of the aligners allow only for a fixed number of mismatches. This is ok for reads of constant length, but when it comes to reads of variable length (454), I would prefer a percent-identity parameter. So, I also tried blat and lastz for mapping. They are not as fast as the dedicated short read aligners, but by tuning the seeding parameters of esp. lastz one can gain good control over sensitivity vs. speed.
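
To make the difference concrete, the sketch below compares a fixed mismatch budget with a percent-identity threshold across read lengths. The helper names and the example read lengths are made up for illustration, and "identity" here simply means matching bases divided by read length, ignoring indels.

```python
def identity_pct(read_length, mismatches):
    """Percent identity of an ungapped alignment with the given mismatches."""
    return 100.0 * (read_length - mismatches) / read_length


def max_mismatches(read_length, min_identity_pct):
    """Largest mismatch count that keeps identity >= min_identity_pct.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    return read_length * (100 - min_identity_pct) // 100


# A fixed budget of 2 mismatches is strict for long reads, loose for short ones.
for length in (36, 100, 400):
    print(length, f"{identity_pct(length, 2):.1f}% identity at 2 mismatches")

# A 95% identity threshold scales the budget with read length instead.
for length in (36, 100, 400):
    print(length, max_mismatches(length, 95), "mismatches allowed at 95%")
```

With variable-length reads, the second form keeps the stringency comparable between a short and a long read, which is the behavior asked for above.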

A hybrid approach might be interesting: a workflow that combines aligners of different speeds and applies them in the order of their sensitivity.
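
One possible shape for such a workflow is sketched below. Everything in it is hypothetical: the two "aligners" are toy functions over a short string, standing in for real tools, and the ordering (fastest and least sensitive first, with only the leftover reads passed on) is an assumption about what the hybrid approach would look like.

```python
REFERENCE = "ACGTACGTTACGGATC"


def exact(read):
    """Toy fast aligner: exact substring match, or None if unmapped."""
    pos = REFERENCE.find(read)
    return pos if pos >= 0 else None


def one_mismatch(read):
    """Toy slower, more sensitive aligner: allows at most one mismatch."""
    for i in range(len(REFERENCE) - len(read) + 1):
        window = REFERENCE[i:i + len(read)]
        if sum(a != b for a, b in zip(read, window)) <= 1:
            return i
    return None


def cascade(reads, stages):
    """Run each stage only on reads left unmapped by earlier stages."""
    placed, remaining = [], list(reads)
    for name, align in stages:
        still_unmapped = []
        for read in remaining:
            pos = align(read)
            if pos is None:
                still_unmapped.append(read)
            else:
                placed.append((read, name, pos))
        remaining = still_unmapped
    return placed, remaining


stages = [("exact", exact), ("1-mismatch", one_mismatch)]
placed, unmapped = cascade(["GTTACC", "GTTACG", "AAAAAA"], stages)
print(placed)
print(unmapped)
```

The expensive stage only ever sees the reads the cheap stage could not place, so its cost is paid on a fraction of the data.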

Ram · Answer 10 · 2011-11-04

2

13.1 years ago

Erik ▴ 20

bwasw is good for 454, pacbio and iontorrent. ssaha2 is good as well, esp for iontorrent for some reason.

the new bowtie2 rocks for high-qual illumina/solid reads... but bwasw and ssaha2 are still best for lower quality and/or shorter reads.... not sure why, but i align more reads, and with more valid/accurate mapping qualities, with those.

the older bowtie is good for perfect match /quantification stuff on high qual reads.

ADD COMMENT • link 13.1 years ago by Erik ▴ 20

1

Bowtie2 does not see enough hits, which is the price of being so fast, and thus cannot reliably distinguish good hits from bad ones. To avoid giving too many false alignments with high mapQ, it has to assign zero mapQ to a lot of "unique" hits. This is basically what you have observed. The bowtie2 poster claims it is the most sensitive because it counts alignments of all mapQ. If we discard mapQ zero hits, which is frequently what we do, the sensitivity of bowtie2 is worse than that of smalt, bwa, bwa-sw, gsnap and novoalign.
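
Discarding mapQ zero hits is easy to do on SAM output, where the mapping quality is the fifth tab-separated column and bit 0x4 of the flag marks an unmapped read. The sketch below is only an illustration: the SAM records are invented, and `keep_confident` is a made-up helper, not part of any tool discussed here.

```python
def keep_confident(sam_lines, min_mapq=1):
    """Return names of mapped reads whose MAPQ is at least min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:                    # read is unmapped
            continue
        if mapq >= min_mapq:
            kept.append(fields[0])
    return kept


# Invented toy records: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t16\tchr2\t50\t23\t4M\t*\t0\t0\tACGT\tIIII",
]

confident = keep_confident(toy_sam)
print(confident)
print(f"{len(confident)} of 4 reads kept")
```

Counting sensitivity before and after a filter like this is the difference between "alignments of all mapQ" and the stricter measure described above.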

ADD REPLY • link 13.1 years ago by lh3 33k

0

See the following post for the opinions of both the bowtie developers and me: http://seqanswers.com/forums/showthread.php?t=15200&goto=newpost

ADD REPLY • link updated 13 months ago by Ram 44k • written 13.1 years ago by lh3 33k

Ram · Answer 11 · 2011-08-04

0

13.3 years ago

Stuart Inglis • 0

Real Time Genomics has a freely available suite of tools for Mac OS X, Linux and Windows.

It includes an extremely accurate read mapping and alignment module, for both paired end and single end.

There is a comprehensive manual, and the commands are very simple:

rtg format ... data...
rtg map.... SAM files output to a directory...

http://www.realtimegenomics.com/Blog/Read-mapping-on-large-and-small-RAM-machines

And you can download the software here.

ADD COMMENT • link updated 13 months ago by Ram 44k • written 13.3 years ago by Stuart Inglis • 0

0

From this link:

"The offer of a free RTG Investigator download for personal use is no longer available."

ADD REPLY • link 12.4 years ago by Malachi Griffith 20k