Question

Benchmarking Read Alignment And Variant-Calling Algorithms (For Dummies)

6

13.4 years ago

Travis ★ 2.8k

Hi all,

I am wondering if there is a good step-by-step guide to benchmarking alignment and variant-calling software. I do understand the premise, e.g.:

Generate reads with known mutations
Align to genome
Assess accuracy
Perform variant calling
Assess accuracy
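The steps above can be sketched on a toy scale. This is only a minimal illustration, not a recommended pipeline: every function name here (`random_genome`, `plant_snps`, `simulate_reads`, `precision_recall`) is hypothetical, the reads are error-free, only SNPs are planted, and the `calls` set is a hand-made stand-in for whatever a real variant caller would report.

```python
import random

def random_genome(length, rng):
    """Toy reference: a uniformly random DNA string (an assumption, not real data)."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def plant_snps(reference, n_snps, rng):
    """Return the mutated sequence and the truth set {position: (ref_base, alt_base)}."""
    seq = list(reference)
    truth = {}
    for pos in rng.sample(range(len(reference)), n_snps):
        ref_base = seq[pos]
        alt_base = rng.choice([b for b in "ACGT" if b != ref_base])
        seq[pos] = alt_base
        truth[pos] = (ref_base, alt_base)
    return "".join(seq), truth

def simulate_reads(sample, n_reads, read_len, rng):
    """Error-free reads; the true start is recorded so alignments can be scored later."""
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(sample) - read_len + 1)
        reads.append((f"read{i}", start, sample[start:start + read_len]))
    return reads

def precision_recall(truth_set, call_set):
    """Both arguments are sets of (position, alt_base) tuples."""
    true_positives = len(truth_set & call_set)
    precision = true_positives / len(call_set) if call_set else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    return precision, recall

rng = random.Random(42)
reference = random_genome(10_000, rng)
sample, truth = plant_snps(reference, 20, rng)
reads = simulate_reads(sample, 500, 50, rng)

truth_set = {(pos, alt) for pos, (_, alt) in truth.items()}
# Stand-in for a caller's output: 15 of the 20 planted SNPs plus one false call.
calls = set(sorted(truth_set)[:15]) | {(-1, "A")}
precision, recall = precision_recall(truth_set, calls)
print(f"precision={precision:.4f} recall={recall:.2f}")
```

In a real benchmark the reads would be written to FASTQ, aligned and called with the tools under test, and the reported variants parsed back into the same `(position, alt_base)` form before comparison.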

However I have some kind of intellectual disconnect when I try to think about how to actually do it. Too much time in industry and not enough in academia I suspect!

Can anyone point me in the right direction?

Thanks in advance!

alignment snp indel algorithm • 4.6k views

updated 13.4 years ago by Torst ▴ 980 • written 13.4 years ago by Travis ★ 2.8k

score 1 · Answer 1 · 2011-11-07

M. Ruffalo and colleagues recently published "Seal", an evaluation suite for read aligners.

"With a view to comparing existing short read alignment software, we develop a simulation and evaluation suite, Seal, which simulates NGS runs for different configurations of various factors, including sequencing error, indels and coverage"

Reference:

Ruffalo M, Laframboise T, Koyutürk M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics. 2011 Oct 15;27(20):2790-6. Epub 2011 Aug 19.

http://www.ncbi.nlm.nih.gov/pubmed/21856737

Ram · Answer 2 · 2011-08-04

0

13.4 years ago

Travis ★ 2.8k

I think I have answered the aligner part:

http://www.massgenomics.org/short-read-aligners

updated 5.3 years ago by Ram 44k • written 13.4 years ago by Travis ★ 2.8k

5

This benchmark is flawed. All read mappers easily achieve a <1% error rate on simulated data (accurate mappers <0.1%), while the second plot implies something like 10%. There are also a couple of papers benchmarking the mappers, but they all have problems. The best benchmark I have seen is the one done by the 1000 Genomes Project, but it is not publicly available.
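For reference, an error rate of this kind on simulated data is usually just the fraction of mapped reads placed away from their known origin. A rough sketch follows, assuming a single reference sequence, a hypothetical `mapping_error_rate` helper, and an arbitrary tolerance of a few bases to allow for clipping or indels near the read start; the positions are made up for illustration.

```python
def mapping_error_rate(true_pos, reported_pos, tolerance=0):
    """Fraction of mapped reads placed more than `tolerance` bases from the true start.

    true_pos:     {read_name: true start position}
    reported_pos: {read_name: reported start position, or None if unmapped}
    Unmapped reads are excluded from the denominator.
    """
    mapped = {name: pos for name, pos in reported_pos.items() if pos is not None}
    if not mapped:
        return 0.0
    wrong = sum(1 for name, pos in mapped.items()
                if abs(pos - true_pos[name]) > tolerance)
    return wrong / len(mapped)

# Made-up positions: r2 is off by 2 bases, r3 is far away, r4 is unmapped.
true_pos = {"r1": 100, "r2": 250, "r3": 900, "r4": 40}
reported_pos = {"r1": 100, "r2": 252, "r3": 5000, "r4": None}

print(mapping_error_rate(true_pos, reported_pos, tolerance=5))
print(mapping_error_rate(true_pos, reported_pos, tolerance=0))
```

Whether unmapped reads count against the aligner, and how large the tolerance should be, are design choices that change the reported number, so they should be stated alongside any result.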

13.4 years ago by lh3 33k

0

I notice the simulated reads were modelled on a human sample, but the same model was also used to generate the C. elegans reads. Not sure if this could have affected anything though.

13.4 years ago by Travis ★ 2.8k

0

Also, BFAST shows up as a fast aligner??

13.4 years ago by Travis ★ 2.8k

0

There is another flaw in your analysis: "In essence, we have the 'right' answer and can use it to determine if a read is placed correctly." You cannot conclude that an alignment is false (marked in red in your bar graphs) just because a read aligns to a different location than the one it was generated from. This in fact tells you nothing unless you prove with Smith-Waterman that the optimal local alignment at that position doesn't pass the alignment criteria. It could in fact be a duplicated region.
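One way to make this check concrete is to score the read with Smith-Waterman at both the true and the reported location, and only count the placement as wrong when the reported location scores strictly worse. The sketch below is an illustration under assumptions: the scoring scheme (match +1, mismatch -1, linear gap -1), the `pad` window size, and the helper name `is_clearly_wrong` are all arbitrary choices, not taken from any particular aligner.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def is_clearly_wrong(read, reference, true_pos, reported_pos, pad=5):
    """True only if the reported locus aligns strictly worse than the true locus."""
    def window(pos):
        return reference[max(0, pos - pad): pos + len(read) + pad]
    true_score = smith_waterman_score(read, window(true_pos))
    reported_score = smith_waterman_score(read, window(reported_pos))
    return reported_score < true_score

# Toy reference containing the same 20-base unit twice (a duplicated region).
unit = "ACGTTGCAAGGCTTAACCGG"
reference = "T" * 10 + unit + "C" * 10 + unit + "G" * 10
read = unit[2:18]  # generated from the first copy, true start = 12

print(is_clearly_wrong(read, reference, true_pos=12, reported_pos=42))  # second copy
print(is_clearly_wrong(read, reference, true_pos=12, reported_pos=0))   # unrelated locus
```

A read reported at the second copy scores the same as at its origin, so it cannot be called a false alignment on position alone; the placement in the poly-T flank scores lower and can.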

13.4 years ago by Michael 55k

0

Remember that there is actually an authoritative solution, which is Smith-Waterman, so an aligner that uses Smith-Waterman as a last step should in principle yield no false positives. The flaw of that evaluation is that this wasn't checked. Therefore, the whole analysis is flawed imho, and gives you absolutely nothing, even though it contains some nice ideas.

13.4 years ago by Michael 55k