Question

Bowtie Vs Bwa For Indels

4

Entering edit mode

10.7 years ago

Varun Gupta ★ 1.3k

Hello Everyone

I am working on some yeast strains and I am interested in INDELS. The bam files I have are the ones aligned with bowtie as aligner. I ran GATK pipeline for calling snps and INDELS using Unified genotyper but I did not get any INDELS in my vcf file produced. Doing some research as to why I am not getting any INDELS, I came across this post

What methods do you use for short read mapping?

One of the comments says

"I think it's important to note that BWA is one of the few fast mapping algorithms that allows for indels. Tools like Maq and Bowtie will not map reads if there is an insertion or deletion. I have used BWA to map 75bp Illumina reads at 20x coverage to a 30Mb fungal genome with good results."

So I went back to my bam files and tried to query the CIGAR string for I or D but didn't found anything. Does this mean bowtie won't report any reads where insertion or deletion is taking place.

Also I want to know that software's which calls for INDELS uses cigar string to tell whether INDELS are present or not or they use completely different approach?? Because If they make use of CIGAR string from bam file then it is easy to look for INDELS just in the bam file though a vcf file is more appropriate for obvious reasons like more detailed explanation.

So if bowtie does not report INDELS should i use bwa to align my reads and then look for INDELS.

Hope to hear from you soon.

Regards

Varun

bowtie bwa • 11k views

ADD COMMENT • link updated 10.7 years ago by lh3 33k • written 10.7 years ago by Varun Gupta ★ 1.3k

0

Entering edit mode

How long are the INDELs you're looking for?

ADD REPLY • link 10.7 years ago by David Langenberger 11k

Ram · Answer 1 · 2014-04-08

Bowtie1 does not report any indels. If an indel is in the middle of a read, the read will be unmapped. If the indel is close to the end of a read, it will be mapped with multiple mismatches towards the end. Bowtie2 and the bwa series are able to map reads with indels.

Most variant callers can only call indels when the mapper reports them. GATK HaplotypeCaller (HC) is an exception. It does local assembly and is able to call an indel even if no reads are mapped with the indel. HC indel calling might work with bowtie1 alignment at least in theory, but even if it works, the sensitivity is probably low. Anyway, for variant calling, don't use bowtie1. Use bowtie2/bwa/bwa-mem or other modern mappers.

BTW, whether to allow indels in seeds is not critical. Both blasr and bwa-mem use exact seeds and they work well with PacBio reads with ~15% indel error rate (or >20% for read-to-read mapping). As to chimeric alignments, most Sanger read mappers report them. In the local multi-hit mode, bowtie2 and a few other NGS mappers can find chimeric alignments as well. Bwa-sw has been reporting chimeric alignments by default since 2010. Bwa-mem does that, too.

Ram · Answer 2 · 2014-04-08

5

Entering edit mode

10.7 years ago

David Langenberger 11k

I would recommend you Segemehl: A Fast One-Stop-Shop Mapping Tool. It allows insertions, deletions and mismatches not only in the complete alignment, but also in the seed.

Note: If you have long insertions, it will split the reads instead of writing this information to the CIGAR string. It behaves like that, because the split-read option (-S) can not only find insertions, but also greater genome rearrangements. The mapping of split reads is thus not fixed to one chromosome, or one direction. And changing chromosomes, etc. cannot be stored in the CIGAR-String.

ADD COMMENT • link 10.7 years ago by David Langenberger 11k

3

Entering edit mode

Here are some comparisons of different tools:

Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermueller J: "Fast mapping of short sequences with mismatches, insertions and deletions using index structures", PLoS Comput Biol (2009) vol. 5 (9) pp. e1000502

enter image description here

Christian Otto, Peter F. Stadler, and Steve Hoffmann: 'Lacking alignments? The next generation sequencing mapper segemehl revisited', Bioinformatics 2014 : btu146v1-btu146 (2014)

enter image description here

ADD REPLY • link 10.7 years ago by David Langenberger 11k

1

Entering edit mode

Hmm... When I read benchmarks, I leave out the mapper developed by the authors. Segemehl has rarely (if ever) been evaluated in other papers. Even the fairly complete Bioinformatics review (Fonseca et al 2012) has not included it. I intended to try it in the bwa-mem manuscript, but did not have a machine with enough RAM. It is always hard for me to put Segemehl at the right position.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.7 years ago by lh3 33k

1

Entering edit mode

I think 60 GB RAM (what one needs for the human genome) is not a problem any more. Running NGS experiments for several thousand $, but not having the money for some GB of memory seems odd to me. But we had that discussion before and I know that you don't have a machine with enough memory! :) Nevertheless, you don't have to trust the benchmark, nor do you have to use segemehl. To me, the benchmark looks quite impressive and if it is correct, I don't care about any reviews, or if the tool is well known. If it performs better and I get better results, I take it.

ADD REPLY • link 10.7 years ago by David Langenberger 11k

score 3 · Answer 3 · 2014-04-07

3

Entering edit mode

10.7 years ago

Istvan Albert 102k

Does this mean bowtie won't report any reads where insertion or deletion is taking place

Just a clarification, it is not that it won't report the insertion or deletion, what happens is that the algorithm is unable align that read at all.

This means that a read with an INDEL will be reported as unmapped with no alignment information.

ADD COMMENT • link 10.7 years ago by Istvan Albert 102k

0

Entering edit mode

@Istvan : Hi, Have you seen a case where reads are mapped with bowtie and CIGAR string ever shows I or D in it. Also if I or D are present in the CIGAR string (mapped with other aligner), does snp calling softwares make use of the CIGAR string to make the vcf file for INDELS??

Thanks

ADD REPLY • link 10.7 years ago by Varun Gupta ★ 1.3k

2

Entering edit mode

Well since the CIGAR string does not even list a mismatch (M means match or mismatch) the SNP calling software can't use the CIGAR strings to determine even simple polymorphisms.

ADD REPLY • link 10.7 years ago by Istvan Albert 102k