I am working on some yeast strains and I am interested in INDELS. The bam files I have are the ones aligned with bowtie as aligner.
I ran GATK pipeline for calling snps and INDELS using Unified genotyper but I did not get any INDELS in my vcf file produced. Doing some research as to why I am not getting any INDELS, I came across this post
"I think it's important to note that BWA is one of the few fast mapping algorithms that allows for indels. Tools like Maq and Bowtie will not map reads if there is an insertion or deletion. I have used BWA to map 75bp Illumina reads at 20x coverage to a 30Mb fungal genome with good results."
So I went back to my bam files and tried to query the CIGAR string for I or D but didn't found anything. Does this mean bowtie won't report any reads where insertion or deletion is taking place.
Also I want to know that software's which calls for INDELS uses cigar string to tell whether INDELS are present or not or they use completely different approach??
Because If they make use of CIGAR string from bam file then it is easy to look for INDELS just in the bam file though a vcf file is more appropriate for obvious reasons like more detailed explanation.
So if bowtie does not report INDELS should i use bwa to align my reads and then look for INDELS.
Bowtie1 does not report any indels. If an indel is in the middle of a read, the read will be unmapped. If the indel is close to the end of a read, it will be mapped with multiple mismatches towards the end. Bowtie2 and the bwa series are able to map reads with indels.
Most variant callers can only call indels when the mapper reports them. GATK HaplotypeCaller (HC) is an exception. It does local assembly and is able to call an indel even if no reads are mapped with the indel. HC indel calling might work with bowtie1 alignment at least in theory, but even if it works, the sensitivity is probably low. Anyway, for variant calling, don't use bowtie1. Use bowtie2/bwa/bwa-mem or other modern mappers.
BTW, whether to allow indels in seeds is not critical. Both blasr and bwa-mem use exact seeds and they work well with PacBio reads with ~15% indel error rate (or >20% for read-to-read mapping). As to chimeric alignments, most Sanger read mappers report them. In the local multi-hit mode, bowtie2 and a few other NGS mappers can find chimeric alignments as well. Bwa-sw has been reporting chimeric alignments by default since 2010. Bwa-mem does that, too.
ADD COMMENT
• link
updated 5.2 years ago by
Ram
44k
•
written 10.7 years ago by
lh3
33k
Note: If you have long insertions, it will split the reads instead of writing this information to the CIGAR string. It behaves like that, because the split-read option (-S) can not only find insertions, but also greater genome rearrangements. The mapping of split reads is thus not fixed to one chromosome, or one direction. And changing chromosomes, etc. cannot be stored in the CIGAR-String.
Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermueller J: "Fast mapping of short sequences with mismatches, insertions and deletions using index structures", PLoS Comput Biol (2009) vol. 5 (9) pp. e1000502
Christian Otto, Peter F. Stadler, and Steve Hoffmann: 'Lacking alignments? The next generation sequencing mapper segemehl revisited', Bioinformatics 2014 : btu146v1-btu146 (2014)
Hmm... When I read benchmarks, I leave out the mapper developed by the authors. Segemehl has rarely (if ever) been evaluated in other papers. Even the fairly complete Bioinformatics review (Fonseca et al 2012) has not included it. I intended to try it in the bwa-mem manuscript, but did not have a machine with enough RAM. It is always hard for me to put Segemehl at the right position.
ADD REPLY
• link
updated 5.2 years ago by
Ram
44k
•
written 10.7 years ago by
lh3
33k
1
Entering edit mode
I think 60 GB RAM (what one needs for the human genome) is not a problem any more. Running NGS experiments for several thousand $, but not having the money for some GB of memory seems odd to me. But we had that discussion before and I know that you don't have a machine with enough memory! :) Nevertheless, you don't have to trust the benchmark, nor do you have to use segemehl. To me, the benchmark looks quite impressive and if it is correct, I don't care about any reviews, or if the tool is well known. If it performs better and I get better results, I take it.
@Istvan : Hi, Have you seen a case where reads are mapped with bowtie and CIGAR string ever shows I or D in it.
Also if I or D are present in the CIGAR string (mapped with other aligner), does snp calling softwares make use of the CIGAR string to make the vcf file for INDELS??
Well since the CIGAR string does not even list a mismatch (M means match or mismatch) the SNP calling software can't use the CIGAR strings to determine even simple polymorphisms.
How long are the INDELs you're looking for?