SAMtools mpileup and bcftools doesn't call InDels...
2
7
Entering edit mode
9.4 years ago

Hi Guys,

I have been researching fair bit over the last few days about SNPs/InDels calling, My project pretty much exactly summarised here. My NGS data has also been generated by cripsr/cas9 method.

I used this method

HaplotypeCaller -R genome.fa -I sorted.bam -stand_call_conf 30 -stand_emit_conf 10 -o sorted-out.vcf -L chrX --downsampling_type NONE

and it sort of worked. I do get some InDels picked up. I want to call the same InDels with samtools mpileup. Two reasons:

  1. I feel I'm not understanding how InDels/SNP being called in general
  2. The actual project is a bit more involved. we are interested in both SNP and InDels some of which will only be found in fewer reads.

This is what I did to try replicate HyplotypeCaller witih samtools mpileup

samtools mpileup -g -f genome.fa sorted.bam > sorted-mpileup.bcf 
bcftools call -c  -V snps sorted-mpileup.bcf | less

I get no InDels found however I do know that there is an InDel from both IGV and GATK HaplotypeCaller. And I also generate pile file and looked into it and it shows me that I do have and InDel...so what is going on...?

Also can anyone reiterate here why mpileup has max per-file depth set to 8000 and how does this effects my SNP/InDel calling given that I have ~ 400,000 reads in my bam files?

Also also when I do

samtool mpileup -g -f genome.fa sorted.bam > sorted-mpileup..bcf

I get DP tag included as a default and given that I have more than 8000 reads in my bam file DP tag for all SNP is 7999. But the question is what is the different when I include -t DP,DPR option during my mpileup..? because I end up with two DP tags in my vcf file with slightly different values..?

(this is really a testRun data and it is an amplicon data however in future I'll be calling SNP/InDel from genome wide NGS data)

Note please assume that I'm complete noob in variant calling and provide as much explanation as you can. much appreciate your time in advance

I understand this is really a few different questions in one, but I believe this will be a good recourse for others to refer to given that SNP/InDels calling is common topic and I couldn't find any good resources that would give a good guide for SNP/InDel calling hence me trying to include as much description about variant calling as I can and hope that others would to through comments and answers.

Cheers

SNP INDELs samtools-mpileup variant-calling • 16k views
ADD COMMENT
0
Entering edit mode

I just want to add a bit more info to this question. The main problem is that I can see an InDel in mpileup format. i.e

samtools -f genome.fa sorted.bam > sorted-mpileup.mpileup 

but when I run above line with an option `-b` meaning convert to bcf format I loose that InDel..

Can anyone at least explain what is happening there..?

Cheers

ADD REPLY
0
Entering edit mode

Kirill, Hi! just very quickly... Did you find any reasonable explanation for the disappearing indels issue? I could not find any other post which would describe the problem in such detail! This would be great if you could share any piece of information you might have found to solve it. Many thanks!

ADD REPLY
0
Entering edit mode

@gajkun I pretty much stopped (never really got to use) samtools mpile up. Mainly because it was slow and I found another (easier in my view) solution. Evgeniia (below) gave very good anwser on how to use samtools mpileup with bcftools. I don't do a lot of varinat calling, but when I do I mostly use freebayes https://github.com/ekg/freebayes

ADD REPLY
12
Entering edit mode
9.4 years ago

I used samtools 1.1 and bcftools 1.1 with the following commands to call SNP and indels. And it works fine:

  1. To call only SNPs:

    samtools mpileup --skip-indels -d 250 -m 1 -E --BCF --output-tags DP,DV,DP4,SP -f <reference genome.fa> -o <output.bcf> <list of input bam files>
    bcftools index  <output.bcf> <indexed.bcf>
    bcftools call --skip-variants indels --multiallelic-caller --variants-only  -O v <output.bcf> -o <output.vcf>
    
  2. To call only Indels:

    samtools mpileup -d 250 -m 1 -E --BCF --output-tags DP,DV,DP4,SP -f <reference genome.fa> -o <output.bcf> <list of input bam files>
    bcftools index  <output.bcf> <indexed.bcf>
    bcftools call --skip-variants snps --multiallelic-caller --variants-only  -O v <output.bcf> -o <output.vcf>
    
  3. To call both SNPs and indels:

    samtools mpileup -d 250 -m 1 -E --BCF --output-tags DP,DV,DP4,SP -f <reference genome.fa> -o <output.bcf> <list of input bam files>
    bcftools index  <output.bcf> <indexed.bcf>
    bcftools call --multiallelic-caller --variants-only  -O v <output.bcf> -o <output.vcf>
    

You can reed more about options there --> SNP call using bcftools or in the samtools/bcftools manuals.

I hope, it helps a little.

ADD COMMENT
0
Entering edit mode

@Evgeniia thanks for the answer. I do appreciate it. With samtools mpileup options you suggested, a lot of flags are options, the way I understand... e.g -d default to 250 and -m to 1 so they are actually redundant in this case. Also I'm not sure if I need to index my file before bcftools call - do I? I followed your option 2 exactly, but still can't get my InDel that I get if I run HaplotypeCaller. Can you explain to me why am I not getting the same result and also what is the difference between "multiallelic-caller" (this is what you suggested) and another option bcftools call offers "consensus-caller"?

Thanks

ADD REPLY
0
Entering edit mode

The consensus-caller assumes only biallelic sites. Multiallelic caller - multiallelic sites.

You can reads about the difference between biallelic and multiallelic sites here --> http://gatkforums.broadinstitute.org/gatk/discussion/6455/biallelic-vs-multiallelic-sites

ADD REPLY
0
Entering edit mode
7.5 years ago

Hi,

Not sure how helpful this is, but I think the disappearing deletions and insertions are due to indels only being called for sites that do not have another type of mutation (i.e. mismatch).

Anyone know how to get mpileup and bcftools to report both an indel and a mismatch at the same site regardless of coverage or percent representation of the mutation (basically to just output everything)?

ADD COMMENT
0
Entering edit mode

BBMap's variant-caller has the option to call everything:

callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fasta ploidy=2 clearfilters

By default, though, it does not mask variants because they coincide, so you don't necessarily need "clearfilters". Additionally, there's a "rarity" flag; e.g. "rarity=0.05" would not penalize variants for having a low coverage depth unless the depth was below a 0.05 allele fraction.

ADD REPLY
0
Entering edit mode

Awesome! I will give it a try. Thanks!

ADD REPLY
0
Entering edit mode

Hello,

I went ahead and had BBtools installed on our university server (there is a lot of cool stuff in there!). I ran the the following command:

callvariants.sh in=C4_DMS_genome_merge_sorted.sam out=bbtools_out.txt ref=~/Datafiles/human/genome.fa ploidy=2 rarity=0.00001 clearfilters (also tried rarity=0)

and got the following error:

Exception in thread "Thread-5" java.lang.AssertionError: 45
mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmC
MD:Z:45C104
    at stream.MDWalker.fixMatch(MDWalker.java:83)
    at stream.SamLine.toShortMatch(SamLine.java:1291)
    at stream.SamLine.toRead(SamLine.java:1971)
    at stream.SamLine.toRead(SamLine.java:1831)
    at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:270)
    at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:133)
Exception in thread "Thread-3" java.lang.AssertionError: 91
CCCCCCCmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmm
MD:Z:20A63C59
    at stream.MDWalker.fixMatch(MDWalker.java:83)
    at stream.SamLine.toShortMatch(SamLine.java:1291)
    at stream.SamLine.toRead(SamLine.java:1971)
    at stream.SamLine.toRead(SamLine.java:1831)
    at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:270)
    at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:133)
Exception in thread "Thread-4" java.lang.AssertionError: 55
mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmm
MD:Z:55C33C1T0A1C25C30
    at stream.MDWalker.fixMatch(MDWalker.java:83)
    at stream.SamLine.toShortMatch(SamLine.java:1291)
    at stream.SamLine.toRead(SamLine.java:1971)
    at stream.SamLine.toRead(SamLine.java:1831)
    at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:270)
    at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:133)

Any ideas? Thanks so much!

ADD REPLY
0
Entering edit mode

It appears that your sam file is corrupt; the MD tags are incorrect. How were they generated? For those reads which according to the cigar string appear to contain deletions, there is no corresponding "^" symbol in the MD tag. You might be able to fix them with samtools "callmd" function. Unfortunately my assertion messages are not sufficient in this case; I really should have printed out the cigar string and read name so it would be easier to see what's going on and replicate. Perhaps you could use grep to fish out a couple of those lines? For example:

grep MD:Z:55C33C1T0A1C25C30 C4_DMS_genome_merge_sorted.sam
ADD REPLY

Login before adding your answer.

Traffic: 2016 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6