Hi all,
I've used the following command to call SNPs between two species:
samtools mpileup -gf NFS_reference.fasta SFS_alignment_bowtie2.coordSorted.bam | bcftools call -Ou -mv | bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > NFS_SFS_SNPs.var.flt.vcf
The above command called ~180,000 SNPs. When I added the --skip-indels to the command below and re-submitted the job, ~700,000 SNPs were called.
samtools mpileup --skip-indels -gf NFS_reference.fasta SFS_alignment_bowtie2.coordSorted.bam | bcftools call -Ou -mv | bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > NFS_SFS_SNPs.var.flt.vcf
Does anyone have any idea as to why there is such a large difference between the number of SNPs that are called when I add the --skip-indels flag?
Thank you,
MB
Samtools ignores SNPs adjacent to indels. Or, more specifically, if there is a complex event that cannot be represented as a pure substitution or pure indel - such as GAT -> GC - Samtools will miscall it as GA -> G. I will guess that --skip-indels would unmask these hidden substitutions, though I would not have expected that to make such a big difference.
If I were you, I would look at the locations of the different calls in IGV to determine what is going on, and which version is correct (if either). I'd also suggest trying a different variant-caller to get a second opinion.