Hi,
I am using the standard mpileup/bcftools route to call variants with some publicly available Ion Torrent E.coli DH10B data. In my final results I get a couple of SNPs called where the reference base is an N character. The problem is that I have checked the entire genome and can't find any N characters! The bam file I am using was also publicly available so I am wondering if maybe the alignments were originally done against a version of the genome that contained Ns however I am not certain on the mechanics of how samtools/bcftools call the variants so am unsure if this explanation could be correct. Can anyone lend their opinion?
Thanks in advance.
If mpileup thinks that every single position is an N, that means that there's some discrepancy between the reference sequence you used, and what mpileup thinks you used. For instance, if the reference name in the fasta differs from the name in the .sam file, or if the fasta.fai was made wrong (or if that step failed, the software will yell, but it will gamely go on without it), you get that result.
I found literally two non A|T|G|C characters in the genome. These don't account for the number of variants I am seeing called against N's unfortunately...
I just performed the same operation with a publicly available MiSeq dataset for the same genome - a SNP against an N character was called for every genomic position covered by a read!
Cheers for the input. I'll have a look at it asap and will get back to you. The perils of doing too much at once!