Hi, I'm doing some variant-calling, and I was wondering what is the difference between:
merging BAM files with samtools merge and call samtools mpileup on the merged BAM
and
skip the merge step and call mpileup with all the BAM files?
I did both and expected the same results, but I got 15% of identical SNPs, 60% close to identical but lower quality with the samtools merge step, and the others SNPs are not find by one or the other method.
Is it possible that the quality filter applied in mpileup affect each file separately and cause this behaviour or is it something else?
More coverage gives greater confidence in variant calls, and decreases the likelihood that only a single allele will be present for heterozygous variants. I would not expect merging several 100x coverage datasets to change the result much, but merging several 5x coverage datasets will yield an extremely large difference.
But since the RG tags are identical for one sample, doesn't mpileup merge files itself? And so the coverage should be identical compared to the step with merge no?
that is the part that i don't know if mpileup treats 2 sets of RG tags as if you have passed 2 separate files to mpileup, generating 1 variant call file, but with calls for each sample. Best way just to it until getting vcf and see how many individuals you get in the end
The strange things with this is that I have more coverage on a single SNPs without merge the files before. Here one example (same for all SNPs common in both methods) :
This VCF has only one sample? Hmmm, looks to me like something strange is going on.
Yep, only one sample (same RG tag for each Lane)
What you are describing should not happen. But at least, please make sure you are using the latest versions of all the software (I assume that would be samtools and bcftools); that can often resolve issues. It might also help if you could post the exact command lines you used in each case.
that reminds me also of my old problem a bit using samtools 1.19 and bcftools 1.2. Either use both of the 1.19 or both at 1.2 versions <- you can also experiment with that, and some crazy bat crap is happening sometimes
I would just like to question this: our empirical data suggests that increased depth of coverage does not increase confidence in the calls - it is a common misconception that more is better, in terms of variant calling. Due to the fact that NGS reads contain so much error, increase depth of coverage can, I believe, actually make it more difficult for the variant callers to make an accurate call.
Our data suggests that ideal call range for germline variants is 18-30 read depth.