With samtools mpileup you can use multiple .bam files as inputs. When samtools computes depth, are these files simply concatenated, or is there a special way samtools synthesizes the data from multiple .bam files?
The samtools GitHub FAQ seems to have something about this, but I wasn't exactly sure how to interpret what it says:
- Between single- and multi-sample variant calling, which is preferred?
By using multi-sample calling, we gain power on SNPs shared between samples, but lose power on singleton SNPs. Here is a way of thinking of this. Suppose we have 1% false positive rate (FPR) for variant calling from one sample. If we call SNPs from 100 samples separately and then combine the calls, the FPR would be around 10-20% (not 100% because more SNPs are found given 100 samples). To retain an acceptable FPR on singletons, we have to be more stringent on each sample and thus lose power. Combining single-sample calls naively would not increase power on shared SNPs. This is where multi-sample calling does better: by taking the advantage of correlation between samples, we are able to call a SNP if it appears in multiple samples, but too weak to call in each sample individually. Joint calling is particularly preferable if we have multiple low-coverage samples for which single-sample calling does not work well. It is also able to reveal some artifacts only detectable with many samples.
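One way to read the arithmetic in that answer is sketched below in Python. This is only my interpretation, not something the FAQ spells out: I assume false positives in different samples are independent, so they pile up linearly when calls are combined, while true SNPs are largely shared, so the set of true sites grows much more slowly. The function name, the 1000 calls per sample, and the `true_union_growth` factor are all hypothetical values chosen for illustration.

```python
def combined_false_fraction(n_samples, calls_per_sample, per_sample_fpr, true_union_growth):
    """Fraction of false calls after naively taking the union of per-sample calls.

    Assumptions (illustrative, not from the FAQ):
      - false positives are independent across samples, so none of them overlap
      - the union of true SNPs is `true_union_growth` times one sample's true SNPs
    """
    false_per_sample = calls_per_sample * per_sample_fpr
    true_per_sample = calls_per_sample - false_per_sample

    false_union = n_samples * false_per_sample        # grows linearly with samples
    true_union = true_union_growth * true_per_sample  # grows slowly: SNPs are shared

    return false_union / (false_union + true_union)


if __name__ == "__main__":
    # Hypothetical growth factors: 100 samples yield 5x or 10x the true SNPs of one sample.
    for growth in (5, 10):
        frac = combined_false_fraction(100, 1000, 0.01, growth)
        print(f"true-site growth {growth}x: combined false fraction = {frac:.3f}")
```

Under these made-up growth factors the combined false fraction comes out at roughly 17% and 9%, which is in the neighborhood of the 10-20% the FAQ quotes. If the true sites did not grow at all, the fraction would be far higher, which seems to be the point of the "not 100%" remark.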
A similar thread that may be of interest: Samtools: merge and mpileup vs mpileup alone for variant-calling with multiple BAM