I have a BAM file which consists of sequencing reads derived from pooling 25 individuals and sequencing the pool of DNA. Variant calling, or calculating minor allele frequencies, requires one to specify that the BAM file actually consists of 25 samples, rather than one. This is usually done by specifying read group headers in the BAM header and for each read. For samtools, it seems like specifying any read group to a read is fine, and we just need to make sure the BAM file contains 25 read groups in total.
If I am correct, this would not be the case if I would call variants using, for example, the GATK framework, as stated in their manual: "a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model"
However, the bwa -R command you mention only allows us to add one read group to all reads in a BAM file. Since I have 25 pooled samples within the same BAM, this option would not be possible for me. This is similar for other read group add commands such as the one from the freebayes software: https://github.com/ekg/bamaddrg or the AddOrReplaceReadGroups command from Picard tools.
Therefore, two questions:
(a) How can I add multiple read groups to a single BAM file (simultaneously?), allowing me to do correct variant calling?
(b) Does it really matter whether I specify this for the pileup
file? It most likely does when eventually calling variants, but the pileup file does no such thing.
how do you know which read nº1 should be assigned to the group 'g1' and read nº2 should be assigned to the group 'g2' and ... etc...?
You basically don't. I think all that matters is specifying that it's a pooled sample, because this blows up the variability in assessing allele frequencies.
Yeah, that's not going to work. Each alignment itself needs to be associated to a particular read group, otherwise they'll just get ignored.
What about just modifying the ploidy settings?
Thank you for your answer. Adjusting ploidy settings is not possible within samtools mpileup, but maybe with bcftools?
Possibly, I know GATK has a ploidy option, don't know about bcftools.