I have reference genome aligned data of pooled DNA sequences. Each pooled sample consisted of 25 diploid individuals. My goal is to derive minor allele frequencies for high probable SNPs or indels.
A lot of people talk about the samtools mpileup command in this context. Similar, published research regarding pooled DNA samples use this command to call variants (e.g. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0080422). However, reading the manual page of the command, it does not have an option or flag to specify we are dealing with pooled samples. I have read quite some posts on this and other forums telling me that accounting for the pooling effect is important (due to subsampling of chromosomes, population inference etc.).
Since there is no option to specify the pooled samples, would simply running the command
samtools mpileup -B -Q 20 -f genome.fa pop.bam pop2.bam > p1-2.mpileup
result in valid variant calling of which I will be able to derive valid minor allele frequencies over the genome for both pooled samples? I realise some post-hoc filtering based on probability of a real variant etc. is necessary.
Or do I need to specify a specific header in my BAM file telling samtools that we are dealing with 25 samples aggregated into one BAM file?
Thank you for your answer. Does it matter which group I assign to a read or does it only matter that I assign 25 groups in total?
As long as each read is linked to a group ID, everything is ok.
Thank you again. So, it seems like specifying any read group to a read is fine, and we just need to make sure the BAM file contains 25 read groups in total.
If I am correct, this would not be the case if I would call variants using, for example, the GATK framework, as stated in their manual:
However, the
bwa -R
command you mention only allows us to add one read group to all reads in a BAM file. Since I have 25 pooled samples within the same BAM, this option would not be possible for me. This is similar for other read group add commands such as the one from the freebayes software: https://github.com/ekg/bamaddrg