Question

samtools mpileup on pooled samples

0

Entering edit mode

9.8 years ago

koen.vdberge ▴ 30

I have reference genome aligned data of pooled DNA sequences. Each pooled sample consisted of 25 diploid individuals. My goal is to derive minor allele frequencies for high probable SNPs or indels.

A lot of people talk about the samtools mpileup command in this context. Similar, published research regarding pooled DNA samples use this command to call variants (e.g. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0080422). However, reading the manual page of the command, it does not have an option or flag to specify we are dealing with pooled samples. I have read quite some posts on this and other forums telling me that accounting for the pooling effect is important (due to subsampling of chromosomes, population inference etc.).

Since there is no option to specify the pooled samples, would simply running the command

samtools mpileup -B -Q 20 -f genome.fa pop.bam pop2.bam > p1-2.mpileup

result in valid variant calling of which I will be able to derive valid minor allele frequencies over the genome for both pooled samples? I realise some post-hoc filtering based on probability of a real variant etc. is necessary.

Or do I need to specify a specific header in my BAM file telling samtools that we are dealing with 25 samples aggregated into one BAM file?

variant samtools • 3.8k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by koen.vdberge ▴ 30

Ram · Accepted Answer · 2015-10-01

2

Entering edit mode

9.8 years ago

Pierre Lindenbaum 166k

You can identify your reads by assigning a read group (RG) to your reads. for example, see option -R in bwa: http://bio-bwa.sourceforge.net/bwa.shtml

-R STR Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]

mpileup handles groups. Depending of your needs you can create 25 reads groups with one sample per group or 25 reads groups with the same sample, etc...

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thank you for your answer. Does it matter which group I assign to a read or does it only matter that I assign 25 groups in total?

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by koen.vdberge ▴ 30

0

Entering edit mode

As long as each read is linked to a group ID, everything is ok.

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thank you again. So, it seems like specifying any read group to a read is fine, and we just need to make sure the BAM file contains 25 read groups in total.

If I am correct, this would not be the case if I would call variants using, for example, the GATK framework, as stated in their manual:

a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model

However, the bwa -R command you mention only allows us to add one read group to all reads in a BAM file. Since I have 25 pooled samples within the same BAM, this option would not be possible for me. This is similar for other read group add commands such as the one from the freebayes software: https://github.com/ekg/bamaddrg

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by koen.vdberge ▴ 30