I have 30 + bam files and I have merged the data using samtools merge(adding RG tag using -rh option).I ran mpileup on each of the .bam files separately and ran mpileup on the merged data to look at average coverage.Say if I have 10x average coverage in each of my individual samples,I am expecting a coverage of roughly 10x*n in the merged case where n is the number of samples .However,I see my average coverage on the merged bam dropping to roughly half the value and some of the bases included in individual samples pileup are not included in the merged bam pileup.I am not sure how samtools mpileup is interpreting the merged bam file especially with RG tags or I am wondering if the merge has been done properly or not.Any help would be appreciated.
Just a thought but are these bams marked for duplicates?
The bams are not marked for duplicates
@Vivek, I need to ask if the bam files are marked for duplicates and then recalibrated using GATK , does not they work properly with the samtools mpileup?
That shouldn't be an issue. I use samtools mpileup all the time on GATK processed BAMs.
ok thanks probably the version of samtools am using is not updated. I will try with a newer version then. Thanks.
Can you post example command strings for each of the steps you describe? How exactly are you calculating average coverage?
Merge: samtools merge -rh rg.txt out.merged.bam *.sort.bam
mpileup on merged bam:
Average coverage calculations:
where sum/genome_length should give me the coverage
Also I am getting the following error when I run flag stat on merged bam
As far as my knowledge goes this shows up whenever I read uncompressed bam files.So,I think it is not an error and the data has been merged properly.Any ideas?
just a suggestion: I have systematically been more successful on analyzing merged bams when creating them using picard's MergeSamFiles rather than samtools' merge, although this well may be due to a wrong samtools usage on my side.