Mpileup On Merged .Bam File
1
2
Entering edit mode
12.3 years ago
Kssr ▴ 110

I have 30 + bam files and I have merged the data using samtools merge(adding RG tag using -rh option).I ran mpileup on each of the .bam files separately and ran mpileup on the merged data to look at average coverage.Say if I have 10x average coverage in each of my individual samples,I am expecting a coverage of roughly 10x*n in the merged case where n is the number of samples .However,I see my average coverage on the merged bam dropping to roughly half the value and some of the bases included in individual samples pileup are not included in the merged bam pileup.I am not sure how samtools mpileup is interpreting the merged bam file especially with RG tags or I am wondering if the merge has been done properly or not.Any help would be appreciated.

samtools coverage bam • 11k views
ADD COMMENT
0
Entering edit mode

Just a thought but are these bams marked for duplicates?

ADD REPLY
0
Entering edit mode

The bams are not marked for duplicates

ADD REPLY
0
Entering edit mode

@Vivek, I need to ask if the bam files are marked for duplicates and then recalibrated using GATK , does not they work properly with the samtools mpileup?

ADD REPLY
0
Entering edit mode

That shouldn't be an issue. I use samtools mpileup all the time on GATK processed BAMs.

ADD REPLY
0
Entering edit mode

ok thanks probably the version of samtools am using is not updated. I will try with a newer version then. Thanks.

ADD REPLY
0
Entering edit mode

Can you post example command strings for each of the steps you describe? How exactly are you calculating average coverage?

ADD REPLY
0
Entering edit mode

Merge: samtools merge -rh rg.txt out.merged.bam *.sort.bam

mpileup on merged bam:

samtools mpileup -D -C 50 -q 20 -Q 20 -f ref.fasta  out.merged.bam >merged.mpileup

Average coverage calculations:

awk '{sum+=$4} END {print sum,"\t",sum/genome_length,"\t",NR}'  merged.mpileup

where sum/genome_length should give me the coverage

Also I am getting the following error when I run flag stat on merged bam

samtools flagstat out.merged.bam
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_flagstat_core] Truncated file? Continue anyway.

As far as my knowledge goes this shows up whenever I read uncompressed bam files.So,I think it is not an error and the data has been merged properly.Any ideas?

ADD REPLY
0
Entering edit mode

just a suggestion: I have systematically been more successful on analyzing merged bams when creating them using picard's MergeSamFiles rather than samtools' merge, although this well may be due to a wrong samtools usage on my side.

ADD REPLY
0
Entering edit mode
12.3 years ago
matted 7.8k

Is your sequencing very high coverage? When you run samtools mpileup, it should say something like:

[mpileup] 200 samples in 200 input files
<mpileup> Set max per-file depth to 40

It only outputs the max depth message if it has to raise it from the default, or if you request a value that will result in high memory usage, like:

[mpileup] 200 samples in 200 input files
(mpileup) Max depth is above 1M. Potential memory hog!

What does it say in your case? I think the limit will be 8000 / # samples, which will cause you problems if coverage gets above 266 in a sample (for n=30).

Try changing to higher values of -d, or better yet, use samtools depth. It will do most of the filtering you request in your example (it has -q and -Q), and it shouldn't have any depth constraints (since it doesn't call variants).

I believe the truncated file warning can be ignored and is an artifact from writing uncompressed BAM files. I think the latest version of samtools (or the git version) fixes this.

ADD COMMENT
0
Entering edit mode

This is what I get when I run mpileup on individual and merged data respectively [mpileup] 1 samples in 1 input files <mpileup> Set max per-sample depth to 8000.

[mpileup] 36 samples in 1 input files

My individual data have avg coverages of roughly 10x and some bases have really high coverage(>8000 and some of them above 200).I am thinking the coverage you are talking about is the avg coverage/depth per sample but not per base position.I am still not clear about the depth constraints in samtools.

I tried samtools depth and it is giving me the expected coverage.GATK depthofcoverage and bedtools genomecoverage bed give similar results(doesn't include some of the bases though).

ADD REPLY
0
Entering edit mode

The max depths are per-base, since samtools has to hold all local reads in memory to do variant calling around a locus. So if you have areas of high coverage, this could be causing the difference.

And to clarify, did samtools depth give you the correct answer?

ADD REPLY
0
Entering edit mode

Thanks for the clarification regarding depth.Samtools depth is giving me the correct answer.I checked this with coverage from GATK DepthOfCoverage and BEDTools as well.

ADD REPLY

Login before adding your answer.

Traffic: 2049 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6