Is there a tool that can be used to count the number of bases with coverage >= X in each of 3 input files? (That is, if chr1:1,000,000 has 30 reads at that position in each of bams A,B,C, then that position would be counted towards the total.)
You could do this with pysam (or htslib if you prefer C) and an mpileup. This will allow you to iterate over each base in the genome and count the number of reads in each sample (possibly with filtering metrics that you can specify) that cover that base. This is relatively quick. You could also use samtools depth on each file separately (internally, this is just a pileup) and then parse the results to count the bases. This might prove to be simpler to write in fact.
I sometimes need to do something very similar, and I solve it by using bedtools. the rationale is very simple:
extract a bed file per bam file with the regions where the bam file's positions are covered above a threshold (by generating a bedgraph file, filtering it by coverage, and merging all adjacent regions)
intersect all the bed files previously obtained
the commands needed would be these ones:
THRESHOLD=30
foreach bamfile in *bam; do
bedtoools genomecov -bga -ibam $bamfile \
| awk -v t=$THRESHOLD '$4>=t' \
| bedtools merge -i - \
> ${bamfile/.bam/.above$THRESHOLD.bed}
done
cp `ls *.above$THRESHOLD.bed | head -1` bed.tmp
for bedfile in *.above$THRESHOLD.bed; do
bedtools intersect -a bed.tmp -b $bedfile > bed.tmp.2
mv bed.tmp.2 bed.tmp
done
mv bed.tmp above$THRESHOLD.intersect.bed
Off the top of my head, if I wanted to solve this problem quickly I would merge the BAMs (you could also use samtools merge) and then run Picard's CollectWgsMetrics tool. What you'll get at the end is a text file showing the number of bases at each coverage level, along with some other nice details.
This is an interesting idea, but how will I know if a base is covered in all 3 files? If a given position has 100 reads in the merged bam, is it possible to identify the extreme case where all of the reads are coming from the same sample?
Well, that's a different question. I was addressing your original post with my answer.
For this new query, I would use bedtools genomecov with each of your BAMs (and perhaps with the merged BAM as well) to ascertain the per-base coverage at each reference position.
If you were interested in specific locations of the genome, I would put those locations in a BED file and use samtools to create a BAM for just those regions. Feed that truncated BAM into the above tool and you should get your results quickly.
ADD REPLY
• link
updated 2.9 years ago by
Ram
44k
•
written 10.1 years ago by
Dan D
7.4k
0
Entering edit mode
Thanks for your response, and sorry if my original post was unclear!
Do you think the genomecov method is feasible with whole genomes?
It certainly is. The documentation for the tool details several ways you can have the output presented. I'm not sure exactly what you're wanting to get out of it at this point; read through the short documentation for that tool. If the available parameters don't meet your needs, you might be able to accomplish your goal by piping the output into a filtering script or sed/awk command. If you have any questions, post a comment on this answer or your question and I'll do what I can to help. Precise explanations of what you want will be helpful to that end :)
Yes, of course. I have 3 whole genomes (normal, pre, and post treatment) and want to know the number of bases covered to 10X in all 3 files. I am somewhat familiar with genomecov, and know that I can pipe the output to a simple awk script that will keep only those bases with coverage > X and subsequently use some FNR==NR magic to compare those files. I'm worried that the genomecov outputs will be prohibitively massive, however, even if I gzip them. I guess I was hoping for a nifty tool that could read bams and count covered bases simultaneously...
HTSeq-count would help you
http://www-huber.embl.de/users/anders/HTSeq/doc/counting.html