Question

What Is The Fastest Method To Determine The Number Of Positions In A Bam File With >N Coverage?

4

Entering edit mode

11.5 years ago

Malachi Griffith 20k

I have two very large BAM files (high depth, human, whole genome). I have a seemingly simple question. I want to know how many positions in each are covered by at least N reads (say 20). For now I am not concerned about requiring a minimum mapping quality for each alignment or a minimum read quality for the reads involved.

Things I have considered:

samtools mpileup (then piped to awk to assess the minimum depth requirement, then piped to wc -l). This seemed slow...
samtools depth (storing the output to disk so that I can assess coverage at different cutoffs later). Even if I divide the genome into ~133 evenly sized pieces, this seems very slow...
bedtools coverage?
bedtools genomecov?
bedtools multicov?
bamtools coverage?

Any idea which of these might be fastest for this question? Something else I haven't thought of? I can use parallel processes to ensure that the performance bottleneck is disk access but want that access to be as efficient as possible. It seems that some of these tools are doing more than I need for this particular task...

bam bedtools samtools coverage depth-of-coverage depth-of-coverage • 12k views

ADD COMMENT • link updated 6.1 years ago by Paul ★ 1.5k • written 11.5 years ago by Malachi Griffith 20k

1

Entering edit mode

Of the bedtools options, genomecov is the one you want (see answer from Jordan). It is admittedly not the fastest - there are ways to speed it up, but we haven't yet done so. I'd be interested to know the runs times you face as a function of the number of reads.

ADD REPLY • link 11.5 years ago by Aaronquinlan 12k

0

Entering edit mode

All the answers were helpful but 'bedtools genomecov' was the method I wound up using in the end. 'samtools depth' has convenient output. I expect Pierre's solution is the probably the fastest.

ADD REPLY • link 11.5 years ago by Malachi Griffith 20k

score 7 · Answer 1 · 2013-05-21

Use the basic code for the Samtools API: http://samtools.sourceforge.net/sam-exam.shtml

change:

 if ((int)pos >= tmp->beg && (int)pos < tmp->end)

to

 if (n>20 && (int)pos >= tmp->beg && (int)pos < tmp->end)

compile: (sould be something like:)

gcc -O3 -o mytool mytool.c -I /path/to/samtools -L /path/to/samtools -lbam -lz

Edit: and you can run this code using one process per chromosome...

score 3 · Answer 2 · 2013-05-21

You could try bedtools genomeCoverageBed. It takes in bam file and reference genome as input and outputs the coverage stats.

Here is an example:

genomeCoverageBed -ibam file.bam -g human_g1k_v37.fasta > coverage.txt

The output file has 5 columns:

genome    0    1670110268    3101804739    0.538432
genome    1    696287280    3101804739    0.224478
genome    2    329358858    3101804739    0.106183
genome    3    144809487    3101804739    0.0466856
genome    4    66917607    3101804739    0.0215738
genome    5    32869034    3101804739    0.0105967
genome    6    17980502    3101804739    0.00579679
genome    7    10955385    3101804739    0.00353194

Column1: Genome/Chromosome name
Column2: Depth
Column3: Number of bases in the bam file with depth equal to Column2
Column4: Genome Length
Column5: Fraction of human genome with the depth equal to Column2

So, if you need coverage >= 5, in this case it would be 6.26% (you just add 0.538432+0.224478+0.106183+0.0466856+0.0215738 and subtract from 1).

I hope this is what you are looking for.

score 1 · Answer 3 · 2013-05-21

1

Entering edit mode

11.5 years ago

Zev.Kronenberg 12k

I pipe samtools depth to a C++ program I wrote to count depth X. From these counts I can quickly calculate the precent of bam covered at X or greater.

I also take missing sites into account.

chr1 1 10

chr 34 5

2-33 = 0;

ADD COMMENT • link 11.5 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Yeah, samtools depth is the method I was using. But for these massive BAM files, just running samtools depth seems to take a long time...

ADD REPLY • link 11.5 years ago by Malachi Griffith 20k

0

Entering edit mode

I've switched over to using sambamba depth, it's much faster than samtools depth and is also multithreaded. It also gives a parsed pileup like output with coverage by the type of base and has additional options to calculate coverage in overlapping widows.

ADD REPLY • link 7.7 years ago by Vasisht ▴ 190

0

Entering edit mode

I don't see an equivalent to depth or bedcov in the sambamba docs

ADD REPLY • link 7.6 years ago by ariel.balter ▴ 260

score 0 · Answer 4 · 2018-10-31

0

Entering edit mode

6.1 years ago

drkennetz ▴ 560

To update this thread, sambamba depth is a wonderful library that supports multithreading which drastically increases speed but provides the same functionality as samtools depth.

I would recommend everybody check out this package! Sambamba

ADD COMMENT • link 6.1 years ago by drkennetz ▴ 560

score 0 · Answer 5 · 2018-10-31

For very large BAM files (like WGS) I was tested Mosdepth Link. It is very fast and there are lot of option. Mosdepth can report the mean depth in 500-base windows genome-wide info under 9 minutes of user time with 3 threads.

Example of syntax:

mosdepth -n --fast-mode --by 500 sample.wgs $sample.wgs.cram