What Is The Fastest Method To Determine The Number Of Positions In A Bam File With >N Coverage?
5
4
Entering edit mode
11.5 years ago

I have two very large BAM files (high depth, human, whole genome). I have a seemingly simple question. I want to know how many positions in each are covered by at least N reads (say 20). For now I am not concerned about requiring a minimum mapping quality for each alignment or a minimum read quality for the reads involved.

Things I have considered:

  • samtools mpileup (then piped to awk to assess the minimum depth requirement, then piped to wc -l). This seemed slow...
  • samtools depth (storing the output to disk so that I can assess coverage at different cutoffs later). Even if I divide the genome into ~133 evenly sized pieces, this seems very slow...
  • bedtools coverage?
  • bedtools genomecov?
  • bedtools multicov?
  • bamtools coverage?

Any idea which of these might be fastest for this question? Something else I haven't thought of? I can use parallel processes to ensure that the performance bottleneck is disk access but want that access to be as efficient as possible. It seems that some of these tools are doing more than I need for this particular task...

bam bedtools samtools coverage depth-of-coverage depth-of-coverage • 12k views
ADD COMMENT
1
Entering edit mode

Of the bedtools options, genomecov is the one you want (see answer from Jordan). It is admittedly not the fastest - there are ways to speed it up, but we haven't yet done so. I'd be interested to know the runs times you face as a function of the number of reads.

ADD REPLY
0
Entering edit mode

All the answers were helpful but 'bedtools genomecov' was the method I wound up using in the end. 'samtools depth' has convenient output. I expect Pierre's solution is the probably the fastest.

ADD REPLY
7
Entering edit mode
11.5 years ago

Use the basic code for the Samtools API: http://samtools.sourceforge.net/sam-exam.shtml

change:

 if ((int)pos >= tmp->beg && (int)pos < tmp->end)

to

 if (n>20 && (int)pos >= tmp->beg && (int)pos < tmp->end)

compile: (sould be something like:)

gcc -O3 -o mytool mytool.c -I /path/to/samtools -L /path/to/samtools -lbam -lz

Edit: and you can run this code using one process per chromosome...

ADD COMMENT
3
Entering edit mode
11.5 years ago
Jordan ★ 1.3k

You could try bedtools genomeCoverageBed. It takes in bam file and reference genome as input and outputs the coverage stats.

Here is an example:

genomeCoverageBed -ibam file.bam -g human_g1k_v37.fasta > coverage.txt

The output file has 5 columns:

genome    0    1670110268    3101804739    0.538432
genome    1    696287280    3101804739    0.224478
genome    2    329358858    3101804739    0.106183
genome    3    144809487    3101804739    0.0466856
genome    4    66917607    3101804739    0.0215738
genome    5    32869034    3101804739    0.0105967
genome    6    17980502    3101804739    0.00579679
genome    7    10955385    3101804739    0.00353194
  • Column1: Genome/Chromosome name
    Column2: Depth
    Column3: Number of bases in the bam file with depth equal to Column2
    Column4: Genome Length
    Column5: Fraction of human genome with the depth equal to Column2

So, if you need coverage >= 5, in this case it would be 6.26% (you just add 0.538432+0.224478+0.106183+0.0466856+0.0215738 and subtract from 1).

I hope this is what you are looking for.

ADD COMMENT
1
Entering edit mode

This just takes 5-10 min depending on the system you have. Never exceeded 10 min for me.

ADD REPLY
1
Entering edit mode
11.5 years ago

I pipe samtools depth to a C++ program I wrote to count depth X. From these counts I can quickly calculate the precent of bam covered at X or greater.

I also take missing sites into account.

chr1 1 10

chr 34 5

2-33 = 0;

ADD COMMENT
0
Entering edit mode

Yeah, samtools depth is the method I was using. But for these massive BAM files, just running samtools depth seems to take a long time...

ADD REPLY
0
Entering edit mode

I've switched over to using sambamba depth, it's much faster than samtools depth and is also multithreaded. It also gives a parsed pileup like output with coverage by the type of base and has additional options to calculate coverage in overlapping widows.

ADD REPLY
0
Entering edit mode

I don't see an equivalent to depth or bedcov in the sambamba docs

ADD REPLY
0
Entering edit mode
6.1 years ago
drkennetz ▴ 560

To update this thread, sambamba depth is a wonderful library that supports multithreading which drastically increases speed but provides the same functionality as samtools depth.

I would recommend everybody check out this package! Sambamba

ADD COMMENT
0
Entering edit mode
6.1 years ago
Paul ★ 1.5k

For very large BAM files (like WGS) I was tested Mosdepth Link. It is very fast and there are lot of option. Mosdepth can report the mean depth in 500-base windows genome-wide info under 9 minutes of user time with 3 threads.

Example of syntax:

mosdepth -n --fast-mode --by 500 sample.wgs $sample.wgs.cram
ADD COMMENT

Login before adding your answer.

Traffic: 1496 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6