Question

Reducing I/O Burden Of Bamtools C++ Api

1

Entering edit mode

12.0 years ago

Mitch Bekritsky ★ 1.3k

Hi,

I have some C++ code I've written that uses the BamTools API to count the number of reads that overlap a set of intervals in hg19. I'm trying to run this on ~400 BAM files using SGE (24 threads per node), but the I/O burden is wreaking havoc with the cluster I'm using. Currently, my code does something along the lines of:

    BamReader reader;
    reader . Open(bamName);

      while(reader . GetNextAlignment(al))
    {
     do stuff...
    }

Which is the most straightforward use of the API to read BAM files I can imagine.

My guess is that BamTools provides random access to the file, and all I'm looking for is to read the BAM line by line. Is there any way to do this with the API, or am I better off opening a pipe to samtools view?

For a number of reasons, I cannot use GATK DepthOfCoverage or SamTools pileup since I need the read names for my coverage analysis.

Thanks!

EDIT: I am using these BAM files to compare interval coverage from another sequencing pipeline to one I am developing.

The BAM files I am looking at come from a different project that has alignment data for families. There is one BAM file per family, with each member identifiable by a set of read group IDs corresponding to different flow cells. These files already come to me sorted and indexed.

I'm looking at coverage for about 6M short intervals throughout hg19, ranging from ~10 bp to ~200 bp.

I specifically need the read IDs because I have noticed that some read fragments are small enough that the same interval is present in both read pairs. If I didn't have read IDs, I wouldn't be able to filter reads that cover the same region twice.

• 6.1k views

ADD COMMENT • link updated 12.0 years ago by Erik Garrison ★ 2.4k • written 12.0 years ago by Mitch Bekritsky ★ 1.3k

2

Entering edit mode

I can't quite tell from your rough example, but are you using the BamTools API to seek to the positions you're interesting in? This will require a sorted and indexed bam file. If you're not, and you're checking for interval overlaps in the inner loop ("do stuff"), you could be wasting a lot of work and IO (depending on how sparse the intervals are).

ADD REPLY • link 12.0 years ago by matted 7.8k

0

Entering edit mode

I'm not seeking within the BAM file, just streaming it in line by line. You make a great point that I could easily be using intervals instead.

ADD REPLY • link 12.0 years ago by Mitch Bekritsky ★ 1.3k

2

Entering edit mode

The following code in bedtools (for the multibamcov tool) uses a MultiReader to seek to regions identified by specific BED entries and count coverage. This may help. https://github.com/arq5x/bedtools/blob/master/src/multiBamCov/multiBamCov.cpp#L106-L215

ADD REPLY • link 12.0 years ago by Aaronquinlan 12k

0

Entering edit mode

If I end up using intervals, this is definitely the way to do it. I'll give this a shot unless I or someone else come up with a better idea. Thanks!

ADD REPLY • link 12.0 years ago by Mitch Bekritsky ★ 1.3k

2

Entering edit mode

"... SamTools pileup since I need the read names for my coverage analysis." but could you use samtools view -L my.bed my.bam ?

ADD REPLY • link 12.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks Pierre for the idea. Unfortunately, I didn't initially mention that I need the read IDs to filter overlapping read pairs. But I'm glad to learn about another useful samtools feature :)

ADD REPLY • link 12.0 years ago by Mitch Bekritsky ★ 1.3k

0

Entering edit mode

bug you'll get the read IDs with "samtools view"

ADD REPLY • link 12.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Whoops...misunderstood your comment, Pierre. I'm looking at my local documentation for samtools and the documentation and sourceforge and don't see anything about a -L flag with bed coordinates. Can you point me to the samtools version that has this option?

ADD REPLY • link 12.0 years ago by Mitch Bekritsky ★ 1.3k

2

Entering edit mode

$ samtools-0.1.18/samtools view

Usage:   samtools view [options] <in.bam>|<in.sam> [region1 [...]]
(...)
     -L FILE  output alignments overlapping the input BED FILE [null]

ADD REPLY • link 12.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks again for the tip...I just timed samtools view with -L and it worked really well and fairly quickly

ADD REPLY • link 12.0 years ago by Mitch Bekritsky ★ 1.3k

1

Entering edit mode

After trying using BamTools with SetRegion, the time to get the coverage at all 6 million loci went from less than half an hour to almost 6 hours.

Thanks to Pierre though, I looked into the SamTools C API and Heng Li's code for the latest version of SamTools (which has the -L flag, almost exactly like what I'm looking for) and was able to use that code to get coverage at these intervals according to my specifications while limiting how much time the code takes. It now runs in about 17 minutes.

If anyone is interested in the code, let me know, but it's really just a derivation of sam_view from samtools-0.1.18 by Heng Li.

Thanks again for everyone's help!

ADD REPLY • link 12.0 years ago by Mitch Bekritsky ★ 1.3k

0

Entering edit mode

Can you split your set of intervals an run your program for each sub-set on a different node ?

ADD REPLY • link 12.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I can split by interval, but since I'm already running 400+ BAM files in parallel on a 16 node cluster, I'd rather not increase the number of jobs I'm submitting by an order of magnitude.

ADD REPLY • link 12.0 years ago by Mitch Bekritsky ★ 1.3k

0

Entering edit mode

Are you not actually using the BamMultiReader? And based on your use case, it sounds like you are seeking to specific intervals. This would likely lead to a bunch of I/O ops, especially if you are interrogating many intervals. Worst case, you will do 400 seeks for each new interval.

ADD REPLY • link 12.0 years ago by Aaronquinlan 12k

0

Entering edit mode

Since it's only one file per person, I don't need BamMultiReader. However, you are right, seeking to intervals is probably the way to go. I guess I'm still a little concerned that it could be a huge IO draw while it's running, but it can just run in a much shorter period of time. If I limit the number of jobs that can be submitted at once, it should be fine though.

ADD REPLY • link 12.0 years ago by Mitch Bekritsky ★ 1.3k

score 2 · Answer 1 · 2012-12-03

2

Entering edit mode

12.0 years ago

Erik Garrison ★ 2.4k

As others have said, if you don't need to read the whole file, just use the BamReader.set region() call to specify a target region.

Alternatively, you can open stdin as a BAM file in BamTools (pass std::cin to open), and then pipe BAM into your script using any external method to open the BAMs in the target regions.

You can also limit parallelism using appropriate measures for your setup. When I just want to run a few dozen parallel IO-bound threads I use gnu parallel, and run everything from a single workstation.

ADD COMMENT • link 12.0 years ago by Erik Garrison ★ 2.4k

0

Entering edit mode

Thanks Erik! I'm using setRegion to specify targets now, which is an improvement on the outsized I/O burden of my original code. Unfortunately, now things seem to be running a bit slower, so I'm looking into how to speed up the processing.

The next thing on my list of things to try is piping the BAM in, but I'm still hoping I can avoid that and use BamTools if I can.

As for parallelism, I'm going to limit my jobs to a few per node at once as soon as I get this working with some decent speed. I'd rather go through SGE instead though so that it can manage my resources and only submit my jobs to nodes where things are kinda quiet.

ADD REPLY • link 12.0 years ago by Mitch Bekritsky ★ 1.3k

score 1 · Answer 2 · 2012-12-02

There is no sensible advise to give except for don't run this code 400 fold in parallel!

More precisely, if you're only interested in specific regions, you should definitely use the index and seek within the BAM files (reader.SetRegion). This reduces the I/O load substantially. However, chances are your code is doing very little processing, so you are running 400 jobs that demand data from the file system as fast as it can deliver, your filesystem is NFS or similar, and there is simply no network bandwidth to support it.

How do I know that? I see it almost daily: colleagues (biologists with no CS training) write jobs that compute almost nothing but read enormous amounts of data. They start 400 of them on an SGE managed cluster and the whole NFS becomes unusable for everyone. Running the same task serially on their workstation(!) is ususally faster.

Short term, the only thing you can do is run fewer jobs in parallel. Long term, you need a distributed file system---or more compact data format and better data management, but I am not holding my breath for that to happen.