Reducing the I/O burden of the BamTools C++ API
12.1 years ago
Mitch Bekritsky ★ 1.3k

Hi,

I have some C++ code I've written that uses the BamTools API to count the number of reads that overlap a set of intervals in hg19. I'm trying to run this on ~400 BAM files using SGE (24 threads per node), but the I/O burden is wreaking havoc with the cluster I'm using. Currently, my code does something along the lines of:

    BamReader reader;
    reader.Open(bamName);

    BamAlignment al;
    while (reader.GetNextAlignment(al)) {
        // do stuff...
    }

This is the most straightforward use of the API to read BAM files that I can imagine.

My guess is that BamTools provides random access to the file, and all I'm looking for is to read the BAM line by line. Is there any way to do this with the API, or am I better off opening a pipe to samtools view?

For a number of reasons, I cannot use GATK DepthOfCoverage or SamTools pileup since I need the read names for my coverage analysis.

Thanks!

EDIT: I am using these BAM files to compare interval coverage from another sequencing pipeline to one I am developing.

The BAM files I am looking at come from a different project that has alignment data for families. There is one BAM file per family, with each member identifiable by a set of read group IDs corresponding to different flow cells. These files already come to me sorted and indexed.

I'm looking at coverage for about 6M short intervals throughout hg19, ranging from ~10 bp to ~200 bp.

I specifically need the read IDs because I have noticed that some fragments are short enough that the same interval is covered by both reads in a pair. Without read IDs, I wouldn't be able to filter reads that cover the same region twice.


I can't quite tell from your rough example, but are you using the BamTools API to seek to the positions you're interested in? This will require a sorted and indexed BAM file. If you're not, and you're checking for interval overlaps in the inner loop ("do stuff"), you could be wasting a lot of work and I/O (depending on how sparse the intervals are).


I'm not seeking within the BAM file, just streaming it in line by line. You make a great point that I could easily be using intervals instead.


The following code in bedtools (for the multibamcov tool) uses a MultiReader to seek to regions identified by specific BED entries and count coverage. This may help. https://github.com/arq5x/bedtools/blob/master/src/multiBamCov/multiBamCov.cpp#L106-L215


If I end up using intervals, this is definitely the way to do it. I'll give this a shot unless I or someone else come up with a better idea. Thanks!


"... SamTools pileup since I need the read names for my coverage analysis." But could you use "samtools view -L my.bed my.bam"?


Thanks Pierre for the idea. Unfortunately, I didn't initially mention that I need the read IDs to filter overlapping read pairs. But I'm glad to learn about another useful samtools feature :)


But you'll get the read IDs with "samtools view".


Whoops...misunderstood your comment, Pierre. I'm looking at my local samtools documentation and the documentation on SourceForge, and I don't see anything about a -L flag with BED coordinates. Can you point me to the samtools version that has this option?


$ samtools-0.1.18/samtools view

Usage:   samtools view [options] <in.bam>|<in.sam> [region1 [...]]
(...)
     -L FILE  output alignments overlapping the input BED FILE [null]

Thanks again for the tip...I just timed samtools view with -L, and it worked really well and fairly quickly.


After trying BamTools with SetRegion, the time to get the coverage at all 6 million loci went from less than half an hour to almost 6 hours.

Thanks to Pierre, though, I looked into the SamTools C API and Heng Li's code for the latest version of SamTools (which has the -L flag, almost exactly what I'm looking for) and was able to use that code to get coverage at these intervals according to my specifications while keeping the runtime down. It now runs in about 17 minutes.

If anyone is interested in the code, let me know, but it's really just a derivation of sam_view from samtools-0.1.18 by Heng Li.

Thanks again for everyone's help!


Can you split your set of intervals and run your program for each subset on a different node?


I can split by interval, but since I'm already running 400+ BAM files in parallel on a 16 node cluster, I'd rather not increase the number of jobs I'm submitting by an order of magnitude.


Are you not actually using the BamMultiReader? And based on your use case, it sounds like you are seeking to specific intervals. This would likely lead to a bunch of I/O ops, especially if you are interrogating many intervals. Worst case, you will do 400 seeks for each new interval.


Since it's only one file per person, I don't need BamMultiReader. However, you are right, seeking to intervals is probably the way to go. I'm still a little concerned that it could be a huge I/O draw while it's running, but at least it will run for a much shorter period of time. If I limit the number of jobs that can be submitted at once, it should be fine.

12.1 years ago
Erik Garrison ★ 2.4k

As others have said, if you don't need to read the whole file, just use the BamReader::SetRegion() call to specify a target region.

Alternatively, you can open stdin as a BAM file in BamTools (pass std::cin to Open), and then pipe BAM into your script using any external method to open the BAMs in the target regions.

You can also limit parallelism using appropriate measures for your setup. When I just want to run a few dozen parallel I/O-bound threads, I use GNU parallel and run everything from a single workstation.


Thanks Erik! I'm using SetRegion to specify targets now, which is an improvement on the outsized I/O burden of my original code. Unfortunately, things now seem to be running a bit slower, so I'm looking into how to speed up the processing.

The next thing on my list of things to try is piping the BAM in, but I'm still hoping I can avoid that and use BamTools if I can.

As for parallelism, I'm going to limit my jobs to a few per node at once as soon as I get this working with some decent speed. I'd rather go through SGE instead though so that it can manage my resources and only submit my jobs to nodes where things are kinda quiet.

12.1 years ago
Marvin ▴ 900

There is no sensible advice to give except: don't run this code 400-fold in parallel!

More precisely, if you're only interested in specific regions, you should definitely use the index and seek within the BAM files (reader.SetRegion). This reduces the I/O load substantially. However, chances are your code is doing very little processing, so you are running 400 jobs that demand data from the file system as fast as it can deliver, your filesystem is NFS or similar, and there is simply no network bandwidth to support it.

How do I know that? I see it almost daily: colleagues (biologists with no CS training) write jobs that compute almost nothing but read enormous amounts of data. They start 400 of them on an SGE-managed cluster and the whole NFS becomes unusable for everyone. Running the same task serially on their workstation(!) is usually faster.

Short term, the only thing you can do is run fewer jobs in parallel. Long term, you need a distributed file system, or a more compact data format and better data management, but I am not holding my breath for that to happen.


Hi Marvin,

While your suggestion about using SetRegion is sensible, and I have implemented it as per Aaron's comment, I really do not appreciate your condescension with regard to my perceived experience, or lack thereof, when it comes to coding.

The jobs I'm running are managed through SGE on a distributed file system; they are not run in parallel as 400 jobs at once. In practice, about 100 jobs run simultaneously across 16 nodes with 24 cores apiece. The NFS calls to the hard drive come only from local processes (if the BAM file I'm working on is on node 4, the process accessing that file will be on the same node). I would never put 400 jobs on a cluster at once, under any circumstances, and I always make sure the code is localized to the proper node.

And running the jobs serially, which I have tried, is no faster.

If you're going to offer advice, I'll thank you to do it politely in the future. If I have misinterpreted your intent, I wholeheartedly apologize in advance.
