Intersecting A Set Of Bam Reads With A Set Of Coordinates
4
5
Entering edit mode
13.8 years ago
User 9996 ▴ 840

I have a set of regions in the genome (about 4 million regions, parametrized by start and end genomic cooridnates) that I want to align BAM reads into. I want to get all reads that overlap that set of coordinates.

The naive way to do this is very slow and I'd like to use interval trees. However, to use interval trees, I'd have to first index all the BAM reads into a tree, and then, for each coordinate in my set of regions, look for all the reads within the tree.

My concern is that this requires loading all of the BAM reads into memory since I have to index them into the interval tree data structure.

Is there a way around this, or a Python library that does this already? Any help would be appreciated.

Thanks.

p.s. I'd prefer to do this within Python and not use BEDTools unless there's an easy way to integrate Python with BEDTools that someone can recommend?

python sam bam next-gen sequencing • 7.3k views
ADD COMMENT
1
Entering edit mode

Can you explain why you need an interval tree? If you have an aligned BAM file, you can sort and index it, then retrieve reads in a region using 'fetch' in pysam: http://code.google.com/p/pysam/

ADD REPLY
0
Entering edit mode

fetch will be horribly inefficient. If you iterate through a list of coordinates and fetch the reads from each one, it will take forever.

ADD REPLY
7
Entering edit mode
13.8 years ago

With plenty of help from Brent Pedersen, I added a small Python/Cython interface to the BEDTools overlap detection rotines called bedtools-python. Check out this brief example which combines pysam and bedtools-python for the task you describe. Also, Ryan Dale's pybedtools is a near complete Python interface to BEDTools.

ADD COMMENT
1
Entering edit mode

Both python interfaces to bedtools are very useful! Thanks a lot for creating these tools.

ADD REPLY
0
Entering edit mode

to be clear, Ryan Dale deserves all of the credit for pybedtools. The initial release is quite good and he is working on substantial improvements.

ADD REPLY
5
Entering edit mode
13.8 years ago

See pybedtools: https://github.com/daler/pybedtools

pybedtools is a Python wrapper for Aaron Quinlan's BEDtools and is designed to leverage the "genome algebra" power of BEDtools from within Python scripts.

ADD COMMENT
0
Entering edit mode

I cannot get it to install. After running python setup.py install --home=$HOME, I try to "import pybedtools" and I get the error "File "/srv/pkg/python/python-2.6.4/lib/python2.6/subprocess.py", line 1126, in _execute_child raise child_exception"

ADD REPLY
0
Entering edit mode

Is there more to the error message? Sometimes you can get this if a command can't be found, which could happen if bedtools isn't on the path. Most helpful would be if you pasted the full traceback either here or perhaps more appropriately in an email to the bedtools mailing list.

ADD REPLY
0
Entering edit mode

For anyone else with the above error for pybedtools: make sure you've installed BEDTools and that it's available on your path

ADD REPLY
4
Entering edit mode
13.8 years ago
Martin Morgan ★ 1.6k

The question implies python, but R / Bioconductor has the IRanges and related packages. You might (1) input your reference coordinates with the rtracklayer package (for gff and similar formats) or GenomicFeatures::makeTranscriptDb (for UCSF / biomaRt tracks) or rolled by hand; (b) count overlaps between the regions and bam files using Rsamtools::countBam(), which relies on the definition of overlap in samtools. Alternatively you might use IRanges::countOverlaps or findOverlaps for more flexibility in what counts as an overlap. A worked example is in section three of the vignette "GenomicRanges Use Cases" with a few more appropriate to counting approaches sketched in "Counting Alignment Overlaps with GenomicRanges". R works with objects in memory; the reference ranges fit easily and the bam files are easy to process in reasonably sized chunks.

ADD COMMENT
1
Entering edit mode
13.8 years ago

To follow up more on my comment above, at some point you are going to have to extract the reads from your BAM file to push them into an IntervalTree or BED file. Instead of doing this extraction, then querying, you can just do the extraction directly from the indexed file. This is fairly speedy, for disk access. For instance, I extracted 4 million random 500bp positions from a human genome BAM and it takes about 2 hours:

% ls -1sh
total 3.7G
3.7G 6_110216_FC638NPAAXX-sort.bam
6.9M 6_110216_FC638NPAAXX-sort.bam.bai
4.0K fetch_regions.py
% python2.6 fetch_regions.py 6_110216_FC638NPAAXX-sort.bam 
6717.29s user 123.74s system 99% cpu 1:54:03.33 total

The code is below:

import sys
import random

import pysam

def random_regions(in_bam, size):
    regions = []
    for chrom in in_bam.header["SQ"]:
        regions.append((chrom["SN"], 1, int(chrom["LN"]) - size))
    while 1:
        space, start, end = random.choice(regions)
        pos = random.randint(start, end)
        yield space, pos, pos+size

in_file = sys.argv[1]
in_bam = pysam.Samfile(in_file, "rb")
size = 500
n = 4e6
region_gen = random_regions(in_bam, size)
for i in range(int(n)):
    if i % 10000 == 0:
        print i
    space, start, end  = region_gen.next()
    read_size = 0
    for read in in_bam.fetch(space, start, end):
        read_size += read.rlen

This is not to take away from the other excellent tools mentioned for quick overlap comparison with IntervalTrees. I use these all the time for intersection of regions, but for your question about dealing directly with reads it seems reasonable to utilize direct lookup.

(As an aside, you've been here for a while as 'unknown (google)' and asked several questions. Why not add your name and become part of the community?)

ADD COMMENT

Login before adding your answer.

Traffic: 2831 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6