Counting read spacing efficiently

I'm looking for the most efficient way (or at least one more efficient than mine) to count distances between reads in a BAM file, so as to produce a plot like the following figure (Winter et al. 2013, Genome Research).

[Figure from Winter et al. 2013 (Genome Research) showing the distribution of read spacings; image not available]

Since I'm only interested in distances smaller than 200 bp, my current attempt is to take a window around each unique read and then compute the absolute distance of every read in that window from the window center.

Although I've embedded this in parallel code, it is still quite inefficient.

Any ideas on how to improve this?

Here's a gist with my code:

Tags: sequencing, alignment, DNase-seq, parallel
Comment:

Are your BAM files sorted? If so, the whole process can be done in a much more efficient manner. Even if they're not sorted, it could well be faster to sort them and then compute what you need.

Reply:

Yes, they're sorted and indexed. HTSeq uses the index when I do bam[window] to retrieve the reads in that window. However, I also thought I should take more advantage of the sorted/indexed BAM to count the distances. How would I go about doing that?

Reply:

I wouldn't bother with htseq at all. Just use pysam to iterate over each alignment. You just want to know (A) the bounds of each alignment and (B) the bounds of the alignments before/after it. From that, you can determine the distance to the nearest alignment and just increment a counter vector appropriately. This should require only a single pass through the file and only a few bytes of memory.

The slowness in your current method comes from storing each window and then checking, for each alignment, whether its window is already stored. That's a huge number of comparisons that simply don't need to be made.
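A minimal sketch of that single-pass idea, assuming pysam, a coordinate-sorted reads.bam (the file name is a placeholder), and the ~200 bp cap from the question:

    import numpy as np
    import pysam

    MAX_DIST = 200  # only tally spacings below this (the ~200 bp of interest above)
    counts = np.zeros(MAX_DIST, dtype=int)  # counts[d] = reads whose nearest neighbour is d bp away

    def tally(dist):
        if dist is not None and 0 <= dist < MAX_DIST:
            counts[dist] += 1

    bam = pysam.AlignmentFile("reads.bam", "rb")  # coordinate-sorted BAM
    prev = None       # (chrom, start, end) of the previous alignment
    prev_gap = None   # gap between `prev` and the alignment before it

    for aln in bam:   # single pass, in coordinate order
        if aln.is_unmapped or aln.reference_end is None:
            continue
        cur = (aln.reference_name, aln.reference_start, aln.reference_end)
        if prev is not None and prev[0] == cur[0]:
            gap = max(cur[1] - prev[2], 0)  # gap between prev and cur (0 if they overlap)
            # the nearest neighbour of `prev` is whichever of its two flanking gaps is smaller
            tally(gap if prev_gap is None else min(gap, prev_gap))
            prev, prev_gap = cur, gap
        else:
            # first alignment, or a new chromosome: flush the last alignment of the previous chromosome
            tally(prev_gap)
            prev, prev_gap = cur, None

    tally(prev_gap)  # nearest-neighbour distance for the very last alignment
    bam.close()
    print(counts)

This keeps only the previous alignment and one gap in memory, so a single pass over the sorted file is all that's needed.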

Reply:

Counting the distance to the nearest read (for all reads) is obviously the simplest way of doing it. Yet you'd lose information about other reads that also fall within the window of interest (~200 bp), and I'm not sure that's how the figure above was made.

I think I'll do it in this simpler way and see how it compares. Thanks for the input.

Reply:

Ah, yeah, if you want a count of all alignments within a window around all other alignments, then your implementation isn't that bad. You could make it more efficient in a manner similar to what I previously suggested, but honestly the time it'd take to write and debug that is probably longer than the time required to run what you already have. So unless you have to do this a lot, or want to publish it as part of something else, you probably already have a good enough solution.

Answer:

For future reference (in case anyone faces the same problem), here's an improved version:

  • Make genome-wide windows of the required width (400 bp in my case).
  • Get pairwise combinations of the reads in each window (making use of the indexed BAM file) and count the absolute distances between them (a sketch of this is given below).

Subsampling windows or reads to get a first look is of course reasonable, since the data start to resemble the figure above at somewhere around 2-10 million reads.
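For reference, here is a minimal sketch of this scheme. It uses pysam rather than HTSeq, and the file name, the use of 5' read starts as positions, and the constants are assumptions:

    import itertools
    import numpy as np
    import pysam

    WINDOW = 400     # genome-wide window width (400 bp, as above)
    MAX_DIST = 200   # only distances below this are of interest
    counts = np.zeros(MAX_DIST, dtype=int)   # counts[d] = number of read pairs d bp apart

    bam = pysam.AlignmentFile("reads.bam", "rb")   # sorted and indexed BAM (file name is a placeholder)

    for chrom, length in zip(bam.references, bam.lengths):
        for start in range(0, length, WINDOW):
            # start positions of reads overlapping this window (fetch uses the BAM index)
            positions = [aln.reference_start
                         for aln in bam.fetch(chrom, start, min(start + WINDOW, length))
                         if not aln.is_unmapped]
            # count absolute distances between all pairwise combinations of reads in the window
            for a, b in itertools.combinations(positions, 2):
                d = abs(a - b)
                if d < MAX_DIST:
                    counts[d] += 1

    bam.close()
    print(counts)

Note that with non-overlapping windows, pairs straddling a window boundary are missed; overlapping adjacent windows by MAX_DIST (and then de-duplicating pairs) would address that.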

Code is here:

Answer:

Here's another approach that uses parallelization and sorting (and then more parallelization!). Splitting the work up into smaller and more efficient pieces should make this job much, much faster.

If your BAM file is indexed and you have an SGE or OGE computational grid, you could parallelize conversion of BAM to sorted BED:

$ bam2bed_sge reads.bam reads.bed

If you prefer to use GNU Parallel, there is also bam2bed_gnuParallel.

Splitting the conversion work by chromosome generally reduces the execution time to roughly that of converting the largest chromosome.

Then look for the closest read to each read and only report distances:

$ closest-features --closest --dist --no-ref --no-overlaps reads.bed reads.bed > dists.txt

The --no-ref and --no-overlaps options leave out the read elements themselves and report only the closest signed distance between each read and its nearest (non-overlapping) read, up- or downstream.

Sorting with BEDOPS eliminates wasteful pairwise comparisons between all reads and makes this task fast.

You can again parallelize this work with bedextract and the addition of the --chrom operator to closest-features. Again, supposing an SGE computational grid:

for chrName in `bedextract --list-chr reads.bed`; do \
    echo "closest-features --chrom ${chrName} --closest --dist --no-ref --no-overlaps reads.bed reads.bed > dists.${chrName}.txt" | qsub <grid_options>; \
done

Piping the command into qsub submits it as a small job script, so the redirection to dists.${chrName}.txt happens inside the job rather than in the submitting shell.

As with the BAM conversion step, this generally reduces the search time to roughly that of the largest chromosome.
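If you go the per-chromosome route, the per-chromosome outputs (using the dists.${chrName}.txt naming from the loop above) can simply be concatenated before filtering:

$ cat dists.*.txt > dists.txt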

Then process dists.txt (or each per-chromosome distances file, if parallelized) to filter out distances larger than some threshold (say, 200 bp), remove the sign, and, for example, report the average of the remaining distances:

$ awk '
    function abs(x) {
        return ((x < 0.0) ? -x : x)
    }
    BEGIN {
        s = 0;
        n = 0;
    }
    {
        # keep only reads whose nearest read is within 200 bp, dropping the sign
        if (abs($0) < 200) { s += abs($0); n++; }
    }
    END {
        # mean absolute distance, or NaN if nothing passed the filter
        if (n > 0) { print s / n; } else { print "NaN"; }
    }' dists.txt