Here's another approach that uses parallelization and sorting (and then more parallelization!). Splitting the work into smaller pieces that run concurrently should make this job much, much faster.
If your BAM file is indexed and you have an SGE or OGE computational grid, you could parallelize conversion of BAM to sorted BED:
$ bam2bed_sge reads.bam reads.bed
If you prefer to use GNU Parallel, there is also bam2bed_gnuParallel.
Splitting the conversion work by chromosome generally reduces the execution time to that of the chromosome with the most reads.
Then look for the closest read to each read and only report distances:
$ closest-features --closest --dist --no-ref --no-overlaps reads.bed reads.bed > dists.txt
The --no-ref and --no-overlaps options leave out the reference read element and report only the closest signed distance between each read and its nearest non-overlapping read, upstream or downstream.
Sorting with BEDOPS eliminates wasteful pairwise comparisons between all reads and makes this task fast.
You can again parallelize this work with bedextract and the addition of the --chrom operator to closest-features. Again, supposing an SGE computational grid:
for chrName in `bedextract --list-chr reads.bed`; do \
    echo "closest-features --chrom ${chrName} --closest --dist --no-ref --no-overlaps reads.bed reads.bed > dists.${chrName}.txt" | qsub <grid_options>; \
done
As with the BAM conversion step, this generally reduces the search time to that of the chromosome with the most reads.
Then process dists.txt (or each per-chromosome distances file, if parallelized) to filter out distances larger than some threshold (say, 200), remove the sign, and report, say, the average of the distances that pass the threshold:
$ awk ' \
  function abs(x) { \
    return ((x < 0.0) ? -x : x) \
  } \
  BEGIN { \
    s = 0; \
    n = 0; \
  } \
  { \
    if (abs($0) < 200) { s += abs($0); n++; } \
  } \
  END { \
    if (n > 0) { print s / n; } else { print "NaN"; } \
  }' dists.txt
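If you ran the per-chromosome jobs, the same thresholding can be applied across all of the dists.${chrName}.txt outputs at once. Here is a minimal Python sketch of that aggregation, assuming one signed distance per line and the file naming used in the loop above:

import glob

THRESHOLD = 200  # keep only distances smaller than this, in bp

total = 0
n = 0
for path in glob.glob("dists.*.txt"):
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            try:
                dist = abs(int(line))  # remove the sign
            except ValueError:
                continue  # skip blank or non-numeric lines (e.g., NA)
            if dist < THRESHOLD:
                total += dist
                n += 1

print(total / n if n > 0 else "NaN")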
Are your BAM files sorted? If so, the whole process can be done in a much more efficient manner. Even if they're not sorted, it could well be faster to sort them and then compute what you need.
Yes, they're sorted and indexed. HTSeq uses the index when I do bam[window] to retrieve reads in that window. However, I also thought I should take more advantage of the sorted/indexed BAM to count the distances. How would I go about doing that?

I wouldn't bother with HTSeq at all. Just use pysam to iterate over each alignment. You just want to know (A) the bounds of each alignment and (B) the bounds of the alignments before/after it. From that, you can determine the distance to the nearest alignment and just increment a counter vector appropriately. This should require only a single pass through the file and only a few bytes of memory.
The slowness in your current method comes from storing each window and then checking, for each alignment, whether its window is already there. That's a huge number of comparisons that simply don't need to be made.
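For reference, here is a minimal single-pass sketch of that idea with pysam, assuming a coordinate-sorted and indexed reads.bam. Each alignment's nearest distance is the smaller of the gaps to the alignments immediately before and after it on the same chromosome, tallied in a counter vector:

from collections import Counter
import pysam

def gap(left_end, right_start):
    """Gap between two intervals; 0 if they overlap or abut."""
    return max(right_start - left_end, 0)

counts = Counter()
prev = None  # (chrom, start, end) of the alignment before curr
curr = None  # alignment still waiting on its right-hand neighbour

with pysam.AlignmentFile("reads.bam", "rb") as bam:
    for aln in bam.fetch():  # indexed BAM: mapped reads in sorted order
        if aln.is_unmapped or aln.reference_end is None:
            continue
        rec = (aln.reference_name, aln.reference_start, aln.reference_end)
        if curr is not None:
            candidates = []
            if prev is not None and prev[0] == curr[0]:
                candidates.append(gap(prev[2], curr[1]))
            if rec[0] == curr[0]:
                candidates.append(gap(curr[2], rec[1]))
            if candidates:
                counts[min(candidates)] += 1
        prev, curr = curr, rec

# the last alignment has no right-hand neighbour
if curr is not None and prev is not None and prev[0] == curr[0]:
    counts[gap(prev[2], curr[1])] += 1

counts then maps each nearest-neighbour distance to the number of alignments at that distance, which is the histogram you're after.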
Counting the distance to the nearest read (for all reads) is obviously the simplest way of doing it. Yet you'd lose information about other reads that fall within a window of the distance of interest (~200 bp), and I'm not sure that is how the figure above was made.
I think I'll do it in this simpler way and see how it compares. Thanks for the input.
Ah, yeah, if you want a count of all alignments within a window around every other alignment, then your implementation isn't that bad. You could make it more efficient in a manner similar to what I suggested above, but honestly the time it would take to write and debug that is probably longer than the time required to run what you already have. So unless you have to do this a lot, or want to publish it as part of something else, you probably already have a good enough solution.
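For completeness, a hedged sketch of that more efficient windowed variant: a two-pointer sweep over the coordinate-sorted start positions counts, for each alignment, how many others start within +/- 200 bp. The simplifying assumption here is that only start positions are compared, not full alignment intervals:

from collections import Counter
import pysam

WINDOW = 200

def windowed_counts(starts, window):
    """Neighbour counts for each entry of an ascending-sorted list."""
    out = []
    lo = hi = 0
    for pos in starts:
        while starts[lo] < pos - window:
            lo += 1
        while hi < len(starts) and starts[hi] <= pos + window:
            hi += 1
        out.append(hi - lo - 1)  # exclude the alignment itself
    return out

hist = Counter()  # neighbour count -> number of alignments
with pysam.AlignmentFile("reads.bam", "rb") as bam:
    for chrom in bam.references:
        starts = [aln.reference_start for aln in bam.fetch(chrom)]
        for c in windowed_counts(starts, WINDOW):
            hist[c] += 1
print(sorted(hist.items()))

This trades the few-bytes memory footprint of the nearest-neighbour pass for one chromosome's worth of start positions held in memory, but it stays a single pass and avoids the per-window dictionary lookups.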