This problem becomes a lot easier if the start stop are in the form of start<stop, (you'll then need a strand column of course) then in addition you sort the file by the start. That's what I would do first rather than come up with a more complex solution.
@eric upvoted, i like this solution. just to note that try/except/continue for all errors is going to make it hard to debug in the case of weird input.
@brentp Yes, I considered making 2 or 3 try/except cases, each with an possible 'error message' and that would have made it easier to debug. I just wanted a very concise and comprehensible code that did the trick, given the input data of this problem. Cheers
Here's a quick solution, and should run pretty fast. you can easily make it work per reference as well, though here i just assume you have only a single reference as you suggest.
import numpy as np
import sys
# since we dont know the max len beforehand,
# just use a big number and then chop it later.
MAX_LEN = 3e8
coverage = np.zeros((MAX_LEN,), dtype=np.uint8)
for i, line in enumerate(open(sys.argv[1])):
# logic for 1st line here.
if i == 0: continue
line = line.rstrip().split()
ref = line[0]
start, end = sorted(map(int, line[1:]))
coverage[start - 1:end] |= 1
print len(coverage[coverage == 1])
and run from the commandline with your alignment file as the 1st argument.
btw, the above uses memory proportional to the size of the reference sequence, independent of the number and content of reads. i think that's important for NGS.
The program only consumes about 250MB memory yet it creates a 300 million long array - that surprised me; then I noticed the one byte long datatype, neat!
If you fancy perl, and have faith in the bazaar or are generally lazy, then install Set::IntSpan (a module I find generally useful for genomic interval hacking in Perl)
allowing you to write this one-liner:
perl -MSet::IntSpan -an -e 'BEGIN{$c=Set::IntSpan->new} next if $. == 1; $c += [[sort @F[1,2]]]; END{print $c->size}'
This works fine. Set::IntSpan allows us to avoid reinventing the wheel. The code is both clear and brief. It allows us to think in terms of the domain, rather than the implementation. Thank you!
My C++ solution (does NOT need to store all the data in memory. You just need to keep the contigs.
Compiling/Testing
all:
g++ coverage.cpp
echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 \
-e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out
Output:
g++ coverage.cpp
echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
coverage=81 pb
time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 \
-e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out
coverage=114657440 pb
real 0m3.403s
user 0m0.076s
sys 0m0.008s
@pierre, by "it does not need to store the results in memory", that's only for overlaps, right? if there are no overlaps, it will store everything, or am i misreading your code?
advice: swap is a standard function of C++, you could overload the '<' operator rather than declaring ComQuery, you could also avoid to put everything in memory
This problem becomes a lot easier if the start stop are in the form of start<stop, (you'll then need a strand column of course) then in addition you sort the file by the start. That's what I would do first rather than come up with a more complex solution.
Is it sorted or not?
We need more of these challenges :)
that would be considered part of the solution
that could be considered part of the solution
not sorted -i'll make that clearer
Correction: "In this case it is 21+50+21=91" since |130-110|+1=21, not 10.
Correction: "In this case it is 21+50+21=92" since |130-110|+1=21, not 10.