Quick Programming Challenge: How Do I Calculate Reference Coverage From A Table Of Alignment Starts And Ends?
8
7
Entering edit mode
14.6 years ago

Imagine I have a reference sequence and some query sequences align to it

length=|end-start|+1

when end < start the query sequence has aligned to the negative strand

query   start stop
query1   100   120
query2    50    99
query3   130   110

I want to know the total coverage of the reference sequence, ignoring overlaps, so the number of reference bases covered by one or more alignments.

In this case it is 21+50+10=81

Any language is fine. Assume the input is tab delimited.

Note:This is not an invitation to recommend your favorite third-party alignment tool - you must deal with the data as I have presented it.

code alignment programming • 7.8k views
ADD COMMENT
1
Entering edit mode

This problem becomes a lot easier if the start stop are in the form of start<stop, (you'll then need a strand column of course) then in addition you sort the file by the start. That's what I would do first rather than come up with a more complex solution.

ADD REPLY
1
Entering edit mode

Is it sorted or not?

ADD REPLY
1
Entering edit mode

We need more of these challenges :)

ADD REPLY
0
Entering edit mode

that would be considered part of the solution

ADD REPLY
0
Entering edit mode

that could be considered part of the solution

ADD REPLY
0
Entering edit mode

not sorted -i'll make that clearer

ADD REPLY
0
Entering edit mode

Correction: "In this case it is 21+50+21=91" since |130-110|+1=21, not 10.

ADD REPLY
0
Entering edit mode

Correction: "In this case it is 21+50+21=92" since |130-110|+1=21, not 10.

ADD REPLY
6
Entering edit mode
14.6 years ago

Here is another approach in Python making few assumptions and using very little memory:

in_file = "queries.txt"
coverage = set()

with open(in_file) as f:
    for line in f:
        try:
            query, v1, v2 = line.strip().split("\t")
            v_min, v_max = sorted([int(v1), int(v2)])
        except:
            continue
        coverage.update(range(v_min, v_max + 1)) # as suggested by brentp

print "Reference coverage:", len(coverage)

Here is what I get when I run it:

time python golf-course.py 
Reference coverage: 81

real    0m0.026s
user    0m0.016s
sys     0m0.008ss

It should also be pretty 'Pythonic' :)

Cheers

ADD COMMENT
0
Entering edit mode

that is probably the most readable yet. great job

ADD REPLY
0
Entering edit mode

pythonic would use coverage.update(xrange(v_min, vmax + 1)) instead of using a loop.

ADD REPLY
0
Entering edit mode

That's true, I made the change! Thanks @brentp

ADD REPLY
0
Entering edit mode

@eric upvoted, i like this solution. just to note that try/except/continue for all errors is going to make it hard to debug in the case of weird input.

ADD REPLY
0
Entering edit mode

@brentp Yes, I considered making 2 or 3 try/except cases, each with an possible 'error message' and that would have made it easier to debug. I just wanted a very concise and comprehensible code that did the trick, given the input data of this problem. Cheers

ADD REPLY
0
Entering edit mode

I don't use python but it's a very elegant code... +1

ADD REPLY
5
Entering edit mode
14.6 years ago
brentp 24k

Here's a quick solution, and should run pretty fast. you can easily make it work per reference as well, though here i just assume you have only a single reference as you suggest.

import numpy as np
import sys
# since we dont know the max len beforehand,
# just use a big number and then chop it later.
MAX_LEN = 3e8

coverage = np.zeros((MAX_LEN,), dtype=np.uint8)

for i, line in enumerate(open(sys.argv[1])):
    # logic for 1st line here.
    if i == 0: continue
    line = line.rstrip().split()
    ref = line[0]
    start, end = sorted(map(int, line[1:]))

    coverage[start - 1:end] |= 1

print len(coverage[coverage == 1])

and run from the commandline with your alignment file as the 1st argument.

ADD COMMENT
1
Entering edit mode

@nuin: well, numpy is pretty usual for a bioinformatics program

ADD REPLY
0
Entering edit mode

btw, the above uses memory proportional to the size of the reference sequence, independent of the number and content of reads. i think that's important for NGS.

ADD REPLY
0
Entering edit mode

that's quite elegant

ADD REPLY
0
Entering edit mode

+1 How can one not love Python when it makes things so easy, beautiful, AND understandable!

ADD REPLY
0
Entering edit mode

+1 for elegance

ADD REPLY
0
Entering edit mode

... until you find out that numpy is not installed in your system ...

ADD REPLY
0
Entering edit mode

The program only consumes about 250MB memory yet it creates a 300 million long array - that surprised me; then I noticed the one byte long datatype, neat!

ADD REPLY
0
Entering edit mode

Ideally, imports should be limited to modules distributed with Python.

ADD REPLY
0
Entering edit mode

it is on a Mac ;-)

ADD REPLY
5
Entering edit mode
14.6 years ago
Malcolm.Cook ★ 1.5k

If you fancy perl, and have faith in the bazaar or are generally lazy, then install Set::IntSpan (a module I find generally useful for genomic interval hacking in Perl)

allowing you to write this one-liner:

perl -MSet::IntSpan -an -e 'BEGIN{$c=Set::IntSpan->new} next if $. == 1; $c += [[sort @F[1,2]]]; END{print $c->size}'
ADD COMMENT
0
Entering edit mode

cool. Set::IntSpan looks like a nice abstraction. might look nicer on multiple lines. :-)

ADD REPLY
0
Entering edit mode

This works fine. Set::IntSpan allows us to avoid reinventing the wheel. The code is both clear and brief. It allows us to think in terms of the domain, rather than the implementation. Thank you!

ADD REPLY
0
Entering edit mode

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

ADD REPLY
0
Entering edit mode

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

ADD REPLY
0
Entering edit mode

Fine! Then especially if you're hacking over spliced transcripts / gapped alignments in perl you might also like its extension Set::IntSpan::Island

ADD REPLY
4
Entering edit mode
14.6 years ago

My C++ solution (does NOT need to store all the data in memory. You just need to keep the contigs.

Compiling/Testing

all:
    g++ coverage.cpp
    echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
    time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
      -e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out

Output:

g++ coverage.cpp
echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
coverage=81 pb
time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
          -e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out
coverage=114657440 pb

real    0m3.403s
user    0m0.076s
sys     0m0.008s
ADD COMMENT
0
Entering edit mode

@pierre, by "it does not need to store the results in memory", that's only for overlaps, right? if there are no overlaps, it will store everything, or am i misreading your code?

ADD REPLY
0
Entering edit mode

you're right. It just stores the contigs 'on the fly'. No need to (1) first load all the data (2) remove the overlaps

ADD REPLY
0
Entering edit mode

@brentp I fixed the sentence.

ADD REPLY
2
Entering edit mode
14.6 years ago
Bilouweb ★ 1.1k

I made it in C++ because it is my "everyday life" language ;)

The idea is :

  1. read queries and swap if start > stop
  2. sort queries based on "start"
  3. merge queries if they overlap
  4. sum the coverage of each query

#include <vector>
#include <iostream>
#include <algorithm>

using namespace std;

// Class to store start and stop positions
class Query
{
  public :
    int start;
    int stop; 
    Query(int i, int j):start(i),stop(j){
      if (start > stop)
        swap();  
    }

    void swap(){
      int i(start);
      start = stop;
      stop = i;
    }
};

// function to compare start positions of two queries
bool CompQuery (Query i,Query j){return (i.start < j.start);}

int main(int argc, char * argv)
{
  // Create query tab with start and stop positions
  vector<Query> tab;
  vector<Query>::iterator it;
  tab.push_back(Query(100,120));
  tab.push_back(Query(130,110));
  tab.push_back(Query(50,99));
  tab.push_back(Query(165,140));

  // sort query tab by increasing start positions
  sort(tab.begin(), tab.end(), CompQuery);

  // Merge Queries if they cover adjacent segments
  // example : 50->99 and 100->120 are merged in 50->120
  it = tab.begin() + 1;
  int cover = 0;
  while(it != tab.end()){
    if ((*it).start <= (*(it-1)).stop + 1){
      if ((*it).stop > (*(it-1)).stop){
        (*(it-1)).stop = (*it).stop;
      }
      it = tab.erase(it);
    } else {
      cout << (*it).start<<","<<(*it).stop<< "\n";
      cover += (*(it-1)).stop - (*(it-1)).start + 1;
      it++;
    }
  }
  cover += (*(it-1)).stop -(*(it-1)).start + 1;

  // Print Coverage
  cout << "\nCoverage = "<<cover<<"\n";
  return 0;
}
ADD COMMENT
0
Entering edit mode

The last two steps can be made in one loop

ADD REPLY
0
Entering edit mode

advice: swap is a standard function of C++, you could overload the '<' operator rather than declaring ComQuery, you could also avoid to put everything in memory

ADD REPLY
0
Entering edit mode

Yes, I saw this in your code. Good job !

ADD REPLY
0
Entering edit mode

Rennes ? Nantes :-)

ADD REPLY
0
Entering edit mode

The answer lies in Brittany ;-)

ADD REPLY
2
Entering edit mode
14.6 years ago

Here is a version I wrote using R and Bioconductor's IRanges

#!/usr/bin/Rscript
library("IRanges",warn.conflicts=FALSE)
args<-commandArgs(TRUE)
mytab<-read.table(args[1],header=TRUE)
mytab$starts<-apply(mytab[,c(2,3)],1,min)
mytab$stops<-apply(mytab[,c(2,3)],1,max)
myrange<-IRanges(start=mytab$starts,end=mytab$stops)
sum(width(reduce(myrange)))
ADD COMMENT
1
Entering edit mode
14.6 years ago

I swear I've written this code before somewhere, I'll see if I can dig it up later.

In the meantime, try this:

  • In one quick pass, flip any start/stop values and sort the file:

[?]

  • Now, you step through the outfile, merging segments.

Rough algorithm:

  • coverage = 0
  • loop through the reads
  • store the first segment in a buffer
  • check to see if the current segment's start is <= old segment stop
    • if so, update the oldSeg's stop position
    • if not, add the length of the old segment to coverage, then set the oldSeg = currSeg
  • at the end of the loop, clear the last segment out of the buffer and add it's length to coverage.
  • coverage/referenceSize = coverage percentage
ADD COMMENT
0
Entering edit mode

To be clear, other algs may be faster, but this requires almost no memory and scales to as many reads as you've got.

ADD REPLY
0
Entering edit mode

i think it would be great to see a ruby version

ADD REPLY
1
Entering edit mode
14.6 years ago
Yuri ★ 1.7k

Just for the collection, the code in MATLAB (without using any library/toolbox):

q = dlmread('queries.txt','\t',1,1); %# read the file
q = sort(q,2); %# start < end
qidx = ones(size(q)); %# start = 1
qidx(:,2) = -1; %# end = -1
[qsorted, si] = sort(q(:));
sidx = qidx(si);
gap = find(cumsum(sidx)==0); %# find start indices of gaps
boundaryidx = reshape(unique([1; gap; gap(1:end-1)+1]),2,[]);
coverage = sum(diff(qsorted(boundaryidx))+1);
disp(coverage)
ADD COMMENT

Login before adding your answer.

Traffic: 2344 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6