Quick Programming Challenge: How Do I Calculate Reference Coverage From A Table Of Alignment Starts And Ends?

Entering edit mode

15.0 years ago

Jeremy Leipzig 23k

Imagine I have a reference sequence and some query sequences align to it

length=|end-start|+1

when end < start the query sequence has aligned to the negative strand

query   start stop
query1   100   120
query2    50    99
query3   130   110

I want to know the total coverage of the reference sequence, ignoring overlaps, so the number of reference bases covered by one or more alignments.

In this case it is 21+50+10=81

Any language is fine. Assume the input is tab delimited.

Note:This is not an invitation to recommend your favorite third-party alignment tool - you must deal with the data as I have presented it.

code alignment programming • 8.8k views

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 15.0 years ago by Jeremy Leipzig 23k

Entering edit mode

This problem becomes a lot easier if the start stop are in the form of start<stop, (you'll then need a strand column of course) then in addition you sort the file by the start. That's what I would do first rather than come up with a more complex solution.

ADD REPLY • link 15.0 years ago by Istvan Albert 102k

Entering edit mode

Is it sorted or not?

ADD REPLY • link 15.0 years ago by Paulo Nuin ★ 3.7k

Entering edit mode

We need more of these challenges :)

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

Entering edit mode

that would be considered part of the solution

ADD REPLY • link 15.0 years ago by Jeremy Leipzig 23k

Entering edit mode

that could be considered part of the solution

ADD REPLY • link 15.0 years ago by Jeremy Leipzig 23k

Entering edit mode

not sorted -i'll make that clearer

ADD REPLY • link 15.0 years ago by Jeremy Leipzig 23k

Entering edit mode

Correction: "In this case it is 21+50+21=91" since |130-110|+1=21, not 10.

ADD REPLY • link 13.7 years ago by Nikolay Vyahhi ★ 1.3k

Entering edit mode

Correction: "In this case it is 21+50+21=92" since |130-110|+1=21, not 10.

ADD REPLY • link 13.7 years ago by Nikolay Vyahhi ★ 1.3k

Entering edit mode

15.0 years ago

Eric Normandeau 11k

Here is another approach in Python making few assumptions and using very little memory:

in_file = "queries.txt"
coverage = set()

with open(in_file) as f:
    for line in f:
        try:
            query, v1, v2 = line.strip().split("\t")
            v_min, v_max = sorted([int(v1), int(v2)])
        except:
            continue
        coverage.update(range(v_min, v_max + 1)) # as suggested by brentp

print "Reference coverage:", len(coverage)

Here is what I get when I run it:

time python golf-course.py 
Reference coverage: 81

real    0m0.026s
user    0m0.016s
sys     0m0.008ss

It should also be pretty 'Pythonic' :)

Cheers

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 15.0 years ago by Eric Normandeau 11k

Entering edit mode

that is probably the most readable yet. great job

ADD REPLY • link 15.0 years ago by Jeremy Leipzig 23k

Entering edit mode

pythonic would use coverage.update(xrange(v_min, vmax + 1)) instead of using a loop.

ADD REPLY • link 15.0 years ago by brentp 24k

Entering edit mode

That's true, I made the change! Thanks @brentp

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

Entering edit mode

@eric upvoted, i like this solution. just to note that try/except/continue for all errors is going to make it hard to debug in the case of weird input.

ADD REPLY • link 15.0 years ago by brentp 24k

Entering edit mode

@brentp Yes, I considered making 2 or 3 try/except cases, each with an possible 'error message' and that would have made it easier to debug. I just wanted a very concise and comprehensible code that did the trick, given the input data of this problem. Cheers

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

Entering edit mode

I don't use python but it's a very elegant code... +1

ADD REPLY • link 15.0 years ago by Bilouweb ★ 1.1k

Entering edit mode

15.0 years ago

brentp 24k

Here's a quick solution, and should run pretty fast. you can easily make it work per reference as well, though here i just assume you have only a single reference as you suggest.

import numpy as np
import sys
# since we dont know the max len beforehand,
# just use a big number and then chop it later.
MAX_LEN = 3e8

coverage = np.zeros((MAX_LEN,), dtype=np.uint8)

for i, line in enumerate(open(sys.argv[1])):
    # logic for 1st line here.
    if i == 0: continue
    line = line.rstrip().split()
    ref = line[0]
    start, end = sorted(map(int, line[1:]))

    coverage[start - 1:end] |= 1

print len(coverage[coverage == 1])

and run from the commandline with your alignment file as the 1st argument.

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 15.0 years ago by brentp 24k

Entering edit mode

@nuin: well, numpy is pretty usual for a bioinformatics program

ADD REPLY • link 15.0 years ago by Giovanni M Dall'Olio 28k

Entering edit mode

btw, the above uses memory proportional to the size of the reference sequence, independent of the number and content of reads. i think that's important for NGS.

ADD REPLY • link 15.0 years ago by brentp 24k

Entering edit mode

that's quite elegant

ADD REPLY • link 15.0 years ago by Jeremy Leipzig 23k

Entering edit mode

+1 How can one not love Python when it makes things so easy, beautiful, AND understandable!

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

Entering edit mode

+1 for elegance

ADD REPLY • link 15.0 years ago by Pierre Lindenbaum 166k

Entering edit mode

... until you find out that numpy is not installed in your system ...

ADD REPLY • link 15.0 years ago by Paulo Nuin ★ 3.7k

Entering edit mode

The program only consumes about 250MB memory yet it creates a 300 million long array - that surprised me; then I noticed the one byte long datatype, neat!

ADD REPLY • link 15.0 years ago by Istvan Albert 102k

Entering edit mode

Ideally, imports should be limited to modules distributed with Python.

ADD REPLY • link 15.0 years ago by Paulo Nuin ★ 3.7k

Entering edit mode

it is on a Mac ;-)

ADD REPLY • link 15.0 years ago by Istvan Albert 102k

Entering edit mode

15.0 years ago

Malcolm.Cook ★ 1.5k

If you fancy perl, and have faith in the bazaar or are generally lazy, then install Set::IntSpan (a module I find generally useful for genomic interval hacking in Perl)

allowing you to write this one-liner:

perl -MSet::IntSpan -an -e 'BEGIN{$c=Set::IntSpan->new} next if $. == 1; $c += [[sort @F[1,2]]]; END{print $c->size}'

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 15.0 years ago by Malcolm.Cook ★ 1.5k

Entering edit mode

cool. Set::IntSpan looks like a nice abstraction. might look nicer on multiple lines. :-)

ADD REPLY • link 15.0 years ago by brentp 24k

Entering edit mode

This works fine. Set::IntSpan allows us to avoid reinventing the wheel. The code is both clear and brief. It allows us to think in terms of the domain, rather than the implementation. Thank you!

ADD REPLY • link 14.0 years ago by Timur Shtatland • 0

Entering edit mode

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

ADD REPLY • link 14.0 years ago by Malcolm.Cook ★ 1.5k

Entering edit mode

Fine! Then you might also like its extension, Set::IntSpan::Island, especially if you're hacking over spliced transcripts / gapped alignments in perl.

ADD REPLY • link 14.0 years ago by Malcolm.Cook ★ 1.5k

Entering edit mode

Fine! Then especially if you're hacking over spliced transcripts / gapped alignments in perl you might also like its extension Set::IntSpan::Island

ADD REPLY • link 14.0 years ago by Malcolm.Cook ★ 1.5k

Entering edit mode

15.0 years ago

Pierre Lindenbaum 166k

My C++ solution (does NOT need to store all the data in memory. You just need to keep the contigs.

	#include <iostream>
	#include <vector>
	#include <cstdlib>
	#include <cerrno>
	#include <algorithm>

	using namespace std;

	class Range
	{
	public:
	int start;
	int end;
	inline bool operator < (const Range &other) const
	{
	return start < other.start;
	}
	inline bool overlap(const Range &other) const
	{
	return !(end<=other.start \|\| other.end<=start);
	}
	void merge(const Range &other)
	{
	start=min(start,other.start);
	end=max(end,other.end);
	}
	};

	int main(int argc,char **argv)
	{
	const char TAB='\t';
	vector<Range> ranges;
	string line;

	if(!getline(cin,line,'\n'))
	{
	cerr << "first line missing" << endl;
	return EXIT_FAILURE;
	}
	if(line.compare("query\tstart\tstop")!=0)
	{
	cerr << "bad header \""<< line << "\"" << endl;
	return EXIT_FAILURE;
	}

	while(getline(cin,line,'\n'))
	{
	int d1=line.find(TAB);
	if(d1==string::npos )
	{
	cerr << "no tab in "<< line << endl;
	return EXIT_FAILURE;
	}
	else if(d1==0)
	{
	cerr << "no name in "<< line << endl;
	return EXIT_FAILURE;
	}
	int d2=line.find(TAB,d1+1);
	if(d2==string::npos)
	{
	cerr << "second tab missing in "<< line << endl;
	return EXIT_FAILURE;
	}
	Range range;
	char p=const_cast<char>(line.c_str());
	char *p2;
	p[d2]=0;
	errno=0;
	range.start=strtod(&p[d1+1],&p2);
	if(range.start<0 \|\| errno!=0 \|\| *p2!=0)
	{
	cerr << "bad start in "<< line << endl;
	return EXIT_FAILURE;
	}
	errno=0;
	range.end=strtod(&p[d2+1],&p2);
	if(range.end<0 \|\| errno!=0 \|\| *p2!=0)
	{
	cerr << "bad end in "<< line << endl;
	return EXIT_FAILURE;
	}
	if(range.start>range.end)
	{
	swap(range.start,range.end);
	}

	if(ranges.empty())
	{
	ranges.push_back(range);
	}
	else
	{
	vector<Range>::iterator iter= lower_bound(
	ranges.begin(),
	ranges.end(),
	range
	);

	int index= distance(ranges.begin(),iter);
	ranges.insert( iter, range );

	bool done=false;
	do
	{
	done=true;

	//remove right side
	while(index+1 < ranges.size() &&
	ranges[index].overlap(ranges[index+1]))
	{

	ranges[index].merge(ranges[index+1]);
	ranges.erase(ranges.begin()+(index+1));
	done=false;
	}
	//remove left side
	while(index>0 &&
	ranges[index-1].overlap(ranges[index]))
	{

	ranges[index-1].merge(ranges[index]);
	ranges.erase(ranges.begin()+index);
	done=false;
	--index;
	}

	} while(!done);
	}

	}
	int sum=0;
	for(vector<Range>::const_iterator i=ranges.begin();
	i!=ranges.end();
	++i)
	{
	sum+=((i).end - (i).start)+1;
	}
	cout << "coverage="<< sum << " pb" << endl;
	return 0;
	}

view raw biostars-835.cpp hosted with ❤ by GitHub

Compiling/Testing

all:
    g++ coverage.cpp
    echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
    time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
      -e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out

Output:

g++ coverage.cpp
echo -e "query\tstart\tstop\nquery1\t100\t120\nquery2\t50\t99\nquery3\t130\t110" | ./a.out
coverage=81 pb
time mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  \
          -e 'select name as "query",if(strand="-",txEnd+1,txStart) as "start",if(strand="-",txStart,txEnd+1) as "stop" from knownGene where chrom="chr1"' | ./a.out
coverage=114657440 pb

real    0m3.403s
user    0m0.076s
sys     0m0.008s

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 15.0 years ago by Pierre Lindenbaum 166k

Entering edit mode

@pierre, by "it does not need to store the results in memory", that's only for overlaps, right? if there are no overlaps, it will store everything, or am i misreading your code?

ADD REPLY • link 15.0 years ago by brentp 24k

Entering edit mode

you're right. It just stores the contigs 'on the fly'. No need to (1) first load all the data (2) remove the overlaps

ADD REPLY • link 15.0 years ago by Pierre Lindenbaum 166k

Entering edit mode

@brentp I fixed the sentence.

ADD REPLY • link 15.0 years ago by Pierre Lindenbaum 166k

Entering edit mode

15.0 years ago

Bilouweb ★ 1.1k

I made it in C++ because it is my "everyday life" language ;)

The idea is :

read queries and swap if start > stop
sort queries based on "start"
merge queries if they overlap
sum the coverage of each query

#include <vector>
#include <iostream>
#include <algorithm>

using namespace std;

// Class to store start and stop positions
class Query
{
  public :
    int start;
    int stop; 
    Query(int i, int j):start(i),stop(j){
      if (start > stop)
        swap();  
    }

    void swap(){
      int i(start);
      start = stop;
      stop = i;
    }
};

// function to compare start positions of two queries
bool CompQuery (Query i,Query j){return (i.start < j.start);}

int main(int argc, char * argv)
{
  // Create query tab with start and stop positions
  vector<Query> tab;
  vector<Query>::iterator it;
  tab.push_back(Query(100,120));
  tab.push_back(Query(130,110));
  tab.push_back(Query(50,99));
  tab.push_back(Query(165,140));

  // sort query tab by increasing start positions
  sort(tab.begin(), tab.end(), CompQuery);

  // Merge Queries if they cover adjacent segments
  // example : 50->99 and 100->120 are merged in 50->120
  it = tab.begin() + 1;
  int cover = 0;
  while(it != tab.end()){
    if ((*it).start <= (*(it-1)).stop + 1){
      if ((*it).stop > (*(it-1)).stop){
        (*(it-1)).stop = (*it).stop;
      }
      it = tab.erase(it);
    } else {
      cout << (*it).start<<","<<(*it).stop<< "\n";
      cover += (*(it-1)).stop - (*(it-1)).start + 1;
      it++;
    }
  }
  cover += (*(it-1)).stop -(*(it-1)).start + 1;

  // Print Coverage
  cout << "\nCoverage = "<<cover<<"\n";
  return 0;
}

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 15.0 years ago by Bilouweb ★ 1.1k

Entering edit mode

The last two steps can be made in one loop

ADD REPLY • link 15.0 years ago by Bilouweb ★ 1.1k

Entering edit mode

advice: swap is a standard function of C++, you could overload the '<' operator rather than declaring ComQuery, you could also avoid to put everything in memory

ADD REPLY • link 15.0 years ago by Pierre Lindenbaum 166k

Entering edit mode

Yes, I saw this in your code. Good job !

ADD REPLY • link 15.0 years ago by Bilouweb ★ 1.1k

Entering edit mode

Rennes ? Nantes :-)

ADD REPLY • link 15.0 years ago by Pierre Lindenbaum 166k

Entering edit mode

The answer lies in Brittany ;-)

ADD REPLY • link 15.0 years ago by Bilouweb ★ 1.1k

Entering edit mode

15.0 years ago

Jeremy Leipzig 23k

Here is a version I wrote using R and Bioconductor's IRanges

#!/usr/bin/Rscript
library("IRanges",warn.conflicts=FALSE)
args<-commandArgs(TRUE)
mytab<-read.table(args[1],header=TRUE)
mytab$starts<-apply(mytab[,c(2,3)],1,min)
mytab$stops<-apply(mytab[,c(2,3)],1,max)
myrange<-IRanges(start=mytab$starts,end=mytab$stops)
sum(width(reduce(myrange)))

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 15.0 years ago by Jeremy Leipzig 23k

Entering edit mode

15.0 years ago

Chris Miller 22k

I swear I've written this code before somewhere, I'll see if I can dig it up later.

In the meantime, try this:

In one quick pass, flip any start/stop values and sort the file:

[?]

Now, you step through the outfile, merging segments.

Rough algorithm:

coverage = 0
loop through the reads
store the first segment in a buffer
check to see if the current segment's start is <= old segment stop
- if so, update the oldSeg's stop position
- if not, add the length of the old segment to coverage, then set the oldSeg = currSeg
at the end of the loop, clear the last segment out of the buffer and add it's length to coverage.
coverage/referenceSize = coverage percentage

ADD COMMENT • link 15.0 years ago by Chris Miller 22k

Entering edit mode

To be clear, other algs may be faster, but this requires almost no memory and scales to as many reads as you've got.

ADD REPLY • link 15.0 years ago by Chris Miller 22k

Entering edit mode

i think it would be great to see a ruby version

ADD REPLY • link 15.0 years ago by Jeremy Leipzig 23k

Entering edit mode

15.0 years ago

Yuri ★ 1.7k

Just for the collection, the code in MATLAB (without using any library/toolbox):

q = dlmread('queries.txt','\t',1,1); %# read the file
q = sort(q,2); %# start < end
qidx = ones(size(q)); %# start = 1
qidx(:,2) = -1; %# end = -1
[qsorted, si] = sort(q(:));
sidx = qidx(si);
gap = find(cumsum(sidx)==0); %# find start indices of gaps
boundaryidx = reshape(unique([1; gap; gap(1:end-1)+1]),2,[]);
coverage = sum(diff(qsorted(boundaryidx))+1);
disp(coverage)

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 15.0 years ago by Yuri ★ 1.7k