I am trying to do some interval range operations on two files with folowing conditions check if chrom is equal, then check if my start and end of of co0rdinatefile is within or equal to start and end of gene_annotation file (condition if the strand is "+" the start and end will be for eg 10-20, if its "-" it will be 20-10) , if match print start end strand from coordinate and gene_id, gene_name from geneannotation file. (for representation purpose I have head annoataion file )
number of rows in annotation file ~50000 number of rows in coordinated file ~200,000
gene_annotationfile
import csv
with open('~/coordintes.txt', 'r') as source:
coordinates = list(csv.reader(source, delimiter="\t"))
# print coordinates
with open('~/gene_final_anno.csv', 'r') as source:
annotations = list(csv.reader(source, delimiter=","))
# print annotations
for index,line in enumerate(coordinates):
for index2,line2 in enumerate(annotations):
# print line2[1]
if line[0] == line2[0] and line[3] == "-":
line2[1] >= line[1] and line2[2] <= line[2]
else:
line[1] <= line2[1] and line2[2] >= line[2]
print "%s\t%s\t%s\t%s\t%s\t%s" % (line[0],line[1],line[2],line[5],line2[4],line2[5])
break
this is the output that I get
chrom start end strand gene_id gene_name
4 3356893 3359510 + gene_id gene_name
4 3363294 3424125 - gene_id gene_name
4 3413587 3424125 - gene_id gene_name
4 3424201 3425964 - gene_id gene_name
4 3424201 3424529 - gene_id gene_name
4 3424197 3425964 - gene_id gene_name
Not sure where am I going wrong any help is appreciated
In your script, you forgot to convert start and end coordinates into integer numbers. If you show us a sample of your query files (i.e. coordinates, gene_final_anno.csv) we would probably be more helpful.
sorry should have done this when I posted the question gene_final_anno.csv (head)
coordinates head
gene_final_anno, is nothing but annotation file from biomart, for mm9.37.67, and coordinates file is from tophat junction bed representing junctions need to annotate each junction with gene id and name.
As written by combio.pl
That's important.
Ok will change that, any more pointers?
Why are you using enumerate? I don't see where you use those
index
andindex2
variables.It's to me not really clear what you aim to achieve, but I have the idea that more appropriate tools exist rather than rolling your own solution.
@WouterDeCoster upstream and downstream of this code is in python hence..
There might be solution using the htseq-count python module, or similar.
htseq count uses sorted bam (by name) and gtf right? how can I use it for such operations?
I don't have code ready for that, but have a look at http://www-huber.embl.de/HTSeq/doc/tour.html#genomic-intervals-and-genomic-arrays