Michael Dondrup's comment is probably the best solution if you want to do anything complex with ranges, but for the problem stated (intergenic distance calculation), you really only need a few lines of python code:
# I don't know what your data looks like, so I'll start with a list of
# name,chromosome,start_pos,end_pos tuples
gene_positions = [('geneA', 'chr1', 100,200), ('geneB', 'chr1', 300,400),
('geneC', 'chr1', 401, 450), ('geneD', 'chr2', 100,200)]
# sorting genes by the chromosome and start_position fields so they're in order
gene_positions.sort(key = lambda x: (x[1],x[2]))
# start with intergenic distances as an empty list
intergenic_distances = []
# using zip to get a list of pairs of adjacent genes
for gene1,gene2 in zip(gene_positions, gene_positions[1:]):
(gene1_name,gene1_chr,gene1_start,gene1_end) = gene1
(gene2_name,gene2_chr,gene2_start,gene2_end) = gene2
# only take gene pairs on the same chromosome
if gene1_chr == gene2_chr:
# add the distance and both gene names to intergenic_distance
# (may need to adjust the distance math depending
# if the positions are end-inclusive)
intergenic_distances.append((gene2_start - gene1_end,
gene1_name, gene2_name))
I don't know what you want to do with the data afterward, but you get a list of (intergenic_distance, gene1_name, gene2_name) tuples, like this:
for distance,gene1,gene2 in intergenic_distances:
print "Distance between %s and %s is %s"%(gene1,gene2,distance)
# For the example data above, this prints the following:
# Distance between geneA and geneB is 100
# Distance between geneB and geneC is 1
Of course, you probably already knew that.
There are several python packages and other alternatives mentioned here: http://biostar.stackexchange.com/questions/2245/what-is-the-quickest-algorithm-for-range-overlap
I remember that package was called 'pygr', but I haven't used it myself.
I think it would be better if you paste a little portion of your input file on your question.
Hi Abhi. Can you post your input file?