Handling Ranged Data In Python
2
3
Entering edit mode
12.8 years ago
Abhi ★ 1.6k

Hi Guys

I am wondering if there is something that I could use to handle ranged data in python similar to GenomicRanges/IRanges in R ?

A specific example:

My input data is gene coordinates and chromosomes. For each gene I would like to know the end coordinate of the gene just before it to calculate the intergenic distance.

Thanks! -Abhi

python biopython r • 5.0k views
ADD COMMENT
4
Entering edit mode

There are several python packages and other alternatives mentioned here: http://biostar.stackexchange.com/questions/2245/what-is-the-quickest-algorithm-for-range-overlap

ADD REPLY
0
Entering edit mode

I remember that package was called 'pygr', but I haven't used it myself.

ADD REPLY
0
Entering edit mode

I think it would be better if you paste a little portion of your input file on your question.

ADD REPLY
0
Entering edit mode

Hi Abhi. Can you post your input file?

ADD REPLY
4
Entering edit mode
12.7 years ago
Weronika ▴ 300

Michael Dondrup's comment is probably the best solution if you want to do anything complex with ranges, but for the problem stated (intergenic distance calculation), you really only need a few lines of python code:

# I don't know what your data looks like, so I'll start with a list of
#  name,chromosome,start_pos,end_pos tuples
gene_positions = [('geneA', 'chr1', 100,200), ('geneB', 'chr1', 300,400), 
                  ('geneC', 'chr1', 401, 450), ('geneD', 'chr2', 100,200)]
# sorting genes by the chromosome and start_position fields so they're in order
gene_positions.sort(key = lambda x: (x[1],x[2]))

# start with intergenic distances as an empty list
intergenic_distances = []

# using zip to get a list of pairs of adjacent genes
for gene1,gene2 in zip(gene_positions, gene_positions[1:]):
    (gene1_name,gene1_chr,gene1_start,gene1_end) = gene1
    (gene2_name,gene2_chr,gene2_start,gene2_end) = gene2
    # only take gene pairs on the same chromosome
    if gene1_chr == gene2_chr:
        # add the distance and both gene names to intergenic_distance 
        #  (may need to adjust the distance math depending 
        #   if the positions are end-inclusive)
        intergenic_distances.append((gene2_start - gene1_end, 
                                     gene1_name, gene2_name))

I don't know what you want to do with the data afterward, but you get a list of (intergenic_distance, gene1_name, gene2_name) tuples, like this:

for distance,gene1,gene2 in intergenic_distances:
  print "Distance between %s and %s is %s"%(gene1,gene2,distance)
# For the example data above, this prints the following:
#  Distance between geneA and geneB is 100
#  Distance between geneB and geneC is 1

Of course, you probably already knew that.

ADD COMMENT
1
Entering edit mode
6.6 years ago
endrebak ▴ 980

You now can: https://github.com/endrebak/pyranges

It is still in pre-alpha, but will be updated by me as I use it all the time.

Edit: in beta now. To my knowledge it is faster and more widely tested than any other alternative.

ADD COMMENT

Login before adding your answer.

Traffic: 1871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6