Question

Handling Ranged Data In Python

3

Entering edit mode

12.7 years ago

Abhi ★ 1.6k

Hi Guys

I am wondering if there is something that I could use to handle ranged data in python similar to GenomicRanges/IRanges in R ?

A specific example:

My input data is gene coordinates and chromosomes. For each gene I would like to know the end coordinate of the gene just before it to calculate the intergenic distance.

Thanks! -Abhi

python biopython r • 4.9k views

ADD COMMENT • link updated 6.5 years ago by endrebak ▴ 970 • written 12.7 years ago by Abhi ★ 1.6k

4

Entering edit mode

There are several python packages and other alternatives mentioned here: http://biostar.stackexchange.com/questions/2245/what-is-the-quickest-algorithm-for-range-overlap

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 12.7 years ago by Michael 55k

0

Entering edit mode

I remember that package was called 'pygr', but I haven't used it myself.

ADD REPLY • link 12.7 years ago by Michael 55k

0

Entering edit mode

I think it would be better if you paste a little portion of your input file on your question.

ADD REPLY • link 12.7 years ago by Geparada ★ 1.5k

0

Entering edit mode

Hi Abhi. Can you post your input file?

ADD REPLY • link 12.7 years ago by Tommy Carstensen ▴ 40

score 4 · Answer 1 · 2012-04-01

Michael Dondrup's comment is probably the best solution if you want to do anything complex with ranges, but for the problem stated (intergenic distance calculation), you really only need a few lines of python code:

# I don't know what your data looks like, so I'll start with a list of
#  name,chromosome,start_pos,end_pos tuples
gene_positions = [('geneA', 'chr1', 100,200), ('geneB', 'chr1', 300,400), 
                  ('geneC', 'chr1', 401, 450), ('geneD', 'chr2', 100,200)]
# sorting genes by the chromosome and start_position fields so they're in order
gene_positions.sort(key = lambda x: (x[1],x[2]))

# start with intergenic distances as an empty list
intergenic_distances = []

# using zip to get a list of pairs of adjacent genes
for gene1,gene2 in zip(gene_positions, gene_positions[1:]):
    (gene1_name,gene1_chr,gene1_start,gene1_end) = gene1
    (gene2_name,gene2_chr,gene2_start,gene2_end) = gene2
    # only take gene pairs on the same chromosome
    if gene1_chr == gene2_chr:
        # add the distance and both gene names to intergenic_distance 
        #  (may need to adjust the distance math depending 
        #   if the positions are end-inclusive)
        intergenic_distances.append((gene2_start - gene1_end, 
                                     gene1_name, gene2_name))

I don't know what you want to do with the data afterward, but you get a list of (intergenic_distance, gene1_name, gene2_name) tuples, like this:

for distance,gene1,gene2 in intergenic_distances:
  print "Distance between %s and %s is %s"%(gene1,gene2,distance)
# For the example data above, this prints the following:
#  Distance between geneA and geneB is 100
#  Distance between geneB and geneC is 1

Of course, you probably already knew that.

score 1 · Answer 2 · 2018-05-24

1

Entering edit mode

6.5 years ago

endrebak ▴ 970

You now can: https://github.com/endrebak/pyranges

It is still in pre-alpha, but will be updated by me as I use it all the time.

Edit: in beta now. To my knowledge it is faster and more widely tested than any other alternative.

ADD COMMENT • link 5.6 years ago by endrebak ▴ 970