Question

pythonic equivalent to reduce() in R GRanges - how to collapse ranged data?

2

Entering edit mode

6.8 years ago

christopher medway ▴ 460

I posted this question on stackoverflow, but it did not get a response. It is a bioinformatics query, so perhaps this is a better forum: Is there are pythonic equivalent to the R ranges operation below?

In R (albeit longwinded):

Here is a test data.frame

df <- data.frame(
  "CHR" = c(1,1,1,2,2),
  "START" = c(100, 200, 300, 100, 400),
  "STOP" = c(150,350,400,500,450)
  )

First I make GRanges object:

gr <- GenomicRanges::GRanges(
  seqnames = df$CHR,
  ranges = IRanges(start = df$START, end = df$STOP)
  )

Then I reduce the intervals to collapse into new granges object:

reduced <- reduce(gr)

Now append a new column to original dataframe which confirms which rows belong to the same contiguous 'chunk'.

subjectHits(findOverlaps(gr, reduced))

Output:

> df
  CHR START STOP locus
1   1   100  150     1
2   1   200  350     2
3   1   300  400     2
4   2   100  500     3
5   2   400  450     3

How do I do this in Python?

R python • 2.4k views

ADD COMMENT • link updated 5.7 years ago by endrebak ▴ 980 • written 6.8 years ago by christopher medway ▴ 460

1

Entering edit mode

in what structure is your data stored in python?

ADD REPLY • link 6.8 years ago by Martombo ★ 3.1k

0

Entering edit mode

The data is stored as a CSV to disk. As a python newbie, I guess I would load as a pandas table, but I am open to suggestions.

ADD REPLY • link 6.8 years ago by christopher medway ▴ 460

score 4 · Accepted Answer · 2019-04-14

https://github.com/biocore-ntnu/pyranges

import pyranges as pr
chromosomes = [1] * 3 + [2] * 2
starts = [100, 200, 300, 100, 400]
ends = [150, 350, 400, 500, 450]
gr = pr.PyRanges(chromosomes=chromosomes, starts=starts, ends=ends)
print(gr.cluster())
# +--------------+-----------+-----------+-----------+
# |   Chromosome |     Start |       End |   Cluster |
# |       (int8) |   (int32) |   (int32) |   (int64) |
# |--------------+-----------+-----------+-----------|
# |            1 |       100 |       150 |         1 |
# |            1 |       200 |       350 |         2 |
# |            1 |       300 |       400 |         2 |
# |            2 |       100 |       500 |         3 |
# |            2 |       400 |       450 |         3 |
# +--------------+-----------+-----------+-----------+

It will be out in 0.0.21. Thanks for the idea!