Tool: (pre-alpha) pyranges: performant, pythonic GenomicRanges
6.6 years ago
endrebak ▴ 980

GenomicRanges for Python.

This library tries to be a thin but extremely useful wrapper around genomic data contained in pandas DataFrames. This gives you all the wonderful functionality of bedtools/bedops and/or GenomicRanges, while still letting you use the enormous universe of Python data science libraries to manipulate and do computations on the data.

PyRanges also contains a run-length encoding library for extremely efficient arithmetic computation of scores associated with genomic intervals.

Repo: https://github.com/endrebak/pyranges

Docs: http://pyranges.readthedocs.io/

pip install pyranges # Try the examples in the docs, whydontcha

Most desired: feedback, bug reports and ideas. I do not need PRs yet, as the underlying code might change greatly.

>>> import pyranges as pr

>>> cs = pr.load_dataset("chipseq")

>>> cs

+--------------|-----------|-----------|--------|---------|----------+
| Chromosome   | Start     | End       | Name   | Score   | Strand   |
|--------------|-----------|-----------|--------|---------|----------|
| chr8         | 28510032  | 28510057  | U0     | 0       | -        |
| chr7         | 107153363 | 107153388 | U0     | 0       | -        |
| chr5         | 135821802 | 135821827 | U0     | 0       | -        |
| ...          | ...       | ...       | ...    | ...     | ...      |
| chr6         | 89296757  | 89296782  | U0     | 0       | -        |
| chr1         | 194245558 | 194245583 | U0     | 0       | +        |
| chr8         | 57916061  | 57916086  | U0     | 0       | +        |
+--------------|-----------|-----------|--------|---------|----------+
PyRanges object has 10000 sequences from 24 chromosomes.

>>> bg = pr.load_dataset("chipseq_background")

>>> cs.nearest(bg, suffix="_IP")

+--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------+
| Chromosome   | Start    | End      | Name   | Score   | Strand   | Chromosome_IP   | Start_IP   | End_IP   | Name_IP   | Score_IP   | Strand_IP   | Distance   |
|--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------|
| chr1         | 1325303  | 1325328  | U0     | 0       | -        | chr1            | 1041102    | 1041127  | U0        | 0          | +           | 284176     |
| chr1         | 1541598  | 1541623  | U0     | 0       | +        | chr1            | 1770383    | 1770408  | U0        | 0          | -           | 228760     |
| chr1         | 1599121  | 1599146  | U0     | 0       | +        | chr1            | 1770383    | 1770408  | U0        | 0          | -           | 171237     |
| ...          | ...      | ...      | ...    | ...     | ...      | ...             | ...        | ...      | ...       | ...        | ...         | ...        |
| chrY         | 21910706 | 21910731 | U0     | 0       | -        | chrY            | 20557165   | 20557190 | U0        | 0          | +           | 1353516    |
| chrY         | 22054002 | 22054027 | U0     | 0       | -        | chrY            | 20557165   | 20557190 | U0        | 0          | +           | 1496812    |
| chrY         | 22210637 | 22210662 | U0     | 0       | -        | chrY            | 20557165   | 20557190 | U0        | 0          | +           | 1653447    |
+--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------+
PyRanges object has 10000 sequences from 24 chromosomes.
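If you want to see what the Distance column means, here is a minimal pure-Python sketch. It is not how pyranges does it internally (this version is a brute-force scan), and `Interval`, `gap` and `nearest` are hypothetical names made up for illustration. It assumes half-open coordinates, which is consistent with the distances printed above.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # assumed inclusive
    end: int    # assumed exclusive


def gap(a: Interval, b: Interval) -> int:
    """Bases between two half-open intervals; 0 if they overlap or touch."""
    return max(0, b.start - a.end, a.start - b.end)


def nearest(query: Interval,
            candidates: Sequence[Interval]) -> Optional[Tuple[Interval, int]]:
    """Return (closest candidate on the same chromosome, distance), or None."""
    same_chrom = [c for c in candidates if c.chrom == query.chrom]
    if not same_chrom:
        return None
    best = min(same_chrom, key=lambda c: gap(query, c))
    return best, gap(query, best)


# Two background intervals taken from the output above.
background = [
    Interval("chr1", 1041102, 1041127),
    Interval("chr1", 1770383, 1770408),
]

hit, distance = nearest(Interval("chr1", 1325303, 1325328), background)
print(hit, distance)
```

With the first row of the table as the query, this picks the background interval at 1041102 and reports the same distance as the table, 284176.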

>>> cs.set_intersection(bg, strandedness="opposite")

+--------------|-----------|-----------|----------+
| Chromosome   |     Start |       End | Strand   |
|--------------|-----------|-----------|----------|
| chr1         | 226987603 | 226987617 | +        |
| chr8         |  38747236 |  38747251 | -        |
+--------------|-----------|-----------|----------+
PyRanges object has 2 sequences from 2 chromosomes.
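Roughly, a set intersection treats each input as a set of covered positions: merge overlapping intervals within each input first, then intersect the two merged sets. The sketch below shows that idea in plain Python for a single chromosome. The function names are hypothetical, the opposite-strand pairing (keyed by the strand of the first input) is my assumption for illustration, and none of this is the actual pyranges code.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end), assumed half-open


def merge(spans: List[Span]) -> List[Span]:
    """Collapse overlapping or touching spans into sorted, disjoint spans."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect_merged(a: List[Span], b: List[Span]) -> List[Span]:
    """Positions covered by both a and b, as disjoint spans."""
    a, b = merge(a), merge(b)
    i = j = 0
    out: List[Span] = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever span ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


OPPOSITE = {"+": "-", "-": "+"}


def intersect_opposite(a: Dict[str, List[Span]],
                       b: Dict[str, List[Span]]) -> Dict[str, List[Span]]:
    """Pair each strand of a with the other strand of b (an assumption)."""
    return {s: intersect_merged(a.get(s, []), b.get(OPPOSITE[s], []))
            for s in "+-"}


print(intersect_merged([(0, 10), (5, 20), (30, 40)], [(8, 12), (35, 50)]))
```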

>>> cv = cs.coverage(stranded=True)
>>> cv

chr1 +
+--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------+
| Runs   |   1541598 |   25 |   57498 |   25 |   1904886 |  ...    |   25 |   2952580 |   25 |   1156833 |   25 |
|--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------|
| Values |         0 |    1 |       0 |    1 |         0 | ...     |    1 |         0 |    1 |         0 |    1 |
+--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------+
Rle of length 247134924 containing 894 elements
...
chrY -
+--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------+
| Runs   |   7046809 |   25 |   358542 |   25 |   296582 |  ...    |   25 |   143271 |   25 |   156610 |   25 |
|--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------|
| Values |         0 |    1 |        0 |    1 |        0 | ...     |    1 |        0 |    1 |        0 |    1 |
+--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------+
Rle of length 22210662 containing 32 elements
PyRles object with 48 chromosomes/strand pairs.
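To read the Runs/Values tables: each column is a run of consecutive positions with the same coverage value. A small sketch of how intervals turn into such runs, assuming half-open intervals and a simple sweep over start/end points (the function name `coverage_runs` is hypothetical, and the real library is certainly implemented differently for speed):

```python
from collections import Counter
from typing import List, Tuple

Run = Tuple[int, int]  # (run length, coverage value)


def coverage_runs(intervals: List[Tuple[int, int]]) -> List[Run]:
    """Run-length encode the per-base coverage of half-open intervals."""
    deltas: Counter = Counter()
    for start, end in intervals:
        deltas[start] += 1   # coverage goes up at each start
        deltas[end] -= 1     # and down again at each end
    runs: List[Run] = []
    depth = 0
    position = 0
    for point in sorted(deltas):
        if point > position:
            length = point - position
            if runs and runs[-1][1] == depth:
                # Same value as the previous run: extend it.
                runs[-1] = (runs[-1][0] + length, depth)
            else:
                runs.append((length, depth))
        depth += deltas[point]
        position = point
    return runs


# The first two chr1 + reads visible in the nearest() output above.
print(coverage_runs([(1541598, 1541623), (1599121, 1599146)]))
```

For those two reads this gives the runs 1541598, 25, 57498, 25 with values 0, 1, 0, 1, which are the first four columns of the chr1 + table.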

>>> cv + 10.42

chr1 +
+--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------+
| Runs   |   1541598 |    25 |   57498 |    25 |   1904886 |  ...    |    25 |   2952580 |    25 |   1156833 |    25 |
|--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------|
| Values |     10.42 | 11.42 |   10.42 | 11.42 |     10.42 | ...     | 11.42 |     10.42 | 11.42 |     10.42 | 11.42 |
+--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------+
Rle of length 247134924 containing 894 elements
...
chrY -
+--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------+
| Runs   |   7046809 |    25 |   358542 |    25 |   296582 |  ...    |    25 |   143271 |    25 |   156610 |    25 |
|--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------|
| Values |     10.42 | 11.42 |    10.42 | 11.42 |    10.42 | ...     | 11.42 |    10.42 | 11.42 |    10.42 | 11.42 |
+--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------+
Rle of length 22210662 containing 32 elements
PyRles object with 48 chromosomes/strand pairs.

>>> bg_cv = bg.coverage()

>>> cv - bg_cv
chr1
+--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------+
| Runs   |   887771 |   25 |   106864 |   25 |   46417 |  ...    |   25 |   730068 |   25 |   259250 |   25 |
|--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------|
| Values |        0 |   -1 |        0 |   -1 |       0 | ...     |    1 |        0 |   -1 |        0 |    1 |
+--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------+
Rle of length 247134924 containing 3242 elements
...
chrY
+--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------+
| Runs   |   7046809 |   25 |   147506 |   25 |   211011 |  ...    |   25 |   156610 |   25 |   35191552 |   25 |
|--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------|
| Values |         0 |    1 |        0 |    1 |        0 | ...     |    1 |        0 |    1 |          0 |   -1 |
+--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------+
Rle of length 57402239 containing 60 elements
Unstranded PyRles object with 25 chromosomes.
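The reason arithmetic on run-length encodings is cheap is that you only touch one element per run, never one per base. Below is a minimal sketch of the idea, with an Rle represented as a plain list of (run length, value) pairs. The names are hypothetical, runs are assumed to have positive length, and padding the shorter vector with zeros is an assumption on my part (it is consistent with the chrY lengths above, but check the docs).

```python
import operator
from typing import Callable, List, Tuple

Rle = List[Tuple[int, float]]  # (run length, value), run lengths > 0


def rle_combine(a: Rle, b: Rle, op: Callable[[float, float], float]) -> Rle:
    """Apply op position-wise to two Rles without expanding them."""
    a, b = list(a), list(b)
    len_a = sum(run for run, _ in a)
    len_b = sum(run for run, _ in b)
    # Assumption: the shorter vector is padded with zeros.
    if len_a < len_b:
        a.append((len_b - len_a, 0))
    elif len_b < len_a:
        b.append((len_a - len_b, 0))
    out: Rle = []
    i = j = 0
    left_a = a[0][0] if a else 0
    left_b = b[0][0] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)       # stretch where both values are constant
        value = op(a[i][1], b[j][1])
        if out and out[-1][1] == value:
            out[-1] = (out[-1][0] + step, value)
        else:
            out.append((step, value))
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            left_a = a[i][0] if i < len(a) else 0
        if left_b == 0:
            j += 1
            left_b = b[j][0] if j < len(b) else 0
    return out


def rle_add_scalar(rle: Rle, scalar: float) -> Rle:
    """Adding a scalar changes the values only; the runs stay the same."""
    return [(run, value + scalar) for run, value in rle]


chip = [(3, 0), (2, 1), (5, 0)]
background = [(4, 0), (3, 1)]
print(rle_combine(chip, background, operator.sub))
print(rle_add_scalar(chip, 10.42))
```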

Update: pyranges has been accepted in Bioinformatics. See https://doi.org/10.1093/bioinformatics/btz615

(Sorry for the bump. I wanted to add some examples, plus a better description.)

python genomicranges • 3.2k views

What are the cliff-notes in terms of how this differs from something like https://github.com/vsbuffalo/BioRanges ?


BioRanges was never finished, and I have seen no timings for it. PyRanges should reach feature parity with GenomicRanges soon. The greatest difference is perhaps that I try to make a dinky, convenient wrapper around pandas dfs, so that all the good stuff from GenomicRanges can be used on dfs while numpy/scipy/pandas can still be used directly on the data.

Anyways, great q. Something I should update the docs/README with.
