Question

Pyranges: operate on just metadata columns


3.8 years ago

Bosberg ▴ 50

In GRanges (R's GenomicRanges) there is a function, mcols, that accesses the metadata columns, i.e. the columns describing the quantities under observation, as opposed to the canonical position columns (like "chrom", "start", "end" and, optionally, "strand"). I'd like to access these columns in pyranges as well. For concreteness, say I have the following pyranges object:

>>> test_pr
+--------------+-----------+-----------+-------------+-------------+------------+
| Chromosome   |     Start |       End |    sample_1 |    sample_2 |   sample_3 |
| (category)   |   (int32) |   (int32) |   (float64) |   (float64) |  (float64) |
|--------------+-----------+-----------+-------------+-------------+------------|
| chr1         |     10468 |     10470 |     nan     |    0.1234   |       nan  |
| chr1         |     10470 |     10472 |     0.714   |     0.8     |       0.12 |
| chr1         |     10483 |     10485 |     nan     |     0.6     |       0.13 |
| chr1         |     10488 |     10490 |     0.941   |     0.8     |       0.15 |
+--------------+-----------+-----------+-------------+-------------+------------+

And my task is to take averages at each position across samples. Hence, I'd like to access just the last 3 columns and obtain the following:

>>> output_pr    
Chromosome      Start   End     Average
chr1    10468   10470   0.1234
chr1    10470   10472   0.544667
chr1    10483   10485   0.365
chr1    10488   10490   0.630333

So I want something like test_pr.mean(skipna=True). But if I try to select the last three columns with iloc, for example, I get this:

>>> test_pr.iloc[:,[3,4,5]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../pyranges.py", line 270, in __getattr__
    return _getattr(self, name)
  File ".../pyranges/methods/attr.py", line 66, in _getattr
    raise AttributeError("PyRanges object has no attribute", name)
AttributeError: ('PyRanges object has no attribute', 'iloc')

So I understand that PyRanges objects are not standard pandas dataframes and that I can't use iloc on them, but I'm not sure how else to manipulate the metadata columns, collect statistics, etc. I think solving the example above would make it clear how to work with this type of data structure in general.

pyranges pandas column selection • 1.9k views



3.8 years ago

Bosberg ▴ 50

Pyranges objects are dictionaries, where the keys are chromosomes (or tuples of chromosome & strand if a stranded object) and the values are pandas dataframes ...

That was an excellently succinct and helpful first sentence that immediately cleared a lot of things up; Thank you!

By the way, the developer graciously sent me the following solution by email (many thanks to Endre for this):

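import numpy as np
import pandas as pd
import pyranges as pr

# Random stranded intervals plus three random numeric metadata columns
gr = pr.random()
nums = pd.DataFrame(np.random.rand(len(gr), 3))
gr2 = pr.PyRanges(pd.concat([gr.df, nums], axis=1))

def average(df):
    # Move the position columns into the index so that only metadata remains
    df = df.set_index(["Chromosome", "Start", "End", "Strand"])
    # Row-wise mean across the metadata columns (NaNs are skipped by default)
    mean = df.mean(axis=1)
    mean.index = df.index
    mean.name = "Average"
    # Restore the position columns next to the new "Average" column
    return mean.reset_index()

# Run average() on each underlying dataframe
gr2.apply(average)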



No worries. Thanks for sharing the solution (and thanks to Endre). That's a really clean approach that takes advantage of pandas's built-in methods, and it will come in handy for me too!

ADD REPLY • link 3.8 years ago by samuelbrycesmith ▴ 30

score 3 · Accepted Answer · 2021-02-19

Pyranges objects are dictionaries, where the keys are chromosomes (or tuples of chromosome & strand if a stranded object) and the values are pandas dataframes storing the intervals on that chromosome/chromosome & strand pair.

import pyranges as pr
import pandas as pd

test_pr = pr.from_dict({"Chromosome": ["chr1", "chr1", "chr1", "chr1"],
                        "Start": [10468, 10470, 10483, 10485],
                        "End": [10470, 10472, 10485, 10490],
                        "sample_1": [None, 0.714, None, 0.941],
                        "sample_2": [0.1234, 0.8, 0.6, 0.8],
                        "sample_3": [None, 0.12, 0.13, 0.15]})

test_pr.dfs
    {'chr1':   Chromosome  Start    End  sample_1  sample_2  sample_3
    0       chr1  10468  10470       NaN    0.1234       NaN
    1       chr1  10470  10472     0.714    0.8000      0.12
    2       chr1  10483  10485       NaN    0.6000      0.13
    3       chr1  10485  10490     0.941    0.8000      0.15}

Any time you want to access the dataframes and operate on the metadata, you can use one of the PyRanges methods, such as assign (which inserts a new column) or apply. Each takes a function as input and applies it to every dataframe in the PyRanges object. Inside that function you can use ordinary pandas operations on the underlying dataframe.
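
To make the "function applied to every dataframe" idea concrete, here is a rough plain-pandas imitation of it. It is only a sketch of the concept, not how pyranges is implemented: apply_per_chromosome is a hypothetical helper, and the chr2 values are made up for illustration.

```python
import pandas as pd

nan = float("nan")

# A dictionary keyed by chromosome, one dataframe per key
# (the chr2 rows are invented, purely for illustration)
dfs = {
    "chr1": pd.DataFrame({"Chromosome": ["chr1", "chr1"],
                          "Start": [10468, 10470],
                          "End": [10470, 10472],
                          "sample_1": [nan, 0.714],
                          "sample_2": [0.1234, 0.8]}),
    "chr2": pd.DataFrame({"Chromosome": ["chr2"],
                          "Start": [500],
                          "End": [502],
                          "sample_1": [0.5],
                          "sample_2": [nan]}),
}

def apply_per_chromosome(dfs, func):
    """Hypothetical helper: call func on each per-chromosome dataframe."""
    return {key: func(df) for key, df in dfs.items()}

def count_observed(df):
    # Ordinary pandas: count the non-missing sample values in each row
    observed = df[["sample_1", "sample_2"]].notna().sum(axis=1)
    return df.assign(n_observed=observed)

result = apply_per_chromosome(dfs, count_observed)
print(result["chr1"])
```

The function you write only ever sees one ordinary dataframe at a time, which is why familiar pandas calls work inside it.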

Here are two ways to answer your question by adapting your .iloc approach (I'm not particularly savvy with pandas, so there may be more straightforward solutions).

First, make a new column with the PyRanges assign method and an anonymous function applied to each dataframe, then call the drop method to remove the sample_* columns:

(test_pr.assign("mean",
                lambda df: df.iloc[:, [3, 4, 5]].mean(axis=1))
        .drop(["sample_1", "sample_2", "sample_3"]))

Alternatively, create the new column and drop the sample_* columns inside the anonymous function itself, and use the PyRanges apply method to return the modified PyRanges object:

test_pr.apply(
    lambda df: (df.assign(mean=lambda d: d.iloc[:, [3, 4, 5]].mean(axis=1))
                  .drop(["sample_1", "sample_2", "sample_3"], axis=1))
)

Both approaches return your desired output (with the new column named mean rather than Average). Again, there may be leaner solutions out there, but both get the job done and get you under the hood to the dataframes.
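
One caveat with positional selection such as df.iloc[:, [3, 4, 5]] is that the positions shift if the object gains a Strand column or the column order changes. A minimal sketch of a name-based alternative follows, written as a plain pandas function so it can be tried without pyranges. The names CANONICAL and row_mean_of_metadata are hypothetical, and the list of canonical columns is an assumption you may need to adjust.

```python
import pandas as pd

# Assumption: these are the position columns; everything else is metadata
CANONICAL = ["Chromosome", "Start", "End", "Strand"]

def row_mean_of_metadata(df, name="Average"):
    """Replace all non-canonical columns with their row-wise mean (NaNs skipped)."""
    position_cols = [c for c in df.columns if c in CANONICAL]
    metadata_cols = [c for c in df.columns if c not in CANONICAL]
    out = df[position_cols].copy()
    out[name] = df[metadata_cols].mean(axis=1, skipna=True)
    return out

# One per-chromosome dataframe, matching the test_pr.dfs output shown earlier
nan = float("nan")
chr1 = pd.DataFrame({"Chromosome": ["chr1", "chr1", "chr1", "chr1"],
                     "Start": [10468, 10470, 10483, 10485],
                     "End": [10470, 10472, 10485, 10490],
                     "sample_1": [nan, 0.714, nan, 0.941],
                     "sample_2": [0.1234, 0.8, 0.6, 0.8],
                     "sample_3": [nan, 0.12, 0.13, 0.15]})

result = row_mean_of_metadata(chr1)
print(result)
```

Assuming the PyRanges apply method behaves as described in this answer, the same function should be usable as test_pr.apply(row_mean_of_metadata), since it returns a dataframe that still carries the position columns.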

Hope that helps :)