I'd like to apply a function to each row of a pyranges object and, for each row, return that region together with another region (a so-called "hit") that meets some criterion, whatever that criterion happens to be.
For example, I start with a pyranges object called A, read in from a BED file called A.bed:
A = pyranges.read_bed("A.bed")
I merge it into another object called M:
M = A.merge()
I'd like to run/apply a function on each row of M which intersects back with A. For instance, I want to get the maximum-scoring interval in A among those intervals in A which fall within a merged row in M. (Intervals in A are disjoint.)
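To make the goal concrete, here is a tiny hypothetical illustration (toy coordinates and scores of my own; it assumes merge() joins bookended intervals, the way bedops --merge does):

import pyranges as pr

# Hypothetical toy input: three disjoint scored intervals on chr1.
A = pr.from_dict({
    "Chromosome": ["chr1", "chr1", "chr1"],
    "Start": [1, 4, 20],
    "End": [4, 9, 30],
    "Name": ["a", "b", "c"],
    "Score": [5, 7, 2],
})

M = A.merge()  # two merged rows: chr1:1-9 and chr1:20-30

# Desired result: each merged row paired with the max-scoring interval
# of A that falls inside it, i.e. chr1:1-9 -> "b" (score 7) and
# chr1:20-30 -> "c" (score 2).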
Here is one approach I tried, which is very slow:
#!/usr/bin/env python

import sys
import pyranges as pr
import pandas as pd
import numpy as np

in_fn = sys.argv[1]

bed_data = pr.read_bed(in_fn)
bed_merged = bed_data.merge()

def max_scoring_element_over_merged_elements(df):
    new_df = pd.DataFrame()
    for i in range(len(df)):
        # Wrap this single merged row back up as a one-row pyranges object
        pd_row = pd.DataFrame(df.iloc[i]).T.reset_index(drop=True)
        pr_row = pr.from_dict(pd_row)
        # Intersect the full input against that one merged interval
        candidates = bed_data.intersect(pr_row)
        max_score = np.max(candidates.Score)
        hit = candidates[candidates.Score == max_score].head(1)  # grab one row, in case of ties
        hit_df = hit.df
        hit_df.columns = ["Hit{}".format(x) for x in hit_df.columns]
        pd_row = pd.concat([pd_row, hit_df], axis=1)
        new_df = new_df.append(pd_row)
    return new_df

res = bed_merged.apply(max_scoring_element_over_merged_elements)
print(res)
This works on small files, but on anything like a typical input it takes a very long time to run.
Using bedmap/bedops, for instance, the same operation takes a fraction of a second on the command line:
$ bedmap --echo --max-element <(bedops --merge A.bed) A.bed > answer.bed
But the Python script above pegs the CPU at 100% and had been running for more than ten minutes when I canceled it.
Is there a Pythonic/pyrange-ic/idiomatic way to efficiently do per-row operations with pyranges? Thanks!
Interesting problem, and thanks for sharing a solution. What part of the max_scoring_element_over_merged_elements() function do you think took the longest to compute? Is it the candidates = bed_data.intersect(pr_row) part?

That intersect call probably makes it an O(n^2) operation overall, as opposed to two O(n log n) sorts. Converting each dataframe row back into a small pyranges object likely adds a fair bit of runtime too, as does using append to build a new dataframe row by row.

The solution I provide has its own issues. Sorting twice is not great, and for statistical testing it would be useful to pick among tied-score elements uniformly at random, rather than always taking the first element (whatever that happens to be). Dropping duplicates probably uses a lot of memory, since it keeps a second copy of the merged intervals along with the max-scoring element association. I'm not too familiar with pandas and tricks to get the best performance, so any thoughts on how to improve this would be welcome.
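Roughly, the gist of that solution looks like the following sketch (not the exact posted code; it assumes a standard BED input with a Score column, and it collapses the two sorts into one multi-key sort):

import pyranges as pr

bed_data = pr.read_bed("A.bed")
bed_merged = bed_data.merge()

# Pair every merged interval with the A intervals it overlaps;
# the Start/End columns coming from A are suffixed with "_b".
joined = bed_merged.join(bed_data).df

# Sort so the highest Score comes first within each merged interval,
# then keep only that top row (ties resolved by first occurrence).
joined = joined.sort_values(["Chromosome", "Start", "End", "Score"],
                            ascending=[True, True, True, False])
best = joined.drop_duplicates(subset=["Chromosome", "Start", "End"], keep="first")
print(best)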