How to iterate over rows of a pyranges object

I'd like to apply a function to each row of a pyranges object and, for each row, return that row's region together with another region (a so-called "hit") that meets some criterion, whatever that criterion is.

For example, I start with a pyranges object called A, read in from a BED file called A.bed:

A = pyranges.read_bed("A.bed")

I merge it into a second object called M.

M = A.merge()

I'd like to apply a function to each row of M that intersects back with A.

For instance, I want to get the maximum-scoring interval in A, among those intervals in A which fall in a merged row in M. (Intervals in A are disjoint.)
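
To make the goal concrete, here is a tiny made-up example (all interval values are hypothetical; the first two intervals overlap here just so that merge() has something to do):

import pyranges as pr

# hypothetical stand-in for A.bed; Score is the BED score column
A = pr.from_dict({
    "Chromosome": ["chr1", "chr1", "chr1"],
    "Start": [1, 5, 100],
    "End": [10, 20, 150],
    "Name": ["a", "b", "c"],
    "Score": [5, 9, 2],
})

M = A.merge()  # chr1:1-20 and chr1:100-150

# desired result: for chr1:1-20, report interval "b" (score 9);
# for chr1:100-150, report interval "c" (score 2)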

Here is one approach I tried, which is very slow:

#!/usr/bin/env python

import sys
import pyranges as pr
import pandas as pd
import numpy as np

in_fn = sys.argv[1]

bed_data = pr.read_bed(in_fn)
bed_merged = bed_data.merge()

def max_scoring_element_over_merged_elements(df):
    new_rows = []
    for i in range(len(df)):
        # take one merged interval as a one-row DataFrame (dtypes preserved),
        # then wrap it back up as a pyranges object
        pd_row = df.iloc[[i]].reset_index(drop=True)
        pr_row = pr.PyRanges(pd_row)
        # intervals of the original input falling within this merged interval
        candidates = bed_data.intersect(pr_row)
        max_score = np.max(candidates.Score)
        hit = candidates[candidates.Score == max_score].head(1)  # grab one row, in case of ties
        hit_df = hit.df.reset_index(drop=True)
        hit_df.columns = ["Hit{}".format(x) for x in hit_df.columns]
        new_rows.append(pd.concat([pd_row, hit_df], axis=1))
    # DataFrame.append was removed in pandas 2.x; build the result with one concat
    return pd.concat(new_rows, ignore_index=True)

res = bed_merged.apply(max_scoring_element_over_merged_elements)
print(res)

This works on small files, but on anything like a typical input, this takes a very long time to run.

Using bedmap/bedops, for instance, this literally takes a fraction of a second to run on the command line:

$ bedmap --echo --max-element <(bedops --merge A.bed) A.bed > answer.bed

But the Python script above pegs a CPU at 100% and had run for more than ten minutes when I canceled it.

Is there a Pythonic/pyrange-ic/idiomatic way to efficiently do per-row operations with pyranges? Thanks!

I think I found a better way to do this operation:

#!/usr/bin/env python

import sys
import pyranges as pr

in_fn = sys.argv[1]
out_fn = sys.argv[2]

bed_data = pr.read_bed(in_fn)
bed_merged = bed_data.merge()
join_merged_to_all = bed_merged.join(bed_data)

def max_scoring_element(df):
    # within each merged interval (the Chromosome/Start/End/Strand columns
    # from the merged side of the join), keep only the highest-scoring row
    return df \
        .sort_values('Score', ascending=False) \
        .drop_duplicates(['Chromosome', 'Start', 'End', 'Strand'], keep='first') \
        .sort_index() \
        .reset_index(drop=True)

# res is a pyranges object
res = join_merged_to_all.apply(max_scoring_element)

# res.df is a pandas dataframe
res.df.to_csv(out_fn, sep='\t', index=False)

This is still a bit slower than bedmap/bedops, and more verbose and less maintainable, but being able to run these operations within Python, without shelling out via subprocess, may be an acceptable tradeoff.

Interesting problem, and thanks for sharing a solution. Which part of the max_scoring_element_over_merged_elements() function above do you think took the longest to compute? Is it the candidates = bed_data.intersect(pr_row) part?

That intersect call probably makes the whole thing an O(n^2) operation, as opposed to two O(n log n) sorts. Converting each dataframe row back into a small pyranges object likely adds a fair bit of runtime too, as does using append to build a new dataframe row by row.

The solution I posted has its own issues. Sorting twice is not great, and for statistical testing it would be useful to pick among tied-score elements uniformly at random, say, rather than always taking the first element (whatever that happens to be). Dropping duplicates probably also uses a fair amount of memory to keep a second copy of the merged intervals alongside the max-scoring-element association. I'm not too familiar with pandas performance tricks, so any thoughts on how to improve this would be welcome.
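
One possible refinement (an untested sketch, assuming the joined frame has the same Chromosome/Start/End/Strand and Score columns used above): replace the two sorts with a single groupby over the merged-interval coordinates and take idxmax on Score. Shuffling the rows first makes ties break uniformly at random. This would be a drop-in replacement for the max_scoring_element above:

def max_scoring_element(df):
    # shuffle so that ties on Score are broken at random rather than by position
    df = df.sample(frac=1)
    # for each merged interval, the index label of its maximum-scoring row
    idx = df.groupby(['Chromosome', 'Start', 'End', 'Strand'], observed=True)['Score'].idxmax()
    # recover the winning rows and restore the original row order
    return df.loc[idx].sort_index().reset_index(drop=True)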

First, consider whether you really need to iterate over rows in a DataFrame at all. Iterating through pandas DataFrame objects is generally slow and defeats the purpose of using a DataFrame; it is an anti-pattern to fall back on only when you have exhausted every other option. Prefer a list comprehension, a vectorized solution, or the DataFrame.apply() method.

Pandas DataFrame loop using list comprehension

result = [(x, y, z) for x, y, z in zip(df['Name'], df['Promoted'], df['Grade'])]

Pandas DataFrame loop using DataFrame.apply()

result = df.apply(lambda row: row["Name"] + ", " + str(row["TotalMarks"]) + ", " + row["Grade"], axis=1)
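
For completeness, the same concatenation can also be written fully vectorized (using the same hypothetical columns), which avoids Python-level iteration entirely:

# string concatenation broadcasts across whole columns at once
result = df["Name"] + ", " + df["TotalMarks"].astype(str) + ", " + df["Grade"]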
