Alright, I think I've understood what you're after. Try this (much slower, but memory-friendly) approach:
import pandas as pd
# DataFrame representing your "gene_df"
gene_df = pd.DataFrame({
    "start": [11, 41],
    "end": [19, 49],
    "chrm": [1, 1],
    "strand": ["-", "+"]
})
# DataFrame representing your "df"
df = pd.DataFrame({
    "start": [10, 40, 100],
    "end": [20, 50, 150],
    "chrm": [1, 1, 2],
    "gene_name": ["gene1", "gene2", "gene3"],
    "gene_id": ["g1", "g2", "g3"]
})
# Lookup regions from "gene_df" in "df", one row at a time (to save memory)
all_results = [] # A temporary container to store our results
for index, row in gene_df.iterrows():
    # Look up the region and store the result as a DataFrame.
    # Note: I didn't understand what you meant by overlap, so this searches for
    # regions in df that fully contain the regions in gene_df. If you're looking
    # for the opposite, just reverse the <= to >= and >= to <=.
    # .copy() gives an independent DataFrame, so the column assignments below
    # don't raise a SettingWithCopyWarning.
    results_df = df.loc[(df["chrm"] == row["chrm"]) & (df["start"] <= row["start"]) & (df["end"] >= row["end"])].copy()
    # UPDATE: Add the coordinates from the gene file to the result DataFrame
    results_df["gene_df_start"] = row["start"]
    results_df["gene_df_end"] = row["end"]
    # Store the result in our container
    all_results.append(results_df)
# When done with all rows, gather all results into a single DataFrame
finished_df = pd.concat(all_results)
print(finished_df)
Try it with only a few rows of your "gene_df" first, say 10 or so (coordinates that you know exist in your "df").
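For example, you can limit the loop to the first 10 rows with head() while testing (the 10 is arbitrary):
# Sanity check: run the lookup over just the first 10 rows of gene_df
for index, row in gene_df.head(10).iterrows():
    ...  # same loop body as above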
Edit: finished_df now looks like this:
chrm end gene_id gene_name start
0 1 20 g1 gene1 10
1 1 50 g2 gene2 40
Edit2: finished_df is now:
chrm end gene_id gene_name start gene_df_start gene_df_end
0 1 20 g1 gene1 10 11 19
1 1 50 g2 gene2 40 41 49
How big are new_df and df? Is it possible you're actually running out of memory?
I am assigning new_df from df (the query coordinates) and gene_df (which has coordinates plus gene_id and gene_name from BioMart); df has ~300,000 rows.
Hmm, 300K rows sounds manageable within any modern computer's memory. Also, it would be helpful if you could monitor your memory usage to see whether that's actually the issue; if it is, we'll have to come up with another implementation. I'm wondering about your slicing, however. Including both new_df and df in your second line is unnecessarily memory-consuming, as new_df already includes everything from df and gene_df after the outer join. Try this instead:
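(Presumably something along these lines: a sketch assuming new_df is the outer merge of df and gene_df with pandas' default _x/_y column suffixes, matching the line quoted in a later comment.)
# Filter using only new_df's own columns; df itself is no longer needed here
new_df = new_df.loc[(new_df["start_x"] >= new_df["start_y"]) & (new_df["end_x"] <= new_df["end_y"])]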
@jonasmst it did not work! :( I do not think this should be so complicated, or maybe I should quit using pandas for this operation.
I'll need you to elaborate on what is going wrong. Your title says it's a memory issue, and your code snippet has a comment on a line that consumes memory. All code consumes memory; does that particular line consume too much? How do you know? Do you get an error message? How much RAM do you have? If you're running out of memory, there's no need to continue with this pandas approach; you'd either have to purchase more memory or use a generator or something (alternatively, look into something like dask). Pandas uses vectorized operations, which are super fast but require a lot of memory.
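If it does turn out to be memory, one generator-style fallback is to run the row-by-row lookup from the answer above in chunks, appending each batch to a file so the full result never sits in RAM at once. A sketch, with an arbitrary chunk_size and a hypothetical output file results.csv:
import pandas as pd

chunk_size = 1000  # arbitrary; tune to your memory budget
for chunk_start in range(0, len(gene_df), chunk_size):
    chunk = gene_df.iloc[chunk_start:chunk_start + chunk_size]
    batch = []
    for index, row in chunk.iterrows():
        hits = df.loc[(df["chrm"] == row["chrm"]) &
                      (df["start"] <= row["start"]) &
                      (df["end"] >= row["end"])].copy()
        hits["gene_df_start"] = row["start"]
        hits["gene_df_end"] = row["end"]
        batch.append(hits)
    if batch:
        # Append this batch to disk; write the header only for the first chunk
        pd.concat(batch).to_csv("results.csv", mode="a",
                                header=(chunk_start == 0), index=False)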
Hey @jonasmst, I am trying to write a single line, something similar to bedtools intersect, using pandas: if my coordinates fall within a certain region, get the gene ID and gene name. I am using pandas because the code upstream and downstream was written in pandas by an earlier person. The issue is certainly this line:
new_df = new_df.loc[(new_df["start_x"] >= new_df["start_y"]) & (new_df["end_x"] <= new_df["end_y"])]
because I have many rows, it all gets stored in memory. I cannot give you the exact error, as my system just freezes and I have to restart it. If you could suggest how you would have approached this problem, I will build on it. For example, given coordinates, I would like to get the Ensembl gene ID and name (~200,000 rows); these are splice junction coordinates from the junctions.bed file (TopHat2).
Thanks.