Question

genomic range overlap using pandas memory issue

0

Entering edit mode

7.6 years ago

badribio ▴ 290

I am trying to use the exact same code to perform some task, but my system freezes I guess because it cannot handle this operation, any easier way to optimize this using python pandas or just python?

http://stackoverflow.com/questions/17457142/merging-files-based-on-column-coordinates-of-two-files-in-python

new_df = df.merge(gene_df, how='outer', on ='chrm')
new_df = new_df[(new_df.start_x>=df.start_y) & (df.end_x<=df.end_y)] ##this line consumes memory
print (new_df.head(10))

I have posted this on SO as well..

pandas python gene • 4.5k views

ADD COMMENT • link updated 7.2 years ago by Biostar 20 • written 7.6 years ago by badribio ▴ 290

0

Entering edit mode

How big are new_df and df? Is it possible you're actually running out of memory?

ADD REPLY • link 7.6 years ago by jonasmst ▴ 410

0

Entering edit mode

i am assigning new_df to df (query co ordinates) and gene_df(which has co ordinates and gene_id and name from biomart) df has ~300000 rows

ADD REPLY • link 7.6 years ago by badribio ▴ 290

1

Entering edit mode

Hmm, 300K rows sounds manageable within any modern computer's memory. I'm wondering about your slicing, however. Including both new_df and df in your second line is unnecessarily memory consuming, as new_df includes everything from df and gene_df from the outer join. Try this instead:

new_df = new_df.loc[(new_df["start_x"] >= new_df["start_y"]) & (new_df["end_x"] <= new_df["end_y"])]

Also, it would be helpful if you could monitor your memory to see if that's the issue, then we'll have to come up with another implementation.

ADD REPLY • link 7.6 years ago by jonasmst ▴ 410

0

Entering edit mode

@joansmst it did not work!! :( I do not think this should be so complicated or may be I should quite using pandas for this operation.

ADD REPLY • link 7.6 years ago by badribio ▴ 290

0

Entering edit mode

I'll need you to elaborate on what is going wrong. Your title says it's a memory issue and your code snippet comments a line that consumes memory. All code consumes memory, does that particular line consume too much memory? How do you know? Do you get an error message? How much RAM do you have? If you're running out of memory, there's no need to continue with this Pandas approach; you'd either have to purchase more memory or use a generator or something (alternatively look into something like dask). Pandas uses vectorized operations, which are super fast but require a lot of memory.

ADD REPLY • link 7.6 years ago by jonasmst ▴ 410

0

Entering edit mode

hey @jonasmst I am trying to write a single line something similar to bedtools intersect using pandas if my co ordinates fall with in a certain region get the gene id and gene name, I using pandas because up stream and downstream the code is written in pandas by earlier person. the issue is certainly the code new_df = new_df.loc[(new_df["start_x"] >= new_df["start_y"]) & (new_df["end_x"] <= new_df["end_y"])] because i have many rows it stores it in the memory. I cannot give you exact error as my system just freezes have to restart it.

If you could suggest a way you would have approach this problem i will build up on it.

for example given co ordinates i would like to get eng gene id and name (~rows 200000) these are splice junction co ordinates from junctions.bed file (tophat2)

  1 4247268 4247956 -
  1 4427432 4432686 +
  1 4763333 4764422 -
  1 4764498 4767704 -
  1 4764523 4766589 -
  1 4766783 4767704 -
  1 4767630 4772747 -
  1 4767682 4768880 -
  1 4767687 4775711 -
  1 4772284 4772747 -
  1 4772715 4774130 -
  1 4772715 4774184 -

thanks.

ADD REPLY • link 7.6 years ago by badribio ▴ 290

score 3 · Accepted Answer · 2017-05-11

3

Entering edit mode

7.6 years ago

jonasmst ▴ 410

Allright, I think I've understood what you're after. Try this (much slower) approach:

import pandas as pd

# DataFrame representing your "gene_df"
gene_df = pd.DataFrame({
    "start": [11, 41],
    "end": [19, 49],
    "chrm": [1, 1],
    "strand": ["-", "+"]
})

# DataFrame representing your "df"
df = pd.DataFrame({
    "start": [10, 40, 100],
    "end": [20, 50, 150],
    "chrm": [1, 1, 2],
    "gene_name": ["gene1", "gene2", "gene3"],
    "gene_id": ["g1", "g2", "g3"]
})

# Lookup regions from "gene_df" in "df", one row at a time (to save memory)
all_results = []  # A temporary container to store our results
for index, row in gene_df.iterrows():
    # Lookup region and store result as DataFrame
    # Note: I didn't understand what you meant by overlap, so this searches for
    # regions in df that overlap regions in gene_df. If you're looking for the opposite,
    # just reverse the <= to >= and >= to <=
    results_df = df.loc[(df["chrm"] == row["chrm"]) & (df["start"] <= row["start"]) & (df["end"] >= row["end"])]
    # UPDATE: Add coordinates from gene file to the result DataFrame
    results_df["gene_df_start"] = row["start"]
    results_df["gene_df_end"] = row["end"]
    # Store results in our container
    all_results.append(results_df)

# When done with all rows, gather all results into a single DataFrame
finished_df = pd.concat(all_results)

print finished_df

Try with only a few lines in your "gene_df" first, like 10 or so (that you know exists in your "df").

Edit: finished_df now looks like this:

   chrm  end gene_id gene_name  start
0     1   20      g1     gene1     10
1     1   50      g2     gene2     40

Edit2: finished_df is now:

   chrm  end gene_id gene_name  start  gene_df_start  gene_df_end
0     1   20      g1     gene1     10             11           19
1     1   50      g2     gene2     40             41           49

ADD COMMENT • link 7.6 years ago by jonasmst ▴ 410

0

Entering edit mode

I am looking for regions in gene_df that fall with in the range of a co ordinates from ensemble gene start and end. and if it does then print the co ordinates from gene_df and gene_id and gene_name from df

ADD REPLY • link 7.6 years ago by badribio ▴ 290

1

Entering edit mode

~~Yes, that's what it does.~~ The finished_df contains all the fields from df (including gene_id and gene_name, if they're there) for regions that overlap the coordinates given in gene_df. Did you try it? Edit: Oh, my bad, it doesn't contain the coordinates from gene_df, I'll update the answer to include it. Edit2: Done, see updated answer.

ADD REPLY • link 7.6 years ago by jonasmst ▴ 410

0

Entering edit mode

Thanks, i did some mods too..but appreciate your help was able to build on based on your answer...pandas is so cool no more complex for if loops and conditions...

ADD REPLY • link 7.6 years ago by badribio ▴ 290

0

Entering edit mode

when I use it on my original data it some how does not work not sure why,

code i used:

#from biostars user: jonasmst
#C: genomic range overlap using pandas memory issue
df2 = pd.DataFrame(pd.read_csv("~/gene_annotations.txt", sep='\t'))
df2.strand.replace("-1", "-", inplace=True)
df2.strand.replace(1, "+", inplace=True)
print df2.head(5)

gene_df2 = pd.DataFrame({
    "start": [4247268, 4427432],
    "end": [4247956, 4432686],
    "chrom": [1, 1],
    "strand": ["-", "+"]
print (gene_df2.head(5))
print gene_df2.dtypes

all_results = []  # A temporary container to store our results
for index, row in gene_df2.iterrows():
    print index,row
    # just reverse the <= to >= and >= to <=
    results_df = df2.loc[(df2["chrom"] == row["chrom"]) & (df2["start"] >= row["start"]) & (df2["end"] <= row["end"])]
    # UPDATE: Add coordinates from gene file to the result DataFrame
    results_df["gene_df2_start"] = row["start"]
    results_df["gene_df2_end"] = row["end"]
    # Store results in our container
    all_results.append(results_df)
#for i in all_results:
 #   print i
# When done with all rows, gather all results into a single DataFrame
finished_df = pd.concat(all_results)

print finished_df

    `  chrom     start       end             gene_id    gene_name strand
    0    17  71223692  71274336  ENSMUSG00000085299      Gm16627      -
    1    17  18186448  18211184  ENSMUSG00000067978  Vmn2r-ps113      +
    2    11  84645863  84684319  ENSMUSG00000020530       Ggnbp2      -
    3     7  51097639  51106551  ENSMUSG00000074155         Klk5      +
    4    13  31711037  31712238  ENSMUSG00000087276      Gm11378      +
       chrom      end    start strand
    0      1  4247956  4247268      -
    1      1  4432686  4427432      +
    chrom      int64
    end        int64
    start      int64
    strand    object
    dtype: object
    0 chrom           1
    end       4247956
    start     4247268
    strand          -
    Name: 0, dtype: object
    1 chrom           1
    end       4432686
    start     4427432
    strand          +
    Name: 1, dtype: object
    Empty DataFrame
    Columns: [chrom, start, end, gene_id, gene_name, strand, gene_df2_start, gene_df2_end]
    Index: []`

ADD REPLY • link 7.6 years ago by badribio ▴ 290

0

Entering edit mode

Are there regions in your gene_annotations.txt that span the coordinates in gene_df2? The empty DataFrame in your output suggest there are none. But I don't know a lot about your original file (only the first 5 rows), so it's hard for me to tell if it's a bug or if it works as intended.

ADD REPLY • link 7.6 years ago by jonasmst ▴ 410

score 2 · Accepted Answer · 2017-05-11

2

Entering edit mode

7.6 years ago

Alex Reynolds 36k

One way to efficiently solve this is with BEDOPS bedmap:

$ bedmap --echo --echo-map-id --echo-map-score --delim '\t' snp.bed gene.bed > output.txt

It is written to work with whole genome-scale inputs.

If you absolutely must do this with Python, you can use the subprocess library to run Unix processes (such as these).

ADD COMMENT • link 7.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Hello @Alex Reynolds, I am aware of bedmap, but I do not want to change the structure of the dataframe in other words if i use bedmap then I will have to write the output for each row again from previous file + bedmap output, hope it makes sense

ADD REPLY • link 7.6 years ago by badribio ▴ 290