Question

Memory error in python

4.6 years ago

saber mohammadi ▴ 20

Hello, I want to extract human-HCV (hepatitis C virus) protein-protein interactions (PPIs). To do this, I downloaded the entire content of the IntAct database as a .txt file. This file is huge (4 GB). I tried to convert it to a CSV file with Python so that I could then extract just the human-HCV PPIs. Because of the file's size, I run into a memory error.

input:

import pandas as pd

read_file = pd.read_csv('intact.txt', delimiter='\t')
read_file.to_csv('intact.csv', index=None)

output: `MemoryError: Unable to allocate 162. MiB for an array with shape (41, 1035669) and data type object`

How should I solve this issue? I would sincerely appreciate your help.

Protein-Protein Interaction python memory error • 9.2k views

ADD COMMENT • link updated 3.6 years ago by linehammer ▴ 10 • written 4.6 years ago by saber mohammadi ▴ 20

import pandas as pd
read_file = pd.read_csv('intact.txt', delimiter='\t')
read_file.to_csv('intact.csv', index=None)

ADD REPLY • link 4.6 years ago by saber mohammadi ▴ 20

Did you try zero initialization?

read_file = np.zeros((41, 1035669))  # shape must be passed as a tuple; might require a data type...

Also, did you check the usual suspects on Stack Overflow? For example: https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type

ADD REPLY • link 4.6 years ago by Carambakaracho ★ 3.3k

No, I didn't try that. Sorry, I'm not an expert in Python. Should I put the code you mentioned before read_file = pd.read_csv('intact.txt', delimiter='\t')?

ADD REPLY • link 4.6 years ago by saber mohammadi ▴ 20

score 1 · Answer 1 · 2021-05-10

Memory errors happen a lot in Python when using the 32-bit Windows version. This is because a 32-bit process gets only 2 GB of memory to play with by default.

One way to address this error is the dtype option of pandas.read_csv(), which tells pandas what types exist inside your CSV data.

For example, specifying dtype={'age': int} as an option to read_csv() tells pandas that age should be interpreted as an integer. Declaring types up front can save a lot of memory, because pandas does not have to guess them or fall back to generic Python objects.

pd.read_csv('data.csv',dtype={'age':int})

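Or try the option below. Note that low_memory=False makes pandas parse each column in one pass instead of in internal chunks, which avoids mixed-type inference but does not by itself reduce memory use: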

pd.read_csv('data.csv',sep='\t',low_memory=False)
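
If declaring dtypes is not enough, pandas can also read the file in pieces through the chunksize option of read_csv(), so only one piece is in memory at a time. The following is a minimal sketch, not a tested recipe for IntAct. The function name filter_in_chunks, the column names "Taxid interactor A" and "Taxid interactor B", and the two search strings are assumptions for illustration; check them against the header and the values in your own file.

```python
import csv

import pandas as pd


def filter_in_chunks(src, dst,
                     needle_1="taxid:9606(",          # assumed marker for human
                     needle_2="Hepatitis C virus",    # assumed marker for HCV
                     col_a="Taxid interactor A",      # assumed column name
                     col_b="Taxid interactor B",      # assumed column name
                     chunksize=100_000):
    """Stream a tab-separated file and keep rows pairing needle_1 with needle_2.

    Returns the number of rows written to dst (a comma-separated file).
    """
    kept = 0
    # chunksize makes read_csv return an iterator of DataFrames.
    # dtype=str skips type inference; QUOTE_NONE treats quote characters
    # as ordinary text (assumption: the input has no CSV-style quoting).
    reader = pd.read_csv(src, sep="\t", dtype=str,
                         quoting=csv.QUOTE_NONE, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        a = chunk[col_a].fillna("")
        b = chunk[col_b].fillna("")
        a1 = a.str.contains(needle_1, regex=False)
        a2 = a.str.contains(needle_2, regex=False)
        b1 = b.str.contains(needle_1, regex=False)
        b2 = b.str.contains(needle_2, regex=False)
        # Accept the pair in either order (A/B or B/A).
        hits = chunk[(a1 & b2) | (a2 & b1)]
        # Write the header once, then append the later chunks.
        hits.to_csv(dst, mode="w" if i == 0 else "a",
                    header=(i == 0), index=False)
        kept += len(hits)
    return kept
```

A call such as filter_in_chunks('intact.txt', 'human_hcv.csv') would then write only the matching rows. Filtering while reading means the full 4 GB table never has to exist as one DataFrame.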

score 0 · Answer 2 · 2020-04-07

You don't need the entire file in memory, and you don't need pandas.

Just loop over the lines in the file, replacing tabs by commas. The following code is untested but should give you the general idea.

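# Open the output in write mode; both files are closed automatically.
with open("myinput.tsv") as infile, open("myoutput.csv", "w") as output:
    for line in infile:
        output.write(line.replace('\t', ','))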
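
One caveat with a plain tab-to-comma replacement is that any field that already contains a comma will be split into two columns in the result. A hedged variant of the same line-by-line idea uses Python's standard csv module, which quotes such fields on output. The function name tsv_to_csv and the optional keep callback are hypothetical names introduced here, and UTF-8 encoding is an assumption about the input file.

```python
import csv


def tsv_to_csv(src, dst, keep=None):
    """Convert a tab-separated file to CSV one row at a time.

    keep is an optional function that takes a row (a list of strings)
    and returns True for rows that should be written.
    Returns the number of data rows written (header excluded).
    """
    written = 0
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        # QUOTE_NONE: treat quote characters in the input as ordinary text
        # (assumption: the tab-separated input has no CSV-style quoting).
        reader = csv.reader(fin, delimiter="\t", quoting=csv.QUOTE_NONE)
        # The writer quotes fields containing commas or quotes as needed.
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)
        for row in reader:
            if keep is None or keep(row):
                writer.writerow(row)
                written += 1
    return written
```

Only one row is held in memory at a time, so the file size does not matter. Passing a keep function allows the conversion and the extraction of specific interactions to happen in a single pass.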

score 0 · Answer 3 · 2020-04-07

4.6 years ago

Mensur Dlakic ★ 28k

You don't need Pandas for this. Or Python. Or Perl, even though one of my suggestions below uses it.

Copy the file:

cp intact.txt intact.csv

Replace tabs with commas:

perl -pi -e 's/\t/\,/g' intact.csv

or

sed -i 's/\t/\,/g' intact.csv

ADD COMMENT • link 4.6 years ago by Mensur Dlakic ★ 28k