Memory error in Python
4.6 years ago

Hello, I want to extract the human-HCV (hepatitis C virus) protein-protein interactions (PPIs). To do this, I downloaded the entire content of the IntAct database as a .txt file. The file is huge (4 GB). I tried to convert it to a CSV file with Python and then extract just the human-HCV PPIs. The problem is the size of the file: I run into a memory error.

input:

import pandas as pd

read_file = pd.read_csv('intact.txt', delimiter='\t')
read_file.to_csv('intact.csv', index=None)

output: `MemoryError: Unable to allocate 162. MiB for an array with shape (41, 1035669) and data type object`

How should I solve this issue? I would sincerely appreciate your help.

Protein-Protein Interaction python memory error • 9.2k views

Did you try zero initialization?

read_file = np.zeros((41, 1035669))  # shape must be a tuple; might require a data type...

Did you check the usual suspects on Stack Overflow, for example https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type ?


No, I didn't try that. Sorry, I'm not an expert in Python. Should I put the code you mentioned before read_file = pd.read_csv('intact.txt', delimiter='\t')?

3.6 years ago
linehammer ▴ 10

Memory errors happen a lot with Python when using the 32-bit Windows version. This is because a 32-bit process only gets 2 GB of memory to play with by default.

One way around this error is the dtype option of the pandas.read_csv() function. It lets pandas know what types exist inside your CSV data, so it does not have to guess them.

For example, specifying dtype={'age': int} as an option to .read_csv() lets pandas know that age should be interpreted as a number. This can save you a lot of memory.

pd.read_csv('data.csv',dtype={'age':int})

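Or try the option below. Note that low_memory=False turns off pandas' internal chunked parsing, so it avoids mixed-type columns rather than reducing memory use: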

pd.read_csv('data.csv',sep='\t',low_memory=False)
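If the file is still too big to load in one go, read_csv() also accepts a chunksize argument, which makes it return the file piece by piece instead of as one DataFrame. Below is a minimal sketch of that idea, not a tested IntAct recipe: the function name filter_in_chunks and all of its parameters are made up for illustration, and the column names and taxid tags you pass in are assumptions you should check against the header of your own file.

```python
import pandas as pd


def filter_in_chunks(src, dst, col_a, col_b, tag_a, tag_b, chunksize=100_000):
    """Stream a tab-separated file and keep rows that pair tag_a with tag_b.

    col_a / col_b are the two organism columns; tag_a / tag_b are plain
    substrings to look for (hypothetical values, supplied by the caller).
    Returns the number of rows written to dst.
    """
    first = True
    kept = 0
    # chunksize makes read_csv yield DataFrames of at most `chunksize` rows;
    # dtype=str skips type inference, so every column is read as text.
    for chunk in pd.read_csv(src, sep="\t", dtype=str, chunksize=chunksize):
        a = chunk[col_a].fillna("")
        b = chunk[col_b].fillna("")
        # Accept either orientation: (tag_a, tag_b) or (tag_b, tag_a).
        mask = (
            (a.str.contains(tag_a, regex=False) & b.str.contains(tag_b, regex=False))
            | (a.str.contains(tag_b, regex=False) & b.str.contains(tag_a, regex=False))
        )
        hits = chunk[mask]
        # Write the header once, then append the later chunks below it.
        hits.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
        kept += len(hits)
    return kept
```

For the question above, a call might look like filter_in_chunks('intact.txt', 'human_hcv.csv', 'Taxid interactor A', 'Taxid interactor B', 'taxid:9606(', 'taxid:11103('). The two column names and the HCV taxid are my assumptions about the file layout (9606 is human), and HCV isolates may appear under several taxids, so look at a few lines of the file first.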
4.6 years ago

You don't need the entire file in memory, and you don't need pandas.

Just loop over the lines in the file, replacing tabs by commas. The following code is untested but should give you the general idea.

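# Open the output in write mode ("w"); the default mode is read-only.
with open("myinput.tsv") as infile, open("myoutput.csv", "w") as output:
    for line in infile:
        output.write(line.replace('\t', ','))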
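One caveat with a plain tab-to-comma replacement: if a field already contains a comma, the resulting CSV gets an extra column on that row. A hedged variant of the same line-by-line idea uses the standard csv module, which quotes such fields on output. The function name tsv_to_csv is made up for this sketch, and it assumes the input is UTF-8 text with no tabs inside fields.

```python
import csv


def tsv_to_csv(src, dst):
    """Convert tab-separated src to comma-separated dst, one row at a time.

    Returns the number of rows written (header included).
    """
    rows = 0
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        # QUOTE_NONE: treat double quotes in the input as ordinary characters.
        reader = csv.reader(fin, delimiter="\t", quoting=csv.QUOTE_NONE)
        # The default writer quotes any field containing a comma or a quote.
        writer = csv.writer(fout)
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

Only one row is held in memory at a time, so the size of the input file should not matter. Usage would be something like tsv_to_csv('intact.txt', 'intact.csv').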
4.6 years ago
Mensur Dlakic ★ 28k

You don't need Pandas for this. Or Python. Or Perl, even though one of my suggestions below uses it.

Copy the file:

cp intact.txt intact.csv

Replace tabs with commas:

perl -pi -e 's/\t/\,/g' intact.csv

or

sed -i 's/\t/\,/g' intact.csv