I have been trying to transpose my FinalReport table of 2,000,000+ rows and 300+ columns on a cluster, but my Python script keeps getting killed because it runs out of memory. Does anyone have suggestions for a more memory-efficient way to store the table data than the list of lists used in my code below?
import sys

Seperator = "\t"
m = []

# Read the whole file into memory, then build a list of lists of fields.
# This is the part that exhausts memory on the cluster.
with open(sys.argv[1], 'r') as f:
    lines = f.read().split("\n")[:-1]
for line in lines:
    m.append(line.strip().split("\t"))

# zip(*m) yields the columns of m, i.e. the rows of the transpose.
for i in zip(*m):
    for j in range(len(i)):
        if j != len(i) - 1:
            print(i[j] + Seperator, end="")
        else:
            print(i[j])
I'm not able to finish building the list; the script always gets killed towards the end of the m.append step. Out of the 2,379,856 lines, the furthest I've gotten is 2,321,894. I got these numbers by printing a running count after the append call.
Thanks very much in advance!
Do you need to store it in memory? Can you not use the transpose function in numpy (assuming you're using numpy)?
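For instance, something along these lines might work (just a sketch; it assumes the whole table can still be held in memory once loaded, and the input/output paths are placeholders):

import sys
import numpy as np

# Load the tab-separated table as strings, so alphanumeric fields are fine.
table = np.loadtxt(sys.argv[1], dtype=str, delimiter="\t")

# .T is a view; the transpose itself allocates no extra memory.
np.savetxt(sys.argv[2], table.T, fmt="%s", delimiter="\t")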
Oh, I'm not using numpy because my data is alphanumeric.
It's still faster and more memory efficient in numpy. Alternatively, if the file is much larger than the RAM you have available, do multiple passes over the file so you just process a column (or a few) at a time.
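A rough sketch of that multi-pass idea in plain Python (COLS_PER_PASS is a made-up knob to tune against your available RAM, and the output path is a placeholder):

import sys

SEP = "\t"
COLS_PER_PASS = 50          # hypothetical chunk size; tune to your RAM

in_path, out_path = sys.argv[1], sys.argv[2]

# Count the columns from the first line.
with open(in_path) as f:
    ncols = len(f.readline().rstrip("\n").split(SEP))

with open(out_path, "w") as out:
    for start in range(0, ncols, COLS_PER_PASS):
        stop = min(start + COLS_PER_PASS, ncols)
        # One pass over the file collects only this block of columns.
        cols = [[] for _ in range(stop - start)]
        with open(in_path) as f:
            for line in f:
                fields = line.rstrip("\n").split(SEP)
                for k in range(start, stop):
                    cols[k - start].append(fields[k])
        # Each collected column becomes one row of the transposed output.
        for col in cols:
            out.write(SEP.join(col) + "\n")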
Have you tried csvtk's transpose (here)? The tool works well on my tables, but they aren't that big.