Entering edit mode
6.8 years ago
Mostafa Mahmoud
•
0
hi i made py script to split uniq and identical seq in fasta file but it actully too slow it finish 100mb file in 3 days and my data is about 1gb 18000000 seq if there is faster solution.
f=open('example2.fas','r')
f1 = open(r'identical.fasta','w')
f2 = open(r'uniq.fasta','w')
seq1=[]
ID=[]
i=0
for line in f.readlines():
i+=1
if i%2 != 0:
ID.append(line)
else:
seq1.append(line)
for i in range(0,len(seq1)):
if seq1.count(seq1[i])>=2:
f1.write(ID[i]+seq1[i])
else:
f2.write(ID[i]+seq1[i])
f.close()
f1.close()
f2.close()
You are storing everything in memory, do except if you have a lot of RAM I think runtime would not be your biggest problem for the full dataset.
i dont quit understand i am beginner in py and i run thise script on HPC 125000gb ram and 46 core. could you give me and example or any kind of tool do the same jop.
I think your approach is naive but not too bad. Given that you have 125000gb RAM you might have enough.
A few suggestions:
f.readlines()
. Just usefor line in f
. The file itself is an iterator, no need to read the full file in memory firststill the same problem also i dont know how to hash the seq.
If only we had method to search the internet, googling "python hash a string" gave me https://www.pythoncentral.io/hashing-strings-with-python/
Without hashing you could still modify seq1 to only contain the duplicate elements, that will make searching much faster.
what if i convert the data into tabular ID $1 and seq $2 there is possibility to do the job awk or grep
Do you have a multi-line or single line fasta? What is the output of
grep -c '>' example2.fas
?full dataset 18255486
what if i convert the data into tabular ID $1 and seq $2 there is possibility to do the job awk or grep
awk
orgrep
will also work with FASTA. It is not the format of your data, it is the quantity, and efficiency of your script. Do you know the average length of your sequences? Again, do you have a multi-line or single-line FASTA? This will help in the efficacy of reading the file.You need a much more sophisticated approach to reduce run time. But I would first check whether you are really looking for identical sequences, such that one base error would be a new sequence or are you looking for similarity only.