Question

split seq in fasta file in groups according to seq similarity?

0

Entering edit mode

6.9 years ago

Mostafa Mahmoud • 0

hi i made py script to split uniq and identical seq in fasta file but it actully too slow it finish 100mb file in 3 days and my data is about 1gb 18000000 seq if there is faster solution.

f=open('example2.fas','r')
f1 = open(r'identical.fasta','w')
f2 = open(r'uniq.fasta','w')

seq1=[]
ID=[]

i=0
for line in f.readlines():
    i+=1
    if i%2 != 0:
        ID.append(line)
    else:
        seq1.append(line)

for i in range(0,len(seq1)):
    if seq1.count(seq1[i])>=2:
        f1.write(ID[i]+seq1[i])
    else:
        f2.write(ID[i]+seq1[i])

f.close()
f1.close()
f2.close()

python • 2.3k views

ADD COMMENT • link updated 4.6 years ago by Biostar 20 • written 6.9 years ago by Mostafa Mahmoud • 0

0

Entering edit mode

You are storing everything in memory, do except if you have a lot of RAM I think runtime would not be your biggest problem for the full dataset.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

Entering edit mode

i dont quit understand i am beginner in py and i run thise script on HPC 125000gb ram and 46 core. could you give me and example or any kind of tool do the same jop.

ADD REPLY • link 6.9 years ago by Mostafa Mahmoud • 0

0

Entering edit mode

I think your approach is naive but not too bad. Given that you have 125000gb RAM you might have enough.

A few suggestions:

don't use f.readlines(). Just use for line in f. The file itself is an iterator, no need to read the full file in memory first
I would try to use hashing of the sequences you get. In a first iteration, hash all sequences. Then, search all hashes which are duplicates, only keep those. In the second iteration, check if the sequence is a duplicate yes or no and write to the appropriate file. That might keep memory usage low, and I'd expect it to be faster although I have never done anything like that. I also don't have an appropriate dataset to test on.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

Entering edit mode

still the same problem also i dont know how to hash the seq.

ADD REPLY • link 6.9 years ago by Mostafa Mahmoud • 0

0

Entering edit mode

If only we had method to search the internet, googling "python hash a string" gave me https://www.pythoncentral.io/hashing-strings-with-python/

Without hashing you could still modify seq1 to only contain the duplicate elements, that will make searching much faster.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

Entering edit mode

what if i convert the data into tabular ID $1 and seq $2 there is possibility to do the job awk or grep

ADD REPLY • link 6.9 years ago by Mostafa Mahmoud • 0

0

Entering edit mode

Do you have a multi-line or single line fasta? What is the output of grep -c '>' example2.fas?

ADD REPLY • link 6.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

full dataset 18255486

ADD REPLY • link 6.9 years ago by Mostafa Mahmoud • 0

0

Entering edit mode

what if i convert the data into tabular ID $1 and seq $2 there is possibility to do the job awk or grep

ADD REPLY • link 6.9 years ago by Mostafa Mahmoud • 0

0

Entering edit mode

awk or grep will also work with FASTA. It is not the format of your data, it is the quantity, and efficiency of your script. Do you know the average length of your sequences? Again, do you have a multi-line or single-line FASTA? This will help in the efficacy of reading the file.

ADD REPLY • link 6.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

You need a much more sophisticated approach to reduce run time. But I would first check whether you are really looking for identical sequences, such that one base error would be a new sequence or are you looking for similarity only.

ADD REPLY • link 6.9 years ago by jomo018 ▴ 730