Dear all,
I am trying to emove fasta sequences from a fasta file based on a txt file with ID's that need to be removed. The fasta file contains 44306447 reads (RNAseq data). I have tried to use the following script from another thread: How To Remove Certain Sequences From A Fasta File
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from Bio import SeqIO
fasta_file = sys.argv[1] # Input fasta file
remove_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file
remove = set()
with open(remove_file) as f:
for line in f:
line = line.strip()
if line != "":
remove.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
for seq in fasta_sequences:
**nam = seq.id
**nuc = str(seq.seq)
**if nam not in remove and len(nuc) > 0:
SeqIO.write([seq], f, "fasta")
But unfortunately using grep ">" does result in the exact same number of reads, so I am concluding there must be a mistake in the code somewhere but I cannot figure out where. Is someone able to help me out here to get a script running that removes reads from a fasta file based on a text file with headers and that can handle LARGE files?
Thanks!
In your remove file, do the sequence ids still have the fasta ">" prepended?
Why do you use
**name
,**nuc
,**if
in the last for loop? Is that meant to be there?Hello,
do you really need your own python script? There are several program out there, that do this task for your, e.g. seqkit
fin swimmer
While I agree with "don't reinvent the wheel", I also appreciate when people try and solve such things by their own - how else are you supposed to learn?
Yes, you're right. As I don't know why the OP likes to use python I asked for it. There might be several reasons. One is he/she likes to learn scripting, another is he/she doesn't know it already exists a tool for it. So I made a suggestion and one can choose to use it or not. Even if one like to learn scripting, it is good to know that there other tools that can do the job, so one could use it to validate it's own program.
Some comments on the code of the OP for learning purpose:
Why using a
set
? Can there be duplicates in the input file?This can be shorten, by reading the whole file at once and split by line.
There is no need to use
open()
here, BioPython do this for you.As other mention before, what are the
**
for? Also copy the id and seq into new variable can be ommited here. In my opinion this doesn't help for code readability (personal opinion). Can there be sequences with no sequence? That would be weired. I guess you can remove the check for it.To summarize the script could look like this:
fin swimmer
Sorry, I didn't want to sound harsh. I agree with that sentiment, OP might not know about available tools, so of course suggesting seqtk makes sense!
Concerning your remark about the use of a set. We don't expect duplicates, but using a list, as you suggest, requires linear search to check for each read name. In order to decrease that, using a set makes the most sense since it uses hashing. A dict would work as well, but then there's the memory overhead associated with that. A frozenset would probably be the best solution.
Didn't know that
set
using hashing. Thanks for pointing out!Thanks, all. This indeed solved the issue and it is running fine now!
And why using script, it's fun to get it running!
Hello T_18,
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Done, thanks.. Never did this before.
Which comment above resolved the issue? I can them move that to an
answer
so you can accept it.I used the fix of goodez. Although reading the script of finswimmer must solve the issue also (but did not test).
Hi,
I tried to run this script but it returns this error:
should the fasta file be of special format? mine starts like this:
dear mthm, you're posting a question as an answer to a 2.6y old post. You'd rather post your question as a separate question.
Spoiler: You might want to use your fasta file as input argument to your script ;-)
thanks I realized my mistake after posting it, but was too late:) I thought this issue is too minor to be posted separately, sorry.