Dear all,
I have several FASTA nucleotide files, each containing millions of identifiers and their sequences. Among them are some identifiers like this: ">AT1G01340|AT1G01340.2 Sequence unavailable". Could you please let me know how I can remove them? Looking forward to your helpful commands. Thanks.
Wouldn't this work for you? It would remove identifiers where the sequence is not available.
Save it with a .pl extension and run it on your file.
hth
Thanks, but it does not work at all.
Oh yes, I had thought that "Sequence unavailable" means that you have empty lines.
Anyway, David and Siva have already given you an answer. Nevertheless, you can edit the last line of the script above.
Instead of
(length $seq) > 1
write
($seq) !~ "Sequence"
It should work. Did you run it on your file?
perl script.pl input.fasta
Would this not print everything? "Sequence unavailable" would also satisfy the condition.
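To illustrate the point about the length check: the literal text "Sequence unavailable" is 20 characters long, so a test on sequence length keeps such a record, whereas a test on the text itself drops it. Below is a minimal Python sketch of that kind of filter. It is not the Perl script from this thread; the function names (read_fasta, drop_unavailable, filter_fasta_file) are made up for illustration, and it assumes the placeholder may appear either on the header line or as the sequence body.

```python
MARKER = "Sequence unavailable"  # placeholder text, as quoted in the question


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def drop_unavailable(records, marker=MARKER):
    """Skip records carrying the marker in header or body.

    Records with an empty sequence are skipped too; that part is an
    assumption, mirroring the intent of the original length check.
    """
    for header, seq in records:
        if marker in header or marker in seq:
            continue
        if not seq:
            continue
        yield header, seq


def filter_fasta_file(in_path, out_path):
    """Stream in_path to out_path, one record at a time (hypothetical helper)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in drop_unavailable(read_fasta(src)):
            dst.write(header + "\n" + seq + "\n")
```

Records are read one at a time, so memory use stays small even with millions of entries. Note that each kept sequence is written back on a single line, so any line wrapping in the input is not preserved.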
Oh yes, I had thought that "Sequence unavailable" means that he has empty lines.
Yeah, it was not clear from the OP. I first thought that "Sequence unavailable" was part of the FASTA header until they posted an actual example.
OP asked this as a follow-up to another question I answered yesterday. It was midnight, so I postponed answering the follow-up question and woke up to a new question addressing this specific issue. I guess OP got a bit impatient and opened a new post without giving it the full context.
Oh! Yes. I remember. It's the post where you and Pierre were answering.
Dear seta, please be patient; it's very important in science.
And if you take the code snippets, please try to understand them rather than just using them. Learning by doing, you know? ;)
Yeah, I don't mind though - less work for me :)