Removing identifier with "unavailable sequence" from FASTA file
3
1
Entering edit mode
9.9 years ago
seta ★ 1.9k

Dear all,

I have several FASTA nucleotide file, containing millions identifiers and their sequence. Among them, there are some identifier like this, ">AT1G01340|AT1G01340.2 Sequence unavailable". Would you please let me know how I can remove them? looking forward to hearing your helpful commands. thanks

sequence genome blast RNA-Seq • 6.6k views
ADD COMMENT
0
Entering edit mode

wouldn't this work for you ? It would remove identifiers where sequence is not available

#!/usr/bin/perl
use strict;
use warnings;

$/="\n>";
while (<>) {
 s/>//g;
 my ($id, $seq) = split (/\n/, $_);
 print ">$_" if ((length $seq) > 1);
}

save it with .pl extension and run it on your file.

hth

ADD REPLY
0
Entering edit mode

Thanks, but it does not work at all.

ADD REPLY
1
Entering edit mode

Oh yes, I had thought that "Sequence unavailable" means that you have empty lines.

Anyways David and Siva have already given you answer, nevertheless, you can edit above script's last line

Instead of (length $seq) > 1, write ($seq) !~ "Sequence"

ADD REPLY
0
Entering edit mode

It should work. Did you run it with your file? (perl script.pl input.fasta)

ADD REPLY
0
Entering edit mode
print ">$_" if ((length $seq) > 1);

Would this not print everything? "Sequence unavailable" would also satisfy the condition.

ADD REPLY
0
Entering edit mode

Oh yes, I had thought that "Sequence unavailable" means that he has empty lines.

ADD REPLY
0
Entering edit mode

Yeah, it was not clear from the OP. I first thought that "Sequence unavailable" was part of the FASTA header until they posted an actual example.

ADD REPLY
0
Entering edit mode

OP asked this as a follow up to another question I answered yesterday. It was midnight so I postponed answering the follow up question, and woke up to a new question addressing this specific issue - I guess OP got a bit impatient and opened a new post without giving it the full context.

ADD REPLY
1
Entering edit mode

Oh! Yes. I remember. Its the post where you and Pierre were answering.

Dear seta, keep patience dear, its very important in science.

ADD REPLY
3
Entering edit mode

And if you take the code snippets, please try to understand them and don't just use them. Learning by doing, you know? ;)

ADD REPLY
1
Entering edit mode

Yeah, I don't mind though - less work for me :)

ADD REPLY
3
Entering edit mode
9.9 years ago
Manvendra Singh ★ 2.2k

I have modified the perl script which should work for your cause

#!/usr/bin/perl
use strict;
use warnings;

$/="\n>";

while (<>) {
 s/>//g;
  my ($id, $seq) = split (/\n/, $_);
  print ">$_" if ((length $seq) > 10 && $seq !~ "Sequence unavailable");
}

hth

ADD COMMENT
0
Entering edit mode

thanks a lot dear friend.

ADD REPLY
2
Entering edit mode
9.9 years ago

If there is no sequence below these headers, you could try something like this:

grep -v "Sequence unavailable" file.fasta > newFile.fasta

It will delete all headers containing "Sequence unavailable". But before you do that, double-check that there is no sequence under these headers!

ADD COMMENT
0
Entering edit mode

Oh! I provided above script where sequences is not available at all.

Does he have the "Sequence unavailable" as the text in fasta record or its just empty which he has named as sequence unavailable?

ADD REPLY
0
Entering edit mode

The "Sequence unavailable" exists as text, my mean is exactly like,">AT 1G01340|AT1G01340.

Sequence unavailable". thanks for your help me in advance.

ADD REPLY
0
Entering edit mode

Thanks, using this command just words "sequence unavailable" is disappeared, but its identifier is remained. would please help me to remove both of them from a fasta file?

ADD REPLY
0
Entering edit mode

Can you provide an example of your fasta file? Just... 5 entries with at least one being the "sequence unavailable"? :)

ADD REPLY
0
Entering edit mode

Does it look like 1, 2?

1.

>AT 1G01340|AT1G01340.
Sequence unavailable

2.

>AT 1G01340|AT1G01340. Sequence unavailable
ADD REPLY
0
Entering edit mode

Sure. that's it:

>AT1G07745|AT1G07745.2
GCCTCCTTGGCTATATATTCCACGCCCATCGTTTGGATGTGCAAGAAGAGAACCACATAA
ATGAATGATCAGATGGATTTCTTGTTTCGGTTTTACATGTAGAATTATACTGCAGCAATG
AAAATTAAAATTTGTTGAACAAATGAGTGGTGTGTATATAAGGATCACATGAATATAGAA
TCACATTTTAAAACTTTGCGCGCACACC
>AT1G08845|AT1G08845.2
Sequence unavailable
>AT1G11870|AT1G11870.1
TTTTTAATGGTTAGCTCTTTGCAATTCTTGCTGTGATTCCTCTTGTACTTGAGGTTTTGT
TCATTATACTTTTCACAGGCGCTGAGTTTTTTGTGTTCTTGTCTCAGAACATACCATTTA
TAGTTTATATCATCATAACCGATGTTCTTTGATTCAATAATAACGAAAGGGAAGTATACA
TT
>AT1G10670|AT1G10670.3
GCTTCTCTCAGTACCTTCTATGACCAAAACTTTGTCTGTGTTTTAGAGCCTTTATTTACT
GTGGTTAAGATTACTCAAGCTATAAGATACTTGCAATTTCTTGAAAACTTCTGTTGTTCG
ATTCTCTTTCCCCTAACGTTTTCTTCAGATTCAATAAATAATCGTTACTTTTTTTTCCCC
TCT

>AT1G08010|AT1G08010.2
Seqence unavailable
ADD REPLY
4
Entering edit mode

Try this:

grep -v "$(grep -B 1 "Sequence unavailable" file.fa)" file.fa
ADD REPLY
1
Entering edit mode

Many thanks David, it's great. It wil be great if you please share me how your perfect command is applied for removing short sequences( say less than 10 bp) with their identifiers?

ADD REPLY
0
Entering edit mode

Well, that's not possible with grep. Therefore, you might want to look into perl, or awk.

ADD REPLY
0
Entering edit mode

I'm new in this field and need more help. It's my lucky if you could please share me the useful script to this end (removing short sequences( say less than 10 bp) with their identifiers from fasta file)

ADD REPLY
2
Entering edit mode

Where would be the fun, if you don't try it yourself? Sorry, but this forum is not for getting free ready-to-use scripts. At least not from me.... don't be too angry with me. :)

ADD REPLY
0
Entering edit mode

Please accept my apologize for many questions. I'm learning all day and night, but as I never pass any course for even some basic concepts, I try to ask just my problems from your experts.

ADD REPLY
0
Entering edit mode

It's fine that you are a learner, seta. It's just that most of us do not have the time to tweak and debug scripts for anyone (including ourselves at times). You have to try and implement changes yourself. Just learn a bit of grep, sed and awk and you'll be all set. You can always read up on Python if the situation demands a more complicated approach.

ADD REPLY
2
Entering edit mode

Not a great idea, asking for a tailored script. Not many here have the time to tweak and debug scripts for others.

ADD REPLY
0
Entering edit mode

If you modify one line in Manvendra Singh's Perl script, you will get what you want. Change 1 to 10 or whatever sequence length threshold you want. I also added a condition to ignore sequences that are "Sequence unavailable".

print ">$_" if ((length $seq) > 10 && $seq !~ "Sequence unavailable");
ADD REPLY
0
Entering edit mode

Maybe I'll write the full script as an answer

ADD REPLY
1
Entering edit mode
9.9 years ago
Siva ★ 1.9k

Here's one way to do with grep in two steps

grep -B 1 'Sequence unavailable' INPUT_FILE | sort | uniq > TMP_FILE
grep -v -f TMP_FILE INPUT_FILE > OUTPUT_FILE
  • INPUT_FILE is your multi-FASTA sequence file
  • TMP_FILE is a unique list of FASTA headers you do not want and one 'Sequence unavailable' line
  • OUT_FILE is the file without the 'Sequence unavailable' and their corresponding FASTA headers
ADD COMMENT
0
Entering edit mode

thanks siva. I tried the first one and it's work well. the command that David suggested is exactly my mean

ADD REPLY
0
Entering edit mode

I did not see David's comment before posting which I think is better as it combines these two steps in to one without needing a temp file.

ADD REPLY
0
Entering edit mode

It is actually the same command. The command from Siva is just easier to read and understand. :)

ADD REPLY

Login before adding your answer.

Traffic: 2048 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6