Question

Removing identifier with "unavailable sequence" from FASTA file

1

Entering edit mode

9.9 years ago

seta ★ 1.9k

Dear all,

I have several FASTA nucleotide file, containing millions identifiers and their sequence. Among them, there are some identifier like this, ">AT1G01340|AT1G01340.2 Sequence unavailable". Would you please let me know how I can remove them? looking forward to hearing your helpful commands. thanks

sequence genome blast RNA-Seq • 6.6k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by seta ★ 1.9k

0

Entering edit mode

wouldn't this work for you ? It would remove identifiers where sequence is not available

#!/usr/bin/perl
use strict;
use warnings;

$/="\n>";
while (<>) {
 s/>//g;
 my ($id, $seq) = split (/\n/, $_);
 print ">$_" if ((length $seq) > 1);
}

save it with .pl extension and run it on your file.

hth

ADD REPLY • link updated 9.9 years ago by Ram 44k • written 9.9 years ago by Manvendra Singh ★ 2.2k

0

Entering edit mode

Thanks, but it does not work at all.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by seta ★ 1.9k

1

Entering edit mode

Oh yes, I had thought that "Sequence unavailable" means that you have empty lines.

Anyways David and Siva have already given you answer, nevertheless, you can edit above script's last line

Instead of (length $seq) > 1, write ($seq) !~ "Sequence"

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Manvendra Singh ★ 2.2k

0

Entering edit mode

It should work. Did you run it with your file? (perl script.pl input.fasta)

ADD REPLY • link 2.7 years ago by Ram 44k

0

Entering edit mode

print ">$_" if ((length $seq) > 1);

Would this not print everything? "Sequence unavailable" would also satisfy the condition.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Siva ★ 1.9k

0

Entering edit mode

Oh yes, I had thought that "Sequence unavailable" means that he has empty lines.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Manvendra Singh ★ 2.2k

0

Entering edit mode

Yeah, it was not clear from the OP. I first thought that "Sequence unavailable" was part of the FASTA header until they posted an actual example.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Siva ★ 1.9k

0

Entering edit mode

OP asked this as a follow up to another question I answered yesterday. It was midnight so I postponed answering the follow up question, and woke up to a new question addressing this specific issue - I guess OP got a bit impatient and opened a new post without giving it the full context.

ADD REPLY • link 2.7 years ago by Ram 44k

1

Entering edit mode

Oh! Yes. I remember. Its the post where you and Pierre were answering.

Dear seta, keep patience dear, its very important in science.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Manvendra Singh ★ 2.2k

3

Entering edit mode

And if you take the code snippets, please try to understand them and don't just use them. Learning by doing, you know? ;)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by David Langenberger 11k

1

Entering edit mode

Yeah, I don't mind though - less work for me :)

ADD REPLY • link 2.7 years ago by Ram 44k

2

Entering edit mode

9.9 years ago

David Langenberger 11k

If there is no sequence below these headers, you could try something like this:

grep -v "Sequence unavailable" file.fasta > newFile.fasta

It will delete all headers containing "Sequence unavailable". But before you do that, double-check that there is no sequence under these headers!

ADD COMMENT • link 9.9 years ago by David Langenberger 11k

0

Entering edit mode

Oh! I provided above script where sequences is not available at all.

Does he have the "Sequence unavailable" as the text in fasta record or its just empty which he has named as sequence unavailable?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Manvendra Singh ★ 2.2k

0

Entering edit mode

The "Sequence unavailable" exists as text, my mean is exactly like,">AT 1G01340|AT1G01340.

Sequence unavailable". thanks for your help me in advance.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by seta ★ 1.9k

0

Entering edit mode

Thanks, using this command just words "sequence unavailable" is disappeared, but its identifier is remained. would please help me to remove both of them from a fasta file?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by seta ★ 1.9k

0

Entering edit mode

Can you provide an example of your fasta file? Just... 5 entries with at least one being the "sequence unavailable"? :)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by David Langenberger 11k

0

Entering edit mode

Does it look like 1, 2?

1.

>AT 1G01340|AT1G01340.
Sequence unavailable

2.

>AT 1G01340|AT1G01340. Sequence unavailable

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by David Langenberger 11k

0

Entering edit mode

Sure. that's it:

>AT1G07745|AT1G07745.2
GCCTCCTTGGCTATATATTCCACGCCCATCGTTTGGATGTGCAAGAAGAGAACCACATAA
ATGAATGATCAGATGGATTTCTTGTTTCGGTTTTACATGTAGAATTATACTGCAGCAATG
AAAATTAAAATTTGTTGAACAAATGAGTGGTGTGTATATAAGGATCACATGAATATAGAA
TCACATTTTAAAACTTTGCGCGCACACC
>AT1G08845|AT1G08845.2
Sequence unavailable
>AT1G11870|AT1G11870.1
TTTTTAATGGTTAGCTCTTTGCAATTCTTGCTGTGATTCCTCTTGTACTTGAGGTTTTGT
TCATTATACTTTTCACAGGCGCTGAGTTTTTTGTGTTCTTGTCTCAGAACATACCATTTA
TAGTTTATATCATCATAACCGATGTTCTTTGATTCAATAATAACGAAAGGGAAGTATACA
TT
>AT1G10670|AT1G10670.3
GCTTCTCTCAGTACCTTCTATGACCAAAACTTTGTCTGTGTTTTAGAGCCTTTATTTACT
GTGGTTAAGATTACTCAAGCTATAAGATACTTGCAATTTCTTGAAAACTTCTGTTGTTCG
ATTCTCTTTCCCCTAACGTTTTCTTCAGATTCAATAAATAATCGTTACTTTTTTTTCCCC
TCT

>AT1G08010|AT1G08010.2
Seqence unavailable

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by seta ★ 1.9k

4

Entering edit mode

Try this:

grep -v "$(grep -B 1 "Sequence unavailable" file.fa)" file.fa

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by David Langenberger 11k

1

Entering edit mode

Many thanks David, it's great. It wil be great if you please share me how your perfect command is applied for removing short sequences( say less than 10 bp) with their identifiers?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by seta ★ 1.9k

0

Entering edit mode

Well, that's not possible with grep. Therefore, you might want to look into perl, or awk.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by David Langenberger 11k

0

Entering edit mode

I'm new in this field and need more help. It's my lucky if you could please share me the useful script to this end (removing short sequences( say less than 10 bp) with their identifiers from fasta file)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by seta ★ 1.9k

2

Entering edit mode

Where would be the fun, if you don't try it yourself? Sorry, but this forum is not for getting free ready-to-use scripts. At least not from me.... don't be too angry with me. :)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by David Langenberger 11k

0

Entering edit mode

Please accept my apologize for many questions. I'm learning all day and night, but as I never pass any course for even some basic concepts, I try to ask just my problems from your experts.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by seta ★ 1.9k

0

Entering edit mode

It's fine that you are a learner, seta. It's just that most of us do not have the time to tweak and debug scripts for anyone (including ourselves at times). You have to try and implement changes yourself. Just learn a bit of grep, sed and awk and you'll be all set. You can always read up on Python if the situation demands a more complicated approach.

ADD REPLY • link 2.7 years ago by Ram 44k

2

Entering edit mode

Not a great idea, asking for a tailored script. Not many here have the time to tweak and debug scripts for others.

ADD REPLY • link 2.7 years ago by Ram 44k

0

Entering edit mode

If you modify one line in Manvendra Singh's Perl script, you will get what you want. Change 1 to 10 or whatever sequence length threshold you want. I also added a condition to ignore sequences that are "Sequence unavailable".

print ">$_" if ((length $seq) > 10 && $seq !~ "Sequence unavailable");

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Siva ★ 1.9k

0

Entering edit mode

Maybe I'll write the full script as an answer

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Manvendra Singh ★ 2.2k

1

Entering edit mode

9.9 years ago

Siva ★ 1.9k

Here's one way to do with grep in two steps

grep -B 1 'Sequence unavailable' INPUT_FILE | sort | uniq > TMP_FILE
grep -v -f TMP_FILE INPUT_FILE > OUTPUT_FILE

INPUT_FILE is your multi-FASTA sequence file
TMP_FILE is a unique list of FASTA headers you do not want and one 'Sequence unavailable' line
OUT_FILE is the file without the 'Sequence unavailable' and their corresponding FASTA headers

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Siva ★ 1.9k

0

Entering edit mode

thanks siva. I tried the first one and it's work well. the command that David suggested is exactly my mean

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by seta ★ 1.9k

0

Entering edit mode

I did not see David's comment before posting which I think is better as it combines these two steps in to one without needing a temp file.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Siva ★ 1.9k

0

Entering edit mode

It is actually the same command. The command from Siva is just easier to read and understand. :)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by David Langenberger 11k

Ram · Accepted Answer · 2015-01-22

3

Entering edit mode

9.9 years ago

Manvendra Singh ★ 2.2k

I have modified the perl script which should work for your cause

#!/usr/bin/perl
use strict;
use warnings;

$/="\n>";

while (<>) {
 s/>//g;
  my ($id, $seq) = split (/\n/, $_);
  print ">$_" if ((length $seq) > 10 && $seq !~ "Sequence unavailable");
}

hth

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Manvendra Singh ★ 2.2k

0

Entering edit mode

thanks a lot dear friend.

ADD REPLY • link 9.9 years ago by seta ★ 1.9k