Question

How to remove sequences containing ambiguous amino acid codes from a FASTA file?

5.8 years ago

Anisur Rahman ▴ 80

I have a FASTA file with 396 protein sequences (the header on one line and the amino acid codes on several lines). Some of these sequences contain ambiguous or exceptional amino acid codes (e.g., B, J, O, U, Z, X, -- ). I want to remove the sequences containing such codes and generate a new FASTA file. How can I do this in the Ubuntu terminal? Thanks in advance.

sequence • 4.7k views

updated 5.8 years ago by Pierre Lindenbaum 166k • written 5.8 years ago by Anisur Rahman ▴ 80

Not sure what your goal is, but simply removing those ambiguous AAs from the sequence is likely not the best idea (as you will change/destroy the overall context of that protein).

Replacing the ambiguous ones with X, for instance, should work. Normally none of the tools dealing with protein sequences should have a problem with X as an "amino acid".

Reply • 5.8 years ago by lieven.sterck 15k

I want to remove every whole sequence that contains an ambiguous AA code.

Reply • 5.8 years ago by Anisur Rahman ▴ 80

I will use these sequences for population conservancy analysis in the Immune Epitope Database. Sequences containing ambiguous AA codes cause errors during this analysis. I don't know whether replacing the ambiguous codes with X will work; I haven't tried that approach. Could you also suggest how I can replace the ambiguous AA codes with X?

Reply • 5.8 years ago by Anisur Rahman ▴ 80

Building on the one-liner Pierre Lindenbaum provided below:

sed '/^[^>]/s/[BJOUZ]/X/g' in.fa > out.fa

This replaces every occurrence of B, J, O, U, or Z with X on the sequence lines and leaves the header lines (those starting with >) untouched.

Removing the whole sequence makes sense as well, but requires different code.
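
To check what the sed command does before running it on real data, here is a small sketch on a made-up two-record file. The record names seq1 and seq2, their sequences, and the file names are invented for illustration only.

```shell
# Hypothetical test input: seq1 carries ambiguous codes, seq2 does not.
# The header of seq1 deliberately contains B, O and Z as well.
printf '>seq1 BOZ protein\nMKTBLLZ\nAAUJO\n>seq2\nMKTAYIAK\n' > in.fa

# /^[^>]/ restricts the substitution to lines that do not start with '>',
# so only sequence lines are edited.
sed '/^[^>]/s/[BJOUZ]/X/g' in.fa > out.fa

cat out.fa
```

The output keeps the header ">seq1 BOZ protein" as it is, while the sequence lines MKTBLLZ and AAUJO become MKTXLLX and AAXXX. Note that the character class is case-sensitive, so lowercase residues would pass through unchanged.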

Reply • 5.8 years ago by lieven.sterck 15k

Thank you for your explanation. I am trying this code as well. Could you please suggest code to remove the whole sequence?

Reply • 5.8 years ago by Anisur Rahman ▴ 80

score 1 · Answer 1 · 2019-10-09

5.8 years ago

lieven.sterck 15k

If you don't mind ending up with a FASTA file where each sequence is on a single line, you could give the following a try:

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' in.fa | paste - - | grep -v -P '\t.*[BJOUZ]' | tr "\t" "\n" > out.fa

This first puts each sequence on a single line (perl -pe '$. > 1 and /^>/ ? print "\n" : chomp'), then joins header and sequence into one tab-separated line (paste - -), then removes the lines whose sequence part contains the characters you don't want (grep -v -P '\t.*[BJOUZ]'; the -P option is needed because plain grep does not read \t as a tab), and finally splits header and sequence back onto two lines (tr "\t" "\n"). If X and - should also disqualify a sequence, extend the bracket list, e.g. [BJOUZX-], keeping the - last.
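
If the original line wrapping should be kept, one possible alternative is to buffer each record in awk and print it only when no sequence line contains an unwanted code. This is a minimal sketch, not a tested pipeline for any particular dataset: the function name drop_ambiguous and the sample records are hypothetical, and the code list [BJOUZX-] is an assumption taken from the codes named in the question.

```shell
# drop_ambiguous is a hypothetical helper name; 'bad' holds the codes to reject
# (assumed here: B, J, O, U, Z, X and the gap character '-').
drop_ambiguous() {
    awk -v bad='[BJOUZX-]' '
        # Print the buffered record only if it was never flagged.
        function emit() { if (rec != "" && keep) printf "%s", rec }
        /^>/ { emit(); rec = ""; keep = 1 }       # a header starts a new record
        !/^>/ && toupper($0) ~ bad { keep = 0 }   # flag record; toupper also catches lowercase
        { rec = rec $0 "\n" }                     # buffer every line exactly as written
        END { emit() }                            # do not forget the last record
    ' "$@"
}

# Made-up input: seq1 contains B, seq3 contains a lowercase x, seq2 is clean.
printf '>seq1\nMKTB\nLLAA\n>seq2\nMKTA\nYIAK\n>seq3\nMKxA\n' > in.fa
drop_ambiguous in.fa > out.fa
cat out.fa
```

On this sample only seq2 survives, still wrapped over two lines (MKTA and YIAK). Because headers are never tested against the code list, a B or Z in a sequence name does not cause a record to be dropped.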