I have a FASTA file with 396 protein sequences (each header on one line, with the amino acid codes spread over several lines). Some of these sequences contain ambiguous or exceptional amino acid codes (e.g., B, J, O, U, Z, X, -). I want to remove the sequences containing such codes and generate a new FASTA file. How can I do this in the Ubuntu terminal? Thanks in advance.
Not sure what your goal is, but simply removing those ambiguous AAs from the sequence is likely not the best idea (you will change or destroy the overall context of that protein).
Replacing the ambiguous ones with X, for instance, should work. Normally none of the tools dealing with protein sequences should have a problem with X as an "amino acid".
I want to remove every whole sequence that contains an ambiguous AA code.
I will use these sequences for population conservancy analysis in the Immune Epitope Database. Sequences containing ambiguous AAs cause an error during this analysis. I don't know whether replacing the ambiguous codes with X will work, since I haven't tried that approach. Could you also suggest how I can replace the ambiguous AAs with X?
Building on the one-liner Pierre Lindenbaum provided below:
This will replace all occurrences of B, J, O, U, Z with an X.
Removing the whole sequence makes sense as well, but it will require different code.
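For anyone who prefers a script over a one-liner, here is a minimal Python sketch of the same replacement. It is not Pierre's command, just an illustration; the function name `mask_ambiguous` is made up for this example, and it assumes header lines start with `>` and must be left untouched (otherwise letters in the sequence IDs would be rewritten too).

```python
# Map each ambiguous code to X (upper case) or x (lower case).
# Assumption: only B, J, O, U, Z are masked, as described above.
MASK_TABLE = str.maketrans("BJOUZbjouz", "XXXXXxxxxx")


def mask_ambiguous(fasta_text: str) -> str:
    """Return the FASTA text with B/J/O/U/Z replaced by X in sequence lines only."""
    masked_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header line: keep as-is so IDs and descriptions are not altered.
            masked_lines.append(line)
        else:
            masked_lines.append(line.translate(MASK_TABLE))
    return "\n".join(masked_lines) + "\n"
```

To use it on a file, read the file's contents into a string, pass it to `mask_ambiguous`, and write the result to a new file.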
Thank you for your explanation. I am trying this code as well. Could you please suggest code to remove the whole sequence?
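One possible way to drop whole records is sketched below in Python. It is only a sketch under a few assumptions: the set of unwanted codes is the one listed in the question (B, J, O, U, Z, X and the gap character `-`), matching is case-insensitive, and the names `read_fasta`, `drop_ambiguous` and `filter_fasta_file` are invented for this example. Adjust the `AMBIGUOUS` set if your analysis tolerates some of these codes.

```python
# Codes that disqualify a sequence (assumption: taken from the question's list).
AMBIGUOUS = set("BJOUZX-")


def read_fasta(lines):
    """Yield (header, [sequence lines]) for each record in an iterable of lines."""
    header, seq_lines = None, []
    for raw in lines:
        line = raw.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, seq_lines
            header, seq_lines = line, []
        elif header is not None and line:
            # Sequence line belonging to the current record; blank lines are skipped.
            seq_lines.append(line)
    if header is not None:
        yield header, seq_lines


def drop_ambiguous(lines):
    """Return the lines of all records whose sequence has no ambiguous code."""
    kept = []
    for header, seq_lines in read_fasta(lines):
        residues = set("".join(seq_lines).upper())
        if not (residues & AMBIGUOUS):
            kept.append(header)
            kept.extend(seq_lines)  # original line wrapping is preserved
    return kept


def filter_fasta_file(in_path, out_path):
    """Read in_path and write only the unambiguous records to out_path."""
    with open(in_path) as src:
        kept = drop_ambiguous(src)
    with open(out_path, "w") as dst:
        for line in kept:
            dst.write(line + "\n")
```

For example, calling `filter_fasta_file("input.fasta", "clean.fasta")` (both file names are placeholders) writes a new file containing only the sequences without any of the listed codes. Counting the `>` lines in both files afterwards is a quick check of how many of the 396 sequences were removed.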