Question

How can I remove sequences that contain ambiguous amino acids from a multi-FASTA file (with Python)?

2.3 years ago

M. ▴ 40

I have a FASTA file with numerous protein sequences (each header on one line, with the amino acid codes spread over several lines). Some of these sequences contain ambiguous or exceptional amino acid codes (e.g., B, J, O, U, Z, X, -). I want to remove the sequences containing such codes and write the rest to a new FASTA file. How can I do this in Python? I managed to remove one amino acid code (X) at a time with the following code, but how can I remove them all at once?

from Bio import SeqIO

sequences = SeqIO.parse("sequences.fasta", "fasta")
filtered = [seq for seq in sequences if seq.seq.count('X') == 0]

with open('sequences_without_Xs', 'wt') as output:
    SeqIO.write(filtered, output, 'fasta')

python removing ambiguous amino acids • 1.5k views

score 3 · Accepted Answer · 2022-07-22

2.3 years ago

Andrzej Zielezinski 11k

This code keeps only the sequences made up entirely of the 20 standard amino acid codes. Any record containing at least one other character (B, J, O, U, Z, X, a gap, and so on) is dropped.

from Bio import SeqIO

# One-letter codes of the 20 standard amino acids
AMINOACIDS = set('ACDEFGHIKLMNPRSTWVQY')

with open('sequences_valid.fasta', 'w') as output:
    for seq_record in SeqIO.parse("sequences.fasta", "fasta"):
        # An empty difference means every residue is a standard amino acid
        if not set(seq_record.seq).difference(AMINOACIDS):
            output.write(seq_record.format('fasta'))
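The same whitelist check can be pulled out into a small helper that works on plain strings, which makes it easy to try without a FASTA file. This is only a sketch: the name `has_only_standard_residues` is made up for illustration, and upper-casing the sequence first is an added assumption (it lets lower-case residues pass, whereas the code above rejects them).

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def has_only_standard_residues(sequence: str) -> bool:
    """Return True if every character is one of the 20 standard amino acids."""
    # Assumption: lower-case residues count as valid, so normalise case first.
    # Drop .upper() to reproduce the case-sensitive behaviour of the answer.
    return set(sequence.upper()) <= STANDARD_AMINO_ACIDS


print(has_only_standard_residues("MKTAYIAKQR"))  # True
print(has_only_standard_residues("MKTXYIAKQR"))  # False (contains X)
print(has_only_standard_residues("MKT-YIAKQR"))  # False (contains a gap)
```

With Biopython, it could be called as `has_only_standard_residues(str(seq_record.seq))` inside the loop.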

Thank you!! That works for every possible flaw.

Reply • 2.3 years ago by M. ▴ 40

score 2 · Accepted Answer · 2022-07-22

2.3 years ago

Mensur Dlakic ★ 28k

Replacing your filtered line with the code below should do the trick.

filtered = [
    seq
    for seq in sequences
    if seq.seq.count("X") == 0
    and seq.seq.count("B") == 0
    and seq.seq.count("J") == 0
    and seq.seq.count("O") == 0
    and seq.seq.count("U") == 0
    and seq.seq.count("Z") == 0
]
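The chain of `count` calls can also be written as a single set test, which makes it easier to extend the list of unwanted codes. The sketch below is one possible way to do it; the names `AMBIGUOUS_CODES` and `contains_ambiguous` are hypothetical, and the gap character `-` is included because the question mentions it, although the list comprehension above does not check for it.

```python
# Codes to reject; "-" is added here as an assumption based on the question.
AMBIGUOUS_CODES = frozenset("BJOUZX-")


def contains_ambiguous(sequence: str, codes: frozenset = AMBIGUOUS_CODES) -> bool:
    """Return True if the sequence contains at least one of the given codes."""
    # isdisjoint() is True when the two collections share no characters,
    # so negating it detects any unwanted code. Like count(), it is
    # case-sensitive.
    return not codes.isdisjoint(sequence)


print(contains_ambiguous("MKTAYIAKQR"))  # False
print(contains_ambiguous("MKTXYIAKQR"))  # True
```

In the original script the filter would then read `filtered = [seq for seq in sequences if not contains_ambiguous(str(seq.seq))]`.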