How to convert an amino acid sequence to numbers
2.8 years ago

I load a FASTA file like this:

files = list(SeqIO.parse("proteasomes.fasta", "fasta"))

This gives a list of proteasome amino acid sequences like "MALASVLERPLPVNQRGFFGLGGRADLLDLGPGSLSDGLSL..."

I want to convert each letter of a sequence to the number specified in this dictionary:

AMINO_ACID_TO_ID = {'0': 0,
                'A': 1,
                'C': 2,
                'D': 3,
                'E': 4,
                'F': 5,
                'G': 6,
                'H': 7,
                'I': 8,
                'K': 9,
                'L': 10,
                'M': 11,
                'N': 12,
                'P': 13,
                'Q': 14,
                'R': 15,
                'S': 16,
                'T': 17,
                'V': 18,
                'W': 19,
                'Y': 20}

Here is the code I tried, which did not work:

converted = np.asarray[AMINO_ACID_TO_ID[(files[0].seq)]]

Any quick way to do this?
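
One possible approach, sketched under the assumption that every residue in a sequence is a key of the dictionary: convert the sequence to a plain string with str(), look up each letter, and pass the resulting list to np.asarray. Note that np.asarray is a function, so it is called with parentheses rather than indexed with square brackets. The encode helper and the short sample string are illustrative and not part of the original post; the dictionary is rebuilt from a string only to keep the example compact.

```python
import numpy as np

# Same mapping as in the question: '0' -> 0, 'A' -> 1, ..., 'Y' -> 20
AMINO_ACID_TO_ID = {aa: i for i, aa in enumerate("0ACDEFGHIKLMNPQRSTVWY")}


def encode(sequence):
    """Map each residue letter to its integer ID.

    Raises KeyError for letters missing from the dictionary (e.g. 'X').
    """
    return np.asarray([AMINO_ACID_TO_ID[aa] for aa in str(sequence)])


# With Biopython records this would be encode(files[0].seq);
# a plain string is used here so the sketch runs without a FASTA file.
converted = encode("MALASVLERPL")
print(converted)
```

To encode every record, a list comprehension such as [encode(rec.seq) for rec in files] would work the same way.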


Out of curiosity, how will you distinguish between, for instance, two As (1, 1) and one M (11)?


Not sure yet; I was thinking of a NumPy array. The goal is to use the FASTA files as input data for a neural net to generate similar sequences.


Ah, OK, you keep them in an array (it would indeed only be a problem if you printed them on a single line again).


That is a good point, because they would need to be converted back to letters at the end.
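
A minimal sketch of the reverse step, assuming the IDs are kept as separate integers (in a list or array) rather than concatenated into one digit string, so that 1, 1 and 11 stay distinguishable. The names ID_TO_AMINO_ACID and decode are hypothetical helpers, not from the thread.

```python
# Same letter order as the dictionary in the question: index == ID
AMINO_ACIDS = "0ACDEFGHIKLMNPQRSTVWY"

# Inverted mapping: 0 -> '0', 1 -> 'A', ..., 20 -> 'Y'
ID_TO_AMINO_ACID = dict(enumerate(AMINO_ACIDS))


def decode(ids):
    """Turn a sequence of integer IDs back into a string of residue letters."""
    # int() also accepts NumPy integer scalars, so arrays can be passed in
    return "".join(ID_TO_AMINO_ACID[int(i)] for i in ids)


decoded = decode([11, 1, 10, 1])
print(decoded)
```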

2.8 years ago
Mensur Dlakic ★ 28k

You may want to read about one-hot encoding of amino acids. Out of curiosity, why is 0 not used for an amino acid?
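
For illustration, one-hot encoding could be sketched as follows, assuming the same 21-entry mapping as in the question. Each residue becomes a row with a single 1 in the column given by its ID. Column 0 belongs to the '0' key, whose purpose this thread leaves open. The one_hot helper is a hypothetical name, not an API of any library mentioned here.

```python
import numpy as np

# Same letter order as the dictionary in the question: index == ID
AMINO_ACIDS = "0ACDEFGHIKLMNPQRSTVWY"
AMINO_ACID_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Return an array of shape (len(sequence), 21) with one 1 per row."""
    ids = np.asarray([AMINO_ACID_TO_ID[aa] for aa in str(sequence)])
    # Indexing the identity matrix by ID picks the matching one-hot row
    return np.eye(len(AMINO_ACIDS), dtype=np.float32)[ids]


encoded = one_hot("MALA")
print(encoded.shape)

# argmax along each row recovers the integer IDs
recovered = encoded.argmax(axis=1)
print(recovered)
```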


Thank you for your sources. I copied that part of the code, so I'm not sure why 0 is not used.
