Statistics of amino acid usage of multiple aligned sequences?
5.8 years ago
johnnytam100 ▴ 110

Is there a tool that computes amino acid usage statistics for a multiple sequence alignment?

I can do it in Excel, but I would like to know whether an existing tool does the job more conveniently.

Thanks!

alignment • 2.1k views

If the input files are FASTA, then you could narrow your search to a tool that summarizes amino acid usage in FASTA files.

Yes, the input is an aligned .fasta file.

What exactly do you want to know?

Are these hypothetical nucleotide or protein alignments?

For each aligned position, I want to know the percentage of each amino acid, e.g. 80% A, 10% G, 10% V.

5.8 years ago
Joe 21k

You can probably use something like Jalview to get what you want, but to my mind, it doesn't get much easier than:

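from collections import Counter
import sys

from Bio import AlignIO

# Read the alignment named on the command line
# (use 'fasta' instead of 'phylip' for an aligned FASTA file).
aln = AlignIO.read(sys.argv[1], 'phylip')

# Count the residues in each alignment column.
for i in range(aln.get_alignment_length()):
    print(Counter(aln[:, i]))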

e.g. given this input alignment:

    16    149
PAU_02775  MSTTPEQIAV EYPIPTYRFV VSLGDEQIPF NSVSGLDISH DVIEYKDGTG
PLT_01696  MSTTPEQIAV EYPIPTYRFV VSIGDEQIPF NSVSGLDISH DVIEYKDGTG
PAK_02606  MSTTPEQIAV EYPIPTYRFV VSIGDEQVPF NSVSGLDISH DVIEYKDGTG
PLT_01736  MSTTPEQIAV EYPIPTYRFV VSIGDEKVPF NSVSGLDISH DVIEYKDGTG
PAK_01896  MTTTT----V DYPIPAYRFV VSVGDEQIPF NNVSGLDITY DVIEYKDGTG
PAU_02074  MATTT----V DYPIPAYRFV VSVGDEQIPF NSVSGLDITY DVIEYKDGTG
PLT_02424  MSVTTEQIAV DYPIPTYRFV VSVGDEQIPF NNVSGLDITY DVIEYKDGTG
PLT_01716  MTITPEQIAV DYPIPAYRFV VSVGDEKIPF NNVSGLDVHY DVIEYKDGTG
PLT_01758  MAITPEQIAV EYPIPTYRFV VSVGDEQIPF NNVSGLDVHY DVIEYKDGIG
PAK_03203  MSTSTSQIAV EYPIPVYRFI VSIGDDQIPF NSVSGLDINY DTIEYRDGVG
PAU_03392  MSTSTSQIAV EYPIPVYRFI VSVGDEKIPF NSVSGLDISY DTIEYRDGVG
PAK_02014  MSITQEQIAA EYPIPSYRFM VSIGDVQVPF NSVSGLDRKY EVIEYKDGIG
PAU_02206  MSITQEQIAA EYPIPSYRFM VSIGDVQVPF NSVSGLDRKY EVIEYKDGIG
PAK_01787  MSTTADQIAV QYPIPTYRFV VTIGDEQMCF QSVSGLDISY DTIEYRDGVG
PAU_01961  MSTTADQIAV QYPIPTYRFV VTIGDEQMCF QSVSGLDISY DTIEYRDGVG
PLT_02568  MSTTVDQIAV QYPIPTYRFV VTVGDEQMSF QSVSGLDISY DTIEYRDGIG

           NYYKMPGQRQ AINISLRKGV FSGDTKLFDW INSIQLNQVE KKDISISLTN
           NYYKMPGQRQ AINISLRKGV FSGDTKLFDW INSIQLNQVE KKDISISLTN
           NYYKMPGQRQ AINISLRKGV FSGDTKLFDW INSIQLNQVE KKDISISLTN
           NYYKMPGQRQ AINITLRKGV FSGDTKLFDW LNSIQLNQVE KKDISISLTN
           NYYKMPGQRQ LINITLRKGV FPGDTKLFDW LNSIQLNQVE KKDVSISLTN
           NYYKMPGQRQ LINITLRKGV FPGDTKLFDW LNSIQLNQVE KKDVSISLTN
           NHYKMPGQRQ LINITLRKGV FPGDTKLFDW LNSIQLNQVE KKDVSISLTN
           NYYKMPGQRQ SINITLRKGV FPGDTKLFDW INSIQLNQVE KKDIAISLTN
           NYYKMPGQRQ SINITLRKGV FPGDTKLFDW INSIQLNQVE KKDIAISLTN
           NWFKMPGQSQ LVNITLRKGV FPGKTELFDW INSIQLNQVE KKDITISLTN
           NWFKMPGQSQ STNITLRKGV FPGKTELFDW INSIQLNQVE KKDITISLTN
           NYYKMPGQIQ RVDITLRKGI FSGKNDLFNW INSIELNRVE KKDITISLTN
           NYYKMPGQIQ RVDITLRKGI FSGKNDLFNW INSIELNRVE KKDITISLTN
           NWLQMPGQRQ RPTITLKRGI FKGQSKLYDW INSISLNQIE KKDISISLTD
           NWLQMPGQRQ RPTITLKRGI FKGQSKLYDW INSISLNQIE KKDISISLTD
           NWLQMPGQRQ RPSITLKRGI FKGQSKLYDW INSISLNQIE KKDISISLTD

           EAGTEILMTW SVANAFPTSL TSPSFDATSN EVAVQEITLT ADRVTIQAA
           EAGTEILMTW SVANAFPTSL ISPSFDATSN EVAVQEITLT ADRVTIQAA
           EAGTEILMTW SVANAFPTSL TSPSFDATSN EVAVQEITLT ADRVTIQAA
           EAGTEILMTW SVANAFPTSL TAPAFDATSN EVAVQEISLT ADRVTIQAA
           ETGTEILMSW SVANAFPTSL TSPSFDATSN DIAVQEIKLT ADRVTIQAA
           EVGTEILMTW SVANAFPTSL TSPSFDATSN DIAVQEIKLT ADRVTIQAA
           EAGTEILMSW SVANAFPTSL TSPSFDATSN DIAVQEIKLT ADRVMIQAA
           ETGSQILMTW NVANAFPTSF TSPSFDAASN DIAIQEIALV ADRVTIQAP
           EAGTEILMTW NVANAFPTSF TSPSFDATSN EIAVQEIALT ADRVTIQAA
           DAGTELLMTW NVSNAFPTSL TSPSFDATSN DIAVQEITLT ADRVIMQAV
           DAGTELLMTW NVSNAFPTSL TSPSFDATSN DIAVQEITLM ADRVIMQAV
           DTGSEVLMSW VVSNAFPSSL TAPSFDASSN EIAVQEISLV ADRVTIQVP
           DTGSKVLMSW VVSNAFPSSL TAPSFDASSN EIAVQEISLV ADRVTIQVP
           ETGSNLLITW NIANAFPEKL TAPSFDATSN EVAVQEMSLK ADRVTVEFH
           ETGSNLLITW NIANAFPEKL TAPSFDATSN EVAVQEISLK ADRVTVEFH
           ETGSNLLITW NIANAFPEKL TAPSFDATSN EVAVQEISLK ADRVTVEFH

The result would be:

$ python script.py inputseqs.phy
Counter({'M': 16})
Counter({'S': 12, 'A': 2, 'T': 2})
 ... # Truncated output to stay in post character limit.
Counter({'V': 16})
Counter({'T': 13, 'I': 2, 'M': 1})
Counter({'I': 11, 'V': 3, 'M': 2})
Counter({'Q': 13, 'E': 3})
Counter({'A': 11, 'F': 3, 'V': 2})
Counter({'A': 8, 'P': 3, 'H': 3, 'V': 2})

If you want to ignore gaps, you'll have to do something slightly different.
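
As a rough sketch of what "slightly different" could look like, the snippet below drops gap characters from each column before counting and reports percentages rather than raw counts, which is what the question asked for. The names `column_percentages`, `print_usage` and `GAP_CHARS` are made up for this illustration, and it assumes gaps are written as `-` or `.`; adjust it if your alignment uses something else.

```python
from collections import Counter

GAP_CHARS = "-."  # assumption: gaps appear as '-' or '.'


def column_percentages(sequences, ignore_gaps=True):
    """Return one {residue: percent} dict per alignment column."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("all sequences must have the same aligned length")
    results = []
    for column in zip(*sequences):
        counts = Counter(column)
        if ignore_gaps:
            for gap in GAP_CHARS:
                counts.pop(gap, None)
        total = sum(counts.values())
        if total == 0:
            # Column contains only gaps: nothing to report.
            results.append({})
            continue
        # most_common() keeps the most frequent residue first.
        results.append({res: 100.0 * n / total for res, n in counts.most_common()})
    return results


def print_usage(path, fmt="fasta"):
    """Print per-position percentages for an alignment file (hypothetical helper)."""
    from Bio import AlignIO  # imported here so column_percentages works without Biopython

    aln = AlignIO.read(path, fmt)
    seqs = [str(rec.seq) for rec in aln]
    for pos, pct in enumerate(column_percentages(seqs), start=1):
        print(pos, ", ".join(f"{p:.0f}% {res}" for res, p in pct.items()))
```

Calling `print_usage("inputseqs.phy", "phylip")` would run it on the alignment above. Note that with gaps ignored, the percentages at a position are relative to the sequences that have a residue there, not to all 16 sequences.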

Hi Joe, I am interested in your answer too. I am not a computer scientist and have downloaded Jalview. How do I enter the code to count each amino acid at each position?

You can't use the above code with Jalview. I was suggesting the two as alternatives. Jalview might have some scripting capability (I don't know), but if it does, it would be in Java, not Python.
