Is there a tool that can compute amino acid usage statistics for multiple aligned sequences?
I can do it in Excel, but I would like to know whether there is a tool that does the job more conveniently.
Thanks!
You can probably use something like Jalview to get what you want, but to my mind, it doesn't get much easier than:
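from Bio import AlignIO
from collections import Counter
import sys
# Read the alignment; the file path is the first command-line argument
aln = AlignIO.read(sys.argv[1], 'phylip')
# Count the residues (gaps included) in each alignment column
for i in range(aln.get_alignment_length()):
    print(Counter(aln[:, i]))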
For example, given this input alignment:
16 149
PAU_02775 MSTTPEQIAV EYPIPTYRFV VSLGDEQIPF NSVSGLDISH DVIEYKDGTG
PLT_01696 MSTTPEQIAV EYPIPTYRFV VSIGDEQIPF NSVSGLDISH DVIEYKDGTG
PAK_02606 MSTTPEQIAV EYPIPTYRFV VSIGDEQVPF NSVSGLDISH DVIEYKDGTG
PLT_01736 MSTTPEQIAV EYPIPTYRFV VSIGDEKVPF NSVSGLDISH DVIEYKDGTG
PAK_01896 MTTTT----V DYPIPAYRFV VSVGDEQIPF NNVSGLDITY DVIEYKDGTG
PAU_02074 MATTT----V DYPIPAYRFV VSVGDEQIPF NSVSGLDITY DVIEYKDGTG
PLT_02424 MSVTTEQIAV DYPIPTYRFV VSVGDEQIPF NNVSGLDITY DVIEYKDGTG
PLT_01716 MTITPEQIAV DYPIPAYRFV VSVGDEKIPF NNVSGLDVHY DVIEYKDGTG
PLT_01758 MAITPEQIAV EYPIPTYRFV VSVGDEQIPF NNVSGLDVHY DVIEYKDGIG
PAK_03203 MSTSTSQIAV EYPIPVYRFI VSIGDDQIPF NSVSGLDINY DTIEYRDGVG
PAU_03392 MSTSTSQIAV EYPIPVYRFI VSVGDEKIPF NSVSGLDISY DTIEYRDGVG
PAK_02014 MSITQEQIAA EYPIPSYRFM VSIGDVQVPF NSVSGLDRKY EVIEYKDGIG
PAU_02206 MSITQEQIAA EYPIPSYRFM VSIGDVQVPF NSVSGLDRKY EVIEYKDGIG
PAK_01787 MSTTADQIAV QYPIPTYRFV VTIGDEQMCF QSVSGLDISY DTIEYRDGVG
PAU_01961 MSTTADQIAV QYPIPTYRFV VTIGDEQMCF QSVSGLDISY DTIEYRDGVG
PLT_02568 MSTTVDQIAV QYPIPTYRFV VTVGDEQMSF QSVSGLDISY DTIEYRDGIG
NYYKMPGQRQ AINISLRKGV FSGDTKLFDW INSIQLNQVE KKDISISLTN
NYYKMPGQRQ AINISLRKGV FSGDTKLFDW INSIQLNQVE KKDISISLTN
NYYKMPGQRQ AINISLRKGV FSGDTKLFDW INSIQLNQVE KKDISISLTN
NYYKMPGQRQ AINITLRKGV FSGDTKLFDW LNSIQLNQVE KKDISISLTN
NYYKMPGQRQ LINITLRKGV FPGDTKLFDW LNSIQLNQVE KKDVSISLTN
NYYKMPGQRQ LINITLRKGV FPGDTKLFDW LNSIQLNQVE KKDVSISLTN
NHYKMPGQRQ LINITLRKGV FPGDTKLFDW LNSIQLNQVE KKDVSISLTN
NYYKMPGQRQ SINITLRKGV FPGDTKLFDW INSIQLNQVE KKDIAISLTN
NYYKMPGQRQ SINITLRKGV FPGDTKLFDW INSIQLNQVE KKDIAISLTN
NWFKMPGQSQ LVNITLRKGV FPGKTELFDW INSIQLNQVE KKDITISLTN
NWFKMPGQSQ STNITLRKGV FPGKTELFDW INSIQLNQVE KKDITISLTN
NYYKMPGQIQ RVDITLRKGI FSGKNDLFNW INSIELNRVE KKDITISLTN
NYYKMPGQIQ RVDITLRKGI FSGKNDLFNW INSIELNRVE KKDITISLTN
NWLQMPGQRQ RPTITLKRGI FKGQSKLYDW INSISLNQIE KKDISISLTD
NWLQMPGQRQ RPTITLKRGI FKGQSKLYDW INSISLNQIE KKDISISLTD
NWLQMPGQRQ RPSITLKRGI FKGQSKLYDW INSISLNQIE KKDISISLTD
EAGTEILMTW SVANAFPTSL TSPSFDATSN EVAVQEITLT ADRVTIQAA
EAGTEILMTW SVANAFPTSL ISPSFDATSN EVAVQEITLT ADRVTIQAA
EAGTEILMTW SVANAFPTSL TSPSFDATSN EVAVQEITLT ADRVTIQAA
EAGTEILMTW SVANAFPTSL TAPAFDATSN EVAVQEISLT ADRVTIQAA
ETGTEILMSW SVANAFPTSL TSPSFDATSN DIAVQEIKLT ADRVTIQAA
EVGTEILMTW SVANAFPTSL TSPSFDATSN DIAVQEIKLT ADRVTIQAA
EAGTEILMSW SVANAFPTSL TSPSFDATSN DIAVQEIKLT ADRVMIQAA
ETGSQILMTW NVANAFPTSF TSPSFDAASN DIAIQEIALV ADRVTIQAP
EAGTEILMTW NVANAFPTSF TSPSFDATSN EIAVQEIALT ADRVTIQAA
DAGTELLMTW NVSNAFPTSL TSPSFDATSN DIAVQEITLT ADRVIMQAV
DAGTELLMTW NVSNAFPTSL TSPSFDATSN DIAVQEITLM ADRVIMQAV
DTGSEVLMSW VVSNAFPSSL TAPSFDASSN EIAVQEISLV ADRVTIQVP
DTGSKVLMSW VVSNAFPSSL TAPSFDASSN EIAVQEISLV ADRVTIQVP
ETGSNLLITW NIANAFPEKL TAPSFDATSN EVAVQEMSLK ADRVTVEFH
ETGSNLLITW NIANAFPEKL TAPSFDATSN EVAVQEISLK ADRVTVEFH
ETGSNLLITW NIANAFPEKL TAPSFDATSN EVAVQEISLK ADRVTVEFH
The result would be:
$ python script.py inputseqs.phy
Counter({'M': 16})
Counter({'S': 12, 'A': 2, 'T': 2})
... # Truncated output to stay in post character limit.
Counter({'V': 16})
Counter({'T': 13, 'I': 2, 'M': 1})
Counter({'I': 11, 'V': 3, 'M': 2})
Counter({'Q': 13, 'E': 3})
Counter({'A': 11, 'F': 3, 'V': 2})
Counter({'A': 8, 'P': 3, 'H': 3, 'V': 2})
If you want to ignore gaps, you'll have to do something slightly different.
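One way to do that, as a rough sketch rather than the only approach: filter the gap characters out of each column before counting. The helper name `column_counts` and its `gap_chars` parameter are made up for illustration, and the three short rows are just the first ten columns of three sequences from the alignment above, hard-coded so the snippet runs without an input file.

```python
from collections import Counter


def column_counts(column, gap_chars="-"):
    """Count residues in one alignment column, skipping gap characters.

    `column` is a string with one character per sequence.
    `gap_chars` is an assumption: adjust it if your alignment uses '.' or similar.
    """
    return Counter(res for res in column if res not in gap_chars)


# First ten columns of three sequences from the alignment above
rows = [
    "MSTTPEQIAV",  # PAU_02775
    "MTTTT----V",  # PAK_01896
    "MATTT----V",  # PAU_02074
]

for i in range(len(rows[0])):
    # Build the column by taking character i from every row
    column = "".join(row[i] for row in rows)
    print(i + 1, column_counts(column))
```

With the Biopython script above, `aln[:, i]` is already a plain string for column `i`, so `column_counts(aln[:, i])` should slot straight into the loop in place of `Counter(aln[:, i])`.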
If the input files are FASTA, then you could narrow your search to a tool that summarizes amino acid usage in FASTA files.
Yes, the input is an aligned .fasta file.
What is it exactly you want to know?
Are these hypothetical nucleotide or protein alignments?
I want to know, for each aligned position, the percentage of each amino acid, e.g. 80% A, 10% G, 10% V.
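For per-position percentages, a minimal sketch could look like the following. The function names `column_percentages` and `format_column` are hypothetical, gaps are assumed to be written as `-` and are left out of the denominator (change that if you want gaps to count), and the example column is the second column of the 16-sequence alignment posted in the answer.

```python
from collections import Counter


def column_percentages(column, gap_chars="-"):
    """Return {residue: percentage} for one alignment column.

    Gap characters are excluded from both the counts and the total
    (an assumption; include them if gaps should count as a state).
    """
    counts = Counter(res for res in column if res not in gap_chars)
    total = sum(counts.values())
    if total == 0:
        # Column contains only gaps
        return {}
    # most_common() orders residues from most to least frequent
    return {res: 100.0 * n / total for res, n in counts.most_common()}


def format_column(percentages):
    """Render the percentages as e.g. '80.0% A, 10.0% G, 10.0% V'."""
    return ", ".join(f"{pct:.1f}% {res}" for res, pct in percentages.items())


# Second column of the alignment in the answer, top to bottom
column = "SSSSTASTASSSSSSS"
print(format_column(column_percentages(column)))
```

Since your input is an aligned FASTA file, you should be able to read it with Biopython using `AlignIO.read(path, "fasta")` and then call `column_percentages(aln[:, i])` for each `i` in `range(aln.get_alignment_length())`.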