Question

Protein Sequence Alignment Basic Statistics

0

Entering edit mode

11.8 years ago

Maria ▴ 170

Hi,

I want to perform a protein global pairwise alignment, and do some basic statistics i.e. number of match, mismatch and gaps. Is there any tool that do this for me? I tried blastp and emboss needle they do the alignment but I couldn't figure out if I can get automatically the wanted statistics.

Thanks in advance

protein alignment statistics • 5.9k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 11.8 years ago by Maria ▴ 170

0

Entering edit mode

Protein alignment is different from sequence alignment as it uses a substitution matrix that scores the substitution of one amino acids to other. If you want to do a straightforward alignment then you can use any string alignment algorithm but you will have to decide proper mismatch, match and gap penalty scores. http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

ADD REPLY • link 11.8 years ago by Ashutosh Pandey 12k

0

Entering edit mode

my question is: does any of these software give me for two aligned sequences how many amino acids are matched/unmatched and how many gap are inserted in the alignment?

ADD REPLY • link 11.8 years ago by Maria ▴ 170

0

Entering edit mode

most tools will give you these stats, (e.g. blastp does in its standard format), not sure about needle. However, blast does local alignment not global. Try ggsearch (http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml) with format option -m 9 should give all the info you need.

ADD REPLY • link 11.8 years ago by Michael 56k

0

Entering edit mode

Needle does give you those statistics. Try looking at the beginning of the output, right before the alignment itself.

ADD REPLY • link 11.8 years ago by João Rodrigues ★ 2.5k

score 3 · Answer 1 · 2013-11-01

For EMBOSS needle, and the other EMBOSS pairwise alignment programs (global or local), report these statistics as part of the statistics section of the default alignment output format. For example for an EMBOSS needle alignment of UniProtKB:WAP_RAT and UniProtKB:WAP_MOUSE the alignment reports:

#=======================================
#
# Aligned_sequences: 2
# 1: WAP_RAT
# 2: WAP_MOUSE
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 138
# Identity:      91/138 (65.9%)
# Similarity:   101/138 (73.2%)
# Gaps:           5/138 ( 3.6%)
# Score: 461.5
# 
#
#=======================================

Here the 'Length' reported is the length of the alignment (not the sequences aligned), so the number of match, mismatch and gaps would be:

Match: 101
Mismatch: 138 - 101 - 5 = 32
Gaps: 5

You can confirm this by counting the consensus symbols in the alignment ('|' = identity, ':' = similarity, '.' = mismatch and ' ' gap).

In NCBI BLAST the statistics reported for the alignment include equivalent information. For example for a 'blastp' of UniProtKB:WAP_RAT vs. UniProtKB:WAP_MOUSE:

>lcl|SP:WAP_MOUSE P01173 Whey acidic protein OS=Mus musculus GN=Wap PE=1 SV=3
Length=134

 Score =  180 bits (457),  Expect = 4e-57
 Identities = 88/129 (68%), Positives = 97/129 (75%), Gaps = 1/129 (1%)

So that would be:

Match: 97
Mismatch: 129 - 97 - 1 = 31
Gaps: 1

Note: these numbers are different from those given by EMBOSS needle, since 'blastp' returns a local alignment, instead of the global alignment produced by EMBOSS needle. In addition for BLAST multiple HSPs can be returned, uncommon for protein alignments but does occur, and you would have to figure out how you wanted to handle these.

For the FASTA suite programs, while some of the information is reported in the extended summary formats (e.g. with '-m 9' the alignment length ('alen') and the number of gap "residues" in the query ('gapq') and subject ('gapl') are reported), the alignment produced would have to be parsed to get the required numbers, since this information does not include the match/mismatch counts. That said given the numbers provided all that would be required would be to parse the consensus symbol line to count the number of identities (':') and similarities ('.'). So for a 'ggsearch' of UniProtKB:WAP_RAT vs. UniProtKB:WAP_MOUSE:

From the summary information:

The best scores are:                                                n-w bits E(541561)    %_id  %_sim  gnw  alen  an0  ax0  pn0  px0  an1  ax1 pn1 px1 gapq gapl  fs 
SP:WAP_RAT P01174 Whey acidic protein OS=Rattus norvegicus G ( 137) 1005 95.3       0    1.000 1.000 1005  137    1  137    1  137    1  137    1  137   0   0   0
SP:WAP_MOUSE P01173 Whey acidic protein OS=Mus musculus GN=W ( 134)  620 66.4 1.5e-164    0.659 0.775  620  138    1  137    1  137    1  134    1  134   1   4   0

And the alignment:

               10        20        30        40        50        60        70        80        90          
sp|P0  MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIAA-GPCPKD
       ::: :::::::::::::::.::.:.:::::::: .  :  : :::: :::::::::: ::::.::::. :::::: : ::: :::: .: :.. ::::  
SP:WAP MRCLISLVLGLLALEVALAQNLEEQVFNSVQSMFQKASPIEGTECIICQTNEECAQNAMCCPGSCGRTRKTPVNIGVPKAGFCPWNLLQTISSTGPCPMK
               10        20        30        40        50        60        70        80        90       100

     100       110       120       130       
sp|P0  NPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ
         :: : .:::.::::.  :.:.:  : :    ::..:
SP:WAP IECSSDRECSGNMKCCNVDCVMTCTPPVP----VITLQ
              110       120           130

That gives:

Match: 107
Mismatch: 138 - (1 + 4) - 107 = 26
Gaps: 1 + 4 = 5

Or, slightly easier for an alignment like this when do it by hand, by counting mismatches in the consensus:

Match: 138 - 31 = 107
Mismatch: 31 - (1 + 4) = 26
Gaps: 1 + 4 = 5

While 'ggsearch' uses the same alignment method as EMBOSS needle, the default scoring matrix is different (BLOSUM50 vs. BLOSUM62) and the calling of similarities works a little differently:

FASTA suite programs (e.g. 'fasta', 'ggsearch' and 'ssearch'): negative scores are mismatches
EMBOSS programs: positive scores are matches
NCBI BLAST: positive scores are matches

This means that the FASTA programs include matches with a zero score in the matrix (e.g. 'S' vs. 'Q') as similarities but EMBOSS and NCBI BLAST do not. This is why the numbers are a little bit different for EMBOSS needle and FASTA 'ggsearch'.

For my example the differences in gap scoring do not have any impact, but in a larger comparison the behaviour of the gap parameters would also have to be taken in to account.