For EMBOSS needle, and the other EMBOSS pairwise alignment programs (global or local), report these statistics as part of the statistics section of the default alignment output format. For example for an EMBOSS needle alignment of UniProtKB:WAP_RAT and UniProtKB:WAP_MOUSE the alignment reports:
#=======================================
#
# Aligned_sequences: 2
# 1: WAP_RAT
# 2: WAP_MOUSE
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 138
# Identity: 91/138 (65.9%)
# Similarity: 101/138 (73.2%)
# Gaps: 5/138 ( 3.6%)
# Score: 461.5
#
#
#=======================================
Here the 'Length' reported is the length of the alignment (not the sequences aligned), so the number of match, mismatch and gaps would be:
- Match: 101
- Mismatch: 138 - 101 - 5 = 32
- Gaps: 5
You can confirm this by counting the consensus symbols in the alignment ('|' = identity, ':' = similarity, '.' = mismatch and ' ' gap).
In NCBI BLAST the statistics reported for the alignment include equivalent information. For example for a 'blastp' of UniProtKB:WAP_RAT vs. UniProtKB:WAP_MOUSE:
>lcl|SP:WAP_MOUSE P01173 Whey acidic protein OS=Mus musculus GN=Wap PE=1 SV=3
Length=134
Score = 180 bits (457), Expect = 4e-57
Identities = 88/129 (68%), Positives = 97/129 (75%), Gaps = 1/129 (1%)
So that would be:
- Match: 97
- Mismatch: 129 - 97 - 1 = 31
- Gaps: 1
Note: these numbers are different from those given by EMBOSS needle, since 'blastp' returns a local alignment, instead of the global alignment produced by EMBOSS needle. In addition for BLAST multiple HSPs can be returned, uncommon for protein alignments but does occur, and you would have to figure out how you wanted to handle these.
For the FASTA suite programs, while some of the information is reported in the extended summary formats (e.g. with '-m 9' the alignment length ('alen') and the number of gap "residues" in the query ('gapq') and subject ('gapl') are reported), the alignment produced would have to be parsed to get the required numbers, since this information does not include the match/mismatch counts. That said given the numbers provided all that would be required would be to parse the consensus symbol line to count the number of identities (':') and similarities ('.'). So for a 'ggsearch' of UniProtKB:WAP_RAT vs. UniProtKB:WAP_MOUSE:
From the summary information:
The best scores are: n-w bits E(541561) %_id %_sim gnw alen an0 ax0 pn0 px0 an1 ax1 pn1 px1 gapq gapl fs
SP:WAP_RAT P01174 Whey acidic protein OS=Rattus norvegicus G ( 137) 1005 95.3 0 1.000 1.000 1005 137 1 137 1 137 1 137 1 137 0 0 0
SP:WAP_MOUSE P01173 Whey acidic protein OS=Mus musculus GN=W ( 134) 620 66.4 1.5e-164 0.659 0.775 620 138 1 137 1 137 1 134 1 134 1 4 0
And the alignment:
10 20 30 40 50 60 70 80 90
sp|P0 MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIAA-GPCPKD
::: :::::::::::::::.::.:.:::::::: . : : :::: :::::::::: ::::.::::. :::::: : ::: :::: .: :.. ::::
SP:WAP MRCLISLVLGLLALEVALAQNLEEQVFNSVQSMFQKASPIEGTECIICQTNEECAQNAMCCPGSCGRTRKTPVNIGVPKAGFCPWNLLQTISSTGPCPMK
10 20 30 40 50 60 70 80 90 100
100 110 120 130
sp|P0 NPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ
:: : .:::.::::. :.:.: : : ::..:
SP:WAP IECSSDRECSGNMKCCNVDCVMTCTPPVP----VITLQ
110 120 130
That gives:
- Match: 107
- Mismatch: 138 - (1 + 4) - 107 = 26
- Gaps: 1 + 4 = 5
Or, slightly easier for an alignment like this when do it by hand, by counting mismatches in the consensus:
- Match: 138 - 31 = 107
- Mismatch: 31 - (1 + 4) = 26
- Gaps: 1 + 4 = 5
While 'ggsearch' uses the same alignment method as EMBOSS needle, the default scoring matrix is different (BLOSUM50 vs. BLOSUM62) and the calling of similarities works a little differently:
- FASTA suite programs (e.g. 'fasta', 'ggsearch' and 'ssearch'): negative scores are mismatches
- EMBOSS programs: positive scores are matches
- NCBI BLAST: positive scores are matches
This means that the FASTA programs include matches with a zero score in the matrix (e.g. 'S' vs. 'Q') as similarities but EMBOSS and NCBI BLAST do not. This is why the numbers are a little bit different for EMBOSS needle and FASTA 'ggsearch'.
For my example the differences in gap scoring do not have any impact, but in a larger comparison the behaviour of the gap parameters would also have to be taken in to account.
Protein alignment is different from sequence alignment as it uses a substitution matrix that scores the substitution of one amino acids to other. If you want to do a straightforward alignment then you can use any string alignment algorithm but you will have to decide proper mismatch, match and gap penalty scores. http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
my question is: does any of these software give me for two aligned sequences how many amino acids are matched/unmatched and how many gap are inserted in the alignment?
most tools will give you these stats, (e.g. blastp does in its standard format), not sure about needle. However, blast does local alignment not global. Try ggsearch (http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml) with format option
-m 9
should give all the info you need.Needle does give you those statistics. Try looking at the beginning of the output, right before the alignment itself.