Hi,
It's a pretty simple question, but I cannot find a good answer on the internet: how can I compute the identity between sequences at nucleotide resolution?
Example :
Thanks a lot,
N.
So I downloaded Geneious and first I used ClustalW on these 5 sequences-
>1
ATCT
>2
AGCA
>3
ATGC
>4
ATGG
>5
CGTA
When I opened the alignment file in Geneious-
and if you hover your cursor over any bar-
So as it clearly says- "Mean pairwise identity over all pairs in column". In your image (the image in your post, not mine) let's take column 2. You have 6 pairs. 3 pairs are identical (T-T) and 3 are not (G-T). So (100+100+100+0+0+0)/6 = 50%. In your last column, (0+0+0+0+0+0)/6 = 0, so there is no bar. The colours are assigned accordingly.
EDIT: In response to OP's comment-
So in your image, 2nd column contains-
T
G
T
T
Make all possible pairs- 1st and 2nd (T,G) - not identical, so 0%; 1st and 3rd (T,T) - identical, so 100%; 1st and 4th (T,T) - 100%; 2nd and 3rd (G,T) - 0%; 2nd and 4th (G,T) - 0%; 3rd and 4th (T,T) - 100%.
Now calculate mean pairwise identity - (0+100+100+0+0+100)/6 = 50%
We have divided by 6 because we have 6 pairs (Mean=sum / number of pairs).
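If you want to reproduce this calculation outside Geneious, here is a minimal Python sketch of the same arithmetic. The function names are my own, and it assumes the sequences are already aligned to the same length. It also treats a gap as just another character, which may not be how Geneious handles gaps.

```python
from itertools import combinations

def column_mean_pairwise_identity(column):
    """Mean pairwise identity (in percent) over all pairs in one alignment column."""
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0  # a single sequence has no pairs to compare
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)

def per_column_identity(aligned_seqs):
    """Apply the column calculation to every column of an alignment."""
    # zip(*seqs) walks the alignment column by column
    return [column_mean_pairwise_identity(col) for col in zip(*aligned_seqs)]

# Column 2 from the example above: T, G, T, T
print(column_mean_pairwise_identity("TGTT"))  # 50.0
```

Calling per_column_identity on the list of aligned sequences gives one percentage per column, which is the value each bar in the identity graph represents.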
Hope this helps.
I use alistat from the HMMER package.
alistat reads a multiple sequence alignment from the file alignfile in any supported format (including SELEX, GCG MSF, and CLUSTAL), and shows a number of simple statistics about it. These statistics include the name of the format, the number of sequences, the total number of residues, the average and range of the sequence lengths, the alignment length (e.g. including gap characters).
Also shown are some percent identities. A percent pairwise alignment identity is defined as (idents / MIN(len1, len2)) where idents is the number of exact identities and len1, len2 are the unaligned lengths of the two sequences. The "average percent identity", "most related pair", and "most unrelated pair" of the alignment are the average, maximum, and minimum of all (N)(N-1)/2 pairs, respectively. The "most distant seq" is calculated by finding the maximum pairwise identity (best relative) for all N sequences, then finding the minimum of these N numbers (hence, the most outlying sequence).
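To make the definition above concrete, here is a rough Python sketch of the formula as described. This is not alistat's actual code. The function names and the choice of '-' and '.' as gap characters are my assumptions, and it expects sequences that are already aligned and in the same case.

```python
from itertools import combinations

GAP_CHARS = set("-.")  # assumption: these two characters mark gaps

def pairwise_identity(a, b):
    """idents / MIN(len1, len2) for two aligned sequences."""
    # exact identities, not counting gap-vs-gap columns
    idents = sum(1 for x, y in zip(a, b) if x == y and x not in GAP_CHARS)
    # unaligned lengths: residues only, gaps removed
    len1 = sum(1 for x in a if x not in GAP_CHARS)
    len2 = sum(1 for y in b if y not in GAP_CHARS)
    shortest = min(len1, len2)
    return idents / shortest if shortest else 0.0

def identity_summary(aligned_seqs):
    """Average, maximum and minimum identity over all N(N-1)/2 pairs."""
    values = [pairwise_identity(a, b) for a, b in combinations(aligned_seqs, 2)]
    return {
        "average": sum(values) / len(values),
        "most_related": max(values),
        "most_unrelated": min(values),
    }

print(pairwise_identity("ATCT", "AGCA"))  # 0.5
```

Note that dividing by the shorter unaligned length means a short sequence that matches perfectly over its own length scores 100%, even if the other sequence is much longer.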
Hi, you are probably confused by 'K' and other strange letters. These are IUPAC codes:
http://www.bioinformatics.org/sms/iupac.html
For example, K stands for G or T. This gives some additional information that you would lose by writing just N for the unknown letter. The first position in your sequences is A in all of them, so it's easy. Or what do you mean by identity? How similar all the sequences are to each other, how similar they are to the consensus, or something else?
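In case it helps, here is a small Python lookup for the IUPAC nucleotide codes. The helper names are made up for illustration, and mapping U to T is a simplification on my part. Whether an ambiguity code should count as a match when computing identity is a choice you have to make yourself. This only shows which bases each letter can stand for.

```python
# IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "U": "T",  # simplification: treat RNA U like T
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand(symbol):
    """Set of bases that one IUPAC symbol may represent."""
    return set(IUPAC[symbol.upper()])

def compatible(a, b):
    """True if the two symbols could represent the same base."""
    return bool(expand(a) & expand(b))

print(sorted(expand("K")))   # ['G', 'T']
print(compatible("K", "T"))  # True
print(compatible("K", "A"))  # False
```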
Hi! Can you provide the link from which you took the figure? I think it is somehow related to PSSM and consensus sequences.
It's from Geneious, and the data are test data to illustrate my question. I aligned the four sequences with ClustalW and then opened the output file in Geneious.
You could always try emailing Geneious technical support and asking. I've had to deal with them before and found them very helpful.