I have thousands of sequences that each contain a certain distribution of a specific amino acid (in my case cysteine). I would like a way of grouping these sequences by distribution similarity. Here is an example:
SEQ_A 7 18 32 48 67 100
SEQ_B 26 56 89 112 138 178
SEQ_C 20 44 71 94 120 160
SEQ_D 11 26 44 54 67 94
SEQ_X is the sequence ID and each number is the position of a C in the sequence SEQ_X.
I would either like to order these by "similarity" or find some way to obtain a score that quantifies the distribution.
How would I go about doing this?
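One possible way to get a score, sketched under some assumptions that are not in the post: every sequence has the same number of cysteines, and "similar distribution" means "similar spacing between consecutive cysteines". The helper names (`gaps`, `gap_distance`, `rank_by_similarity`) are made up for illustration, and Euclidean distance is just one reasonable choice of metric.

```python
import math

# Cysteine positions copied from the example above.
positions = {
    "SEQ_A": [7, 18, 32, 48, 67, 100],
    "SEQ_B": [26, 56, 89, 112, 138, 178],
    "SEQ_C": [20, 44, 71, 94, 120, 160],
    "SEQ_D": [11, 26, 44, 54, 67, 94],
}


def gaps(pos):
    """Spacing between consecutive cysteines; ignores where the first one sits."""
    return [b - a for a, b in zip(pos, pos[1:])]


def gap_distance(p, q):
    """Euclidean distance between two gap profiles (smaller = more similar)."""
    gp, gq = gaps(p), gaps(q)
    if len(gp) != len(gq):
        # Assumption of this sketch: equal cysteine counts only.
        raise ValueError("sequences must have the same number of cysteines")
    return math.dist(gp, gq)


def rank_by_similarity(query, table):
    """Return (distance, name) pairs for all other sequences, closest first."""
    return sorted(
        (gap_distance(table[query], pos), name)
        for name, pos in table.items()
        if name != query
    )


for dist, name in rank_by_similarity("SEQ_A", positions):
    print(f"{name}\t{dist:.2f}")
```

With the four example sequences this ranks SEQ_D as the closest to SEQ_A, followed by SEQ_C and SEQ_B. Sequences with different cysteine counts would need a different representation (for example a binned histogram of positions), which this sketch does not attempt.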
Is aa the most relevant tag for this post? Not amino acids or distribution or similarity, but aa?

Can't it be done by hierarchical clustering?
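A rough sketch of what hierarchical clustering could look like here, in plain Python so it runs without extra packages. The distance used (Euclidean distance between the gaps separating consecutive cysteines) and the average-linkage rule are assumptions chosen for illustration, not something stated in the post, and the function names are hypothetical. It also assumes every sequence has the same number of cysteines.

```python
import math
from itertools import combinations

# Cysteine positions copied from the example in the question.
positions = {
    "SEQ_A": [7, 18, 32, 48, 67, 100],
    "SEQ_B": [26, 56, 89, 112, 138, 178],
    "SEQ_C": [20, 44, 71, 94, 120, 160],
    "SEQ_D": [11, 26, 44, 54, 67, 94],
}


def gaps(pos):
    """Spacing between consecutive cysteines."""
    return [b - a for a, b in zip(pos, pos[1:])]


def distance(a, b):
    """Euclidean distance between the gap profiles of two named sequences."""
    return math.dist(gaps(positions[a]), gaps(positions[b]))


def average_linkage(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    return sum(distance(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def agglomerate(names):
    """Naive agglomerative clustering; returns the merges in the order made."""
    clusters = [(n,) for n in names]  # start with one cluster per sequence
    merges = []
    while len(clusters) > 1:
        # Find the two clusters that are currently closest to each other.
        height, c1, c2 = min(
            ((average_linkage(x, y), x, y) for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        # Replace them with their union and record the merge height.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, height))
    return merges


merges = agglomerate(sorted(positions))
for left, right, height in merges:
    print(f"{'+'.join(left)}  <->  {'+'.join(right)}  at {height:.2f}")
```

On the four example sequences this joins SEQ_B with SEQ_C first, then SEQ_A with SEQ_D, and finally the two pairs. The loop recomputes every linkage at each step, so it is only meant to show the idea; for thousands of sequences a library routine such as `scipy.cluster.hierarchy.linkage` would probably be the more practical choice.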