Hello,
I'm working on a small bioinformatics project for my class. My idea requires comparing two lists of protein sequences of the same length with each other and expressing how similar they are, for example as a percentage.
Say I have two arrays, A and B, each containing 20 aligned protein sequences of the same type and roughly the same length: the same protein, but from different organisms. Let's assume array A contains protein sequences from mammals and array B contains protein sequences from birds. My goal is to estimate the similarity or genetic distance between these two groups of species using the given sequences.
Any ideas on how to approach this? One idea I had was to first align the sequences in both arrays, then create an "average" sequence for each array using the most common amino acid at each position, and finally compare the two resulting sequences to calculate a similarity percentage. But I'm not sure this approach would be accurate. Wouldn't it result in a skewed percentage?
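To make the concern concrete, here is a minimal sketch of the "average" sequence idea (usually called a consensus sequence) next to a plain average of all between-group pairwise identities. It assumes every sequence has already been aligned together so all strings have the same length, with "-" marking gaps. The function names and the short toy strings are made up for illustration and are not real protein data.

```python
from collections import Counter
from itertools import product


def consensus(aligned):
    """Most common symbol in each column of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def percent_identity(a, b):
    """Percent of columns where a and b match, skipping columns that are gaps in both."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def mean_between_group_identity(group_a, group_b):
    """Average percent identity over every (a, b) pair with a in group_a and b in group_b."""
    scores = [percent_identity(a, b) for a, b in product(group_a, group_b)]
    return sum(scores) / len(scores)


# Hypothetical toy alignment (invented strings, not real sequences).
mammals = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK"]
birds = ["MKTGYIAR", "MRTGYIAR", "MKTGFIAR"]

consensus_score = percent_identity(consensus(mammals), consensus(birds))
pairwise_score = mean_between_group_identity(mammals, birds)
print(f"consensus vs consensus: {consensus_score:.1f}%")
print(f"mean pairwise identity: {pairwise_score:.1f}%")
```

On this toy data the consensus comparison gives 75.0%, while the mean between-group pairwise identity is about 58.3%. The consensus discards within-group variation, so it can overstate how similar the two groups are, which is the kind of skew I'm worried about.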
Thanks in advance.
There are a few different approaches that could be useful for determining a similarity matrix: