Hi, I have a FASTA file with hundreds of sequences, each 300 nt long. I would like to check the similarity of one sequence against all the other sequences. Any suggestions on how to approach this? Thanks
If your sequences are already aligned and/or the same length, or you do not want to align them, you can use a simple edit distance measure such as the Levenshtein distance, or a k-mer based method.
These are quick to compute but less accurate than an alignment-based comparison, and they won't necessarily capture meaningful biological patterns. Depending on your use case, that may still be appropriate.
I keep a few examples of string comparison metrics along with some implementation code here:
https://github.com/jrjhealey/bioinfo-tools/blob/master/StringComparisons.py
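To illustrate the edit distance approach, here is a minimal sketch in plain Python that scores one query sequence against every other record. It is not the code from the repository linked above; the function names (`read_fasta`, `levenshtein`, `similarity`, `one_vs_all`) and the toy records are made up for this example, and normalising the distance by the longer sequence length is just one reasonable choice.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {record_id: sequence} dict."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def one_vs_all(records, query_id):
    """Score the query against every other record, most similar first."""
    query = records[query_id]
    scores = [(name, similarity(query, seq))
              for name, seq in records.items() if name != query_id]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy in-memory FASTA; for a real file, pass an open file handle instead.
    demo = """>query
ACGTACGT
>seqA
ACGTACGA
>seqB
TTTTACGT
""".splitlines()
    for name, score in one_vs_all(read_fasta(demo), "query"):
        print(f"{name}\t{score:.3f}")
```

For a real file you could call `read_fasta(open("sequences.fasta"))`, where the file name is a placeholder. With a few hundred 300 nt sequences the pure-Python loop should be fast enough for a single query, though a compiled implementation would be worth considering for all-vs-all comparisons.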
Are you starting from a multiple sequence alignment?
You might look at creating a distance matrix (e.g. for phylogenetic studies). The 'distance' will often not be expressed as % similarity, but it will still give you a measure of how similar the sequences are.
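As a rough sketch of the distance matrix idea, the snippet below computes a pairwise p-distance (the proportion of differing positions) for sequences that are assumed to be already aligned and of equal length. The names `p_distance` and `distance_matrix` and the toy records are hypothetical. Gap characters are treated like any other symbol here, which is a simplification. The p-distance happens to convert directly to identity as 1 - p; the model-corrected distances used in phylogenetics generally do not.

```python
def p_distance(a, b):
    """Proportion of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def distance_matrix(records):
    """Return (names, matrix) where matrix[i][j] is the p-distance
    between records[names[i]] and records[names[j]]."""
    names = list(records)
    size = len(names)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            d = p_distance(records[names[i]], records[names[j]])
            # The matrix is symmetric, so fill both halves at once.
            matrix[i][j] = d
            matrix[j][i] = d
    return names, matrix


if __name__ == "__main__":
    # Toy aligned records; real input would come from your alignment.
    aligned = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA"}
    names, matrix = distance_matrix(aligned)
    print("\t" + "\t".join(names))
    for name, row in zip(names, matrix):
        print(name + "\t" + "\t".join(f"{value:.2f}" for value in row))
```

The row for your sequence of interest gives its distance to every other sequence, so the full matrix is only needed if you also want to cluster the sequences or build a tree.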