Question

sequence similarity


4.3 years ago

lorenzinip • 0

Hi, I have a FASTA file with hundreds of sequences, each 300 nt long. I would like to check the similarity of one sequence against all the other sequences. Any suggestions on how to approach this? Thanks

blast linux fasta • 1.6k views

updated 4.3 years ago by Joe 22k • written 4.3 years ago by lorenzinip • 0


Are you starting from a multiple sequence alignment?

You might look at creating distance matrices (e.g. for phylogenetic studies). The 'distance' will often not be expressed as % similarity, but it will still give you a measure of similarity.
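To make the distance-matrix idea concrete, here is a minimal sketch that assumes the sequences are already aligned to the same length. It computes the p-distance (the proportion of compared sites that differ) for every pair. The function names and the toy alignment are illustrative only and are not taken from any particular phylogenetics package.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def distance_matrix(aligned):
    """Return a symmetric nested dict: matrix[name1][name2] -> p-distance."""
    matrix = {name: {name: 0.0} for name in aligned}
    for n1, n2 in combinations(aligned, 2):
        d = p_distance(aligned[n1], aligned[n2])
        matrix[n1][n2] = d
        matrix[n2][n1] = d
    return matrix


# Toy alignment (made-up data, for illustration only).
aligned = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "AC-TTCGA",
}
matrix = distance_matrix(aligned)
for name, row in matrix.items():
    print(name, {other: round(d, 3) for other, d in row.items()})
```

If you want something closer to a percentage, `1 - p_distance` gives a simple percent-identity style value over the compared columns.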

4.3 years ago by lieven.sterck 15k

score 0 · Answer 1 · 2021-04-20


4.3 years ago

5heikki 11k

Use e.g. cd-hit or vsearch



Another tool that can be used, and is much faster, is MMseqs2.

4.3 years ago by Sej Modha 5.3k

score 0 · Answer 2 · 2021-04-20

If your sequences are already aligned and/or the same length, or you do not want to align them, you can use a simple string distance measure such as the Levenshtein (edit) distance, or a k-mer based method.

This will be quick but less accurate than an alignment-based comparison, and it won't necessarily capture biologically meaningful patterns. Depending on your use case, it may still be appropriate.

I keep a few examples of string comparison metrics along with some implementation code here:

https://github.com/jrjhealey/bioinfo-tools/blob/master/StringComparisons.py
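For the one-against-all case in the question, a rough sketch along these lines might look like the following. It is not the code from the linked repository: the function names, the choice of k, and the tiny in-memory FASTA are all assumptions made for illustration. It scores one query against every other record using a normalised Levenshtein similarity and a k-mer Jaccard index.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file or iterable of lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions), two-row DP."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


def kmer_jaccard(a, b, k=3):
    """Jaccard index of the two sequences' k-mer sets (alignment-free)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def one_vs_all(query_name, records, k=3):
    """Score the query against every other record, best edit similarity first."""
    seqs = dict(records)
    query = seqs[query_name]
    scores = [
        (name, levenshtein_similarity(query, seq), kmer_jaccard(query, seq, k))
        for name, seq in seqs.items() if name != query_name
    ]
    return sorted(scores, key=lambda row: row[1], reverse=True)


# Made-up example data; in practice use open("your_sequences.fasta").
fasta = io.StringIO(
    ">q\nACGTACGTAC\n>a\nACGTACGTAC\n>b\nACGTTCGTAC\n>c\nTTTTTTTTTT\n"
)
results = one_vs_all("q", read_fasta(fasta))
for name, edit_sim, jaccard in results:
    print(f"{name}\t{edit_sim:.3f}\t{jaccard:.3f}")
```

For a few hundred 300 nt sequences the pure-Python edit distance is fast enough for a single query, though a compiled implementation would be a better fit for all-vs-all comparisons.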