Hi all!
I need to cluster sequences at a 30-40% identity threshold. Which tool would you recommend?
I've tried CD-HIT, but it's not recommended for such low identity. I also looked for a way to cluster by identity threshold in MUMmer, kalign and clustalO, but I could not find one.
Any help would be much appreciated!
Edit: I have two types of input files to cluster: a plain FASTA file with multiple sequences and an aligned FASTA file (from kalign).
Best, Agata
Other tools that come to mind are USEARCH and VSEARCH (https://drive5.com/usearch/manual/uclust_algo.html), but they work quite similarly to CD-HIT. Because the identity threshold is so low, it would help if you shared your end goal. Maybe others can suggest a better solution than clustering.
These commands may also be of interest, but you will need to filter and parse the output yourself afterwards:
https://drive5.com/usearch/manual/cmd_allpairs_local.html
https://drive5.com/usearch/manual/cmd_allpairs_global.html
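To give an idea of the filtering/parsing step, here is a minimal sketch. It assumes the all-pairs run was written out as a 12-column BLAST-style tabular file (query, target, percent identity, ...), for example via an option like -blast6out; the function name filter_hits and the example rows are made up for illustration.

```python
import csv
import io


def filter_hits(handle, min_identity=30.0):
    """Yield (query, target, identity) for hits at or above min_identity.

    Assumes tab-separated rows with query in column 1, target in
    column 2 and percent identity in column 3.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query, target, identity = row[0], row[1], float(row[2])
        if query != target and identity >= min_identity:
            yield query, target, identity


# Made-up example rows; in practice pass an open file handle instead.
example = io.StringIO(
    "seqA\tseqB\t35.2\t120\t70\t3\t1\t120\t1\t118\t*\t*\n"
    "seqA\tseqC\t22.0\t110\t80\t5\t1\t110\t1\t108\t*\t*\n"
    "seqB\tseqC\t41.5\t115\t60\t2\t1\t115\t1\t113\t*\t*\n"
)
hits = list(filter_hits(example))
for query, target, identity in hits:
    print(query, target, identity)
```

The pairs that survive the filter can then be treated as edges of a graph, with clusters taken as its connected components or fed to any other clustering method.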
Thank you for your suggestions. I aligned the sequences with kalign and computed a percent identity matrix with clustalo. I saw that my sequences are not very similar to each other and decided to cluster them at 30-40% identity (which is the mean of the identity matrix). CD-HIT gave me absurd results, and the other programs didn't have an option to set an identity threshold.
In the meantime I found this question, which is very similar to mine: "Clustering sequence on similarity using percentage identity matrix". I will try the solutions mentioned there.
Best, Agata
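For the aligned FASTA case, one simple option is to compute pairwise identity directly from the alignment and group sequences by single linkage at the chosen threshold. The following is only a sketch under stated assumptions: identity is counted over columns where neither sequence has a gap (clustalo's matrix may use a different definition), and the helper names and the toy alignment are invented for illustration.

```python
from itertools import combinations


def read_fasta(lines):
    """Parse FASTA lines into a dict of {name: sequence}."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def percent_identity(a, b):
    """Identity over columns where neither aligned sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def single_linkage(records, threshold):
    """Group sequences linked by any pair at or above threshold."""
    parent = {name: name for name in records}

    def find(x):
        # Follow parent links to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if percent_identity(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in records:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(c) for c in clusters.values())


# Toy alignment, made up for illustration.
aligned = """>s1
ACDEFGHIKL
>s2
ACDEFGHAAA
>s3
WWWWW-YYYY
>s4
WWWWWTYYYA
""".splitlines()

clusters = single_linkage(read_fasta(aligned), threshold=35.0)
print(clusters)
```

Single linkage tends to chain distantly related sequences together at low thresholds, so average or complete linkage on the same identity values may give tighter clusters.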
How about generating tlsh hashes of your sequences and then using the xref command to get pairwise distances? You can then cluster the resulting distance matrix (my preferred clustering algorithm for small to medium-sized sets is affinity propagation). Depending on how long your sequences are, it may be more sensible to create (perhaps exhaustive) mash sketches, calculate their pairwise distances with the dist command, and again cluster the resulting distance matrix with affinity propagation or something else.
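As a rough sketch of the step between the pairwise distances and the clustering, the snippet below turns tab-separated rows (reference ID, query ID, distance, then further columns, as in mash dist output) into a symmetric distance matrix. The function name read_mash_dist, the example rows, and the choice to fill missing pairs with 1.0 are assumptions made for illustration.

```python
import io


def read_mash_dist(handle, missing=1.0):
    """Build (names, matrix) from tab-separated pairwise distance rows.

    Assumes columns: reference ID, query ID, distance, then anything
    else. Pairs absent from the input get `missing` (an assumption).
    """
    distances = {}
    names = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ref, query, dist = fields[0], fields[1], float(fields[2])
        for name in (ref, query):
            if name not in names:
                names.append(name)
        # Store both directions so the matrix comes out symmetric.
        distances[(ref, query)] = dist
        distances[(query, ref)] = dist
    matrix = [
        [0.0 if a == b else distances.get((a, b), missing) for b in names]
        for a in names
    ]
    return names, matrix


# Made-up example rows; in practice pass an open file handle instead.
example = io.StringIO(
    "a.fa\ta.fa\t0\t0\t1000/1000\n"
    "a.fa\tb.fa\t0.12\t1e-50\t300/1000\n"
    "a.fa\tc.fa\t0.30\t1e-5\t20/1000\n"
    "b.fa\tc.fa\t0.28\t1e-6\t25/1000\n"
)
names, matrix = read_mash_dist(example)
for name, row in zip(names, matrix):
    print(name, row)
```

Affinity propagation implementations usually expect similarities rather than distances (scikit-learn's AffinityPropagation with affinity="precomputed" does), so the matrix would need to be converted first, for example by negating the distances.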