UniRef50/UniRef90 are really useful clustered databases. I'm interested in trying a similar approach to this nested clustering but with my own protein database.
Are there specific commands that were used for UniRef clustering with MMseqs2?
I couldn't find these documented anywhere.
This is awesome! Thank you. I stumbled across the "help" page https://www.uniprot.org/help/uniref which gives a general description. I've translated the description to commands using `easy-cluster` and `easy-linclust`. Does this seem to be in accord with your steps above using the more modular implementation?
I think this is a faster solution that one may need to use for 100+ million sequences. I don't know how it compares to the solution I outlined above, but one needs to balance accuracy with resources. I have done about 3.5 million sequences as described above, and I think it took about a day for the first clustering step (to 90%). Subsequent steps are faster if you start from an already clustered database.
You seem to be mixing and matching `easy-cluster` and `easy-linclust`. Note that these are not the same algorithms. I'm also not sure what params UniRef used, but the coverage mode and clustering mode may not match what you've used here either.
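For anyone landing here later, a minimal sketch of the nested idea that sticks to a single algorithm (`easy-linclust`) for both levels. The thresholds (90% and 50% identity, 80% coverage) are my reading of the UniRef help page, not UniRef's actual command lines, and `--cov-mode` and `--cluster-mode` are left at their defaults, so they may well differ from what UniRef used. The function name, the `clu90`/`clu50` prefixes, and the `DRY_RUN` switch are hypothetical and only there for illustration.

```shell
# Sketch only: nested clustering in the spirit of UniRef90 -> UniRef50.
# Thresholds are assumptions based on the UniRef help page.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# nested_cluster INPUT_FASTA [TMP_DIR]  (hypothetical helper)
nested_cluster() {
  local input_fasta="$1"
  local tmp_dir="${2:-tmp}"

  # Level 1: cluster all input sequences at 90% identity, 80% coverage.
  run mmseqs easy-linclust "$input_fasta" clu90 "$tmp_dir" --min-seq-id 0.9 -c 0.8

  # Level 2: cluster only the level-1 representatives at 50% identity.
  # easy-linclust writes the representatives to <prefix>_rep_seq.fasta.
  run mmseqs easy-linclust clu90_rep_seq.fasta clu50 "$tmp_dir" --min-seq-id 0.5 -c 0.8
}
```

Running `DRY_RUN=1 nested_cluster my_proteins.fasta` prints the two commands without executing them. Swapping both calls to `easy-cluster` should give the slower, more sensitive variant; the point is to use the same algorithm at both levels rather than mixing them.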