Hi everyone,
According to UniProtKB, the human proteome contains 20,370 reviewed proteins. I would like to create a 20,370 x 20,370 matrix containing all pairwise protein sequence identities or similarities (ranging from 0 to 1). I would very much appreciate any hints regarding the following:
(a) Have pairwise protein sequence identities or similarities already been pre-computed and made available for users to download? I am familiar with the UniRef clusters at 100%, 90% and 50% sequence identity; however, what I am interested in is the pairwise sequence identities, not so much the sequence clusters.
(b) A number of robust tools have already been developed to calculate sequence similarities / identities and cluster proteins, e.g. MMseqs2, Clustal Omega or blastall. Are there any other good tools you are familiar with for an all-against-all pairwise sequence similarity calculation? It would be great if you could share them on this thread.
Any hints would be greatly appreciated.
Thanks, Sergio
Not sure how you would come up with a score between 0 and 1. Proteins can be of very different sizes, e.g. insulin vs titin. You could force them all to start at amino acid 1, but any identity matrix you generate that way would be a theoretical exercise.
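To make the size problem concrete, here is a minimal Python sketch of how one and the same alignment can give very different "identity" values depending on what you divide by. The function name `identity_fractions` and the toy sequences are made up for illustration; this is not how any particular tool defines identity, just the three denominators people commonly choose between.

```python
def identity_fractions(aligned_a: str, aligned_b: str) -> dict:
    """Return identity in [0, 1] under three different normalisations.

    Both inputs are rows of a pairwise alignment: equal length, '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    pairs = list(zip(aligned_a, aligned_b))
    # Columns where both sequences have a residue (no gap on either side).
    aligned_cols = sum(1 for x, y in pairs if x != "-" and y != "-")
    # Columns where the two residues are identical.
    matches = sum(1 for x, y in pairs if x == y and x != "-")

    # Ungapped lengths of the two original sequences.
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    shorter, longer = min(len_a, len_b), max(len_a, len_b)

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "per_aligned_columns": ratio(matches, aligned_cols),
        "per_shorter_sequence": ratio(matches, shorter),
        "per_longer_sequence": ratio(matches, longer),
    }


# Toy example: a 5-residue peptide that matches the start of a 16-residue one.
short_row = "MKTAY-----------"
long_row = "MKTAYIAKQRQISFVK"
scores = identity_fractions(short_row, long_row)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With these toy rows the identity is 1.0 over the aligned columns or the shorter sequence, but only 5/16 = 0.3125 over the longer one. For a pair like insulin vs titin the gap between those numbers would be far larger, so whichever definition is used for the matrix should be stated explicitly and, ideally, reported together with alignment coverage.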