My work involves comparing similar prokaryotic organisms, and since UniProt reduced its coverage by declaring many proteomes 'redundant,' I can no longer rely on UniRef90 or UniRef50 to help cluster proteins by sequence similarity. Apparently UniRef uses UniProt, not UniParc, as its domain.
It's important to note that UniProt makes the redundancy determination on a proteome-by-proteome basis, not protein by protein, so each 'redundant' proteome typically contains a handful of apparently novel proteins that cannot be found in UniProt. They are, however, in UniParc.
Currently my workaround is to cluster all the proteins that are in UniParc but not in UniProt separately. The majority cluster with existing UniRef sequences, but many do not. I'm using USEARCH from drive5. It works, but it is time-consuming and requires creating my own protein clusters.
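For anyone trying the same workaround, the first step (finding the sequences that are in UniParc but not in UniProt) can be sketched roughly as below. This is only a minimal illustration, not my actual pipeline: the function names `read_fasta` and `uniparc_only` are made up for the example, it assumes both sets are available as FASTA, and it compares sequences by exact identity, which is an assumption about how you want to define "not in UniProt".

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def uniparc_only(uniparc_lines: Iterable[str],
                 uniprot_lines: Iterable[str]) -> Dict[str, str]:
    """Return UniParc records whose exact sequence is absent from UniProt.

    Hypothetical helper: exact, case-insensitive sequence identity is an
    assumption made for this sketch.
    """
    uniprot_seqs = {seq.upper() for _, seq in read_fasta(uniprot_lines)}
    return {
        header: seq
        for header, seq in read_fasta(uniparc_lines)
        if seq.upper() not in uniprot_seqs
    }
```

Both arguments can be open file handles, so the files are read line by line. The set of UniProt sequences is held in memory, though, so for a full UniProt download you would probably want to store a hash of each sequence instead of the sequence itself. The resulting records are what then get written out and clustered.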
I'm curious if others are dealing with a similar problem, and if they have found any community solutions.
Thanks for your perspective.
For those of us interested in as complete a picture as possible of prokaryotic protein space, I think we now have to move to UniParc as the primary reference. It might not be your case, but a significant number of the proteins made 'redundant' actually have no clear homolog in UniProt, which underlines that the redundancy call is made genome by genome, not protein by protein.
Unfortunately, UniParc is not as well supported. Most of the reference mapping between UniParc and other databases goes through UniProt, so if a protein has been made 'redundant,' its UniParc sequence cannot be easily mapped. There is a 60GB XML (!!!) file, uniparc_match, that is of some use.
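A file that size cannot be loaded whole, but it can be streamed. Here is a rough sketch using Python's standard `xml.etree.ElementTree.iterparse`. The helper names `local_name` and `iter_elements` are made up for the example, and I am not asserting anything about the actual element or attribute names in uniparc_match: you pass in whichever tag you find when you look at the first few lines of the file.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_elements(path: str, wanted: str) -> Iterator[Dict[str, str]]:
    """Stream a large XML file, yielding the attributes of each element
    whose local tag name equals `wanted` (a name you supply; hypothetical).
    """
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == wanted:
            yield dict(elem.attrib)  # copy before the tree is cleared
            root.clear()  # drop already-parsed children to bound memory
```

Usage would be something like `for attrs in iter_elements("uniparc_match.xml", "some_tag"): ...`, where `some_tag` is a placeholder. Clearing the root after each yielded element is what keeps memory flat; without it the parsed tree keeps growing as the file is read.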