I want to download the human proteome and other completely sequenced proteomes in order to search for homologs. A UniProt search returns ~136,500 sequences for human:
http://www.uniprot.org/uniprot/?query=taxonomy%3A9606&sort=score
Searching these sequences with a query protein returns an implausibly large number of human homologs. CD-HIT filtering at 90% sequence identity does not reduce the number of hits much. The ~20,000 reviewed human entries, on the other hand, do not include all human proteins. I am wondering whether Ensembl would be a better choice.
I am aware of that. As far as I know, human has <30k protein sequences when alternative splicing is excluded. Ensembl seems to have ~100k human CDS.
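To make "excluding alternative splicing" concrete, here is a minimal sketch of collapsing a proteome FASTA file to one sequence per gene by keeping the longest entry. It assumes UniProt-style headers that carry a `GN=` gene-name field; entries without one are kept under their own identifier. The function names `read_fasta` and `longest_per_gene` are made up for illustration, and picking the longest sequence is only one possible heuristic, not an established standard.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Assumption: the gene name appears in the header as "GN=<name>".
GENE_RE = re.compile(r"\bGN=(\S+)")


def longest_per_gene(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Keep the longest sequence for each gene name.

    Records without a GN= field are keyed by the first token of the
    header, so they are never merged with anything else.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        match = GENE_RE.search(header)
        key = match.group(1) if match else header.split()[0]
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return best
```

Usage would look like `longest_per_gene(read_fasta(open("human.fasta")))`, where `human.fasta` is a placeholder file name. The number of keys in the result gives a rough per-gene count to compare against the <30k figure.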