While downloading the human proteome in FASTA format from the UniProt site, I noticed an option described as one protein sequence per gene (20,594 sequences). However, the protein count shown above it is 81,837, and this made me wonder. I need this file to interpret spectra obtained from a bottom-up proteomics experiment. Doesn't the one-sequence-per-gene file give a very poor representation of the proteins actually present? Additionally, how is it decided which sequence is displayed when a gene undergoes alternative splicing? Lastly, is there an alternative approach that searches the entire proteome rather than the gene-centered subset?
You could use the "unreviewed" human set (186K entries): https://www.uniprot.org/uniprotkb?facets=reviewed%3Afalse%2Cmodel_organism%3A9606&query=Human
Use the Protein Existence filters in the left column to trim this down (transcript level, etc.).
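If you have already downloaded a FASTA file, the same trimming can be approximated offline, since UniProt FASTA headers carry the protein existence level as a `PE=` field (1 = evidence at protein level, 2 = evidence at transcript level, 3 = inferred from homology, 4 = predicted, 5 = uncertain). The following is only a minimal sketch: the function names `read_fasta` and `filter_by_pe` are my own, and the records in `EXAMPLE_FASTA` are invented placeholders, not real UniProt entries.

```python
import re
from typing import Iterable, Iterator, Tuple

# UniProt FASTA headers end with fields such as "PE=1 SV=1".
PE_PATTERN = re.compile(r"\bPE=(\d)\b")

# Invented toy records, used only to illustrate the header layout.
EXAMPLE_FASTA = """\
>sp|X00001|TOY1_HUMAN Toy protein one OS=Homo sapiens OX=9606 GN=TOY1 PE=1 SV=1
MKTAYIAKQR
QISFVKSHFS
>tr|X00002|X00002_HUMAN Toy protein two OS=Homo sapiens OX=9606 GN=TOY2 PE=2 SV=1
MADEEKLPPG
>tr|X00003|X00003_HUMAN Toy protein three OS=Homo sapiens OX=9606 PE=4 SV=1
MSSGNAKIGH
"""


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_pe(
    records: Iterable[Tuple[str, str]], max_level: int = 2
) -> Iterator[Tuple[str, str]]:
    """Keep records whose PE level is <= max_level.

    Records without a PE field are dropped (a choice made for this sketch).
    """
    for header, seq in records:
        match = PE_PATTERN.search(header)
        if match and int(match.group(1)) <= max_level:
            yield header, seq


if __name__ == "__main__":
    # Keep entries with evidence at protein or transcript level.
    for header, seq in filter_by_pe(read_fasta(EXAMPLE_FASTA.splitlines())):
        print(f">{header}\n{seq}")
```

For a real file, pass an open file handle to `read_fasta` instead of the toy string. Whether to cut at transcript level or somewhere else depends on how large a search space your search engine and FDR control can tolerate.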