When I count the number of protein sequences in each file, I get an overcount because isoforms are included. Is there a way to exclude protein isoforms?
In the reference_proteome directory, we have attempted to automatically split all reference proteomes into 2 sets, the so-called "canonical" sequences ("one entry per gene") and "additional" sequences. Since this gene-centric procedure is fully automatic, the term "canonical" is used slightly differently than in the context of manual curation.
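To check how much isoforms inflate a count, a minimal sketch along these lines may help. It assumes UniProt-style FASTA headers (`>sp|ACCESSION|NAME ...` or `>tr|...`) and assumes isoform accessions carry a `-N` suffix such as `P12345-2`; the function name `count_entries` is made up for illustration. Collapsing on the base accession only merges isoforms of the same entry, so it is not a substitute for the "one entry per gene" canonical set described above.

```python
import re
import sys

# Assumed UniProt-style header: >db|ACCESSION|ENTRY_NAME description
HEADER_RE = re.compile(r"^>(?:sp|tr)\|([^|]+)\|")


def count_entries(fasta_lines):
    """Return (total_sequences, distinct_base_accessions) for FASTA lines."""
    total = 0
    base_accessions = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        total += 1
        match = HEADER_RE.match(line)
        if match:
            accession = match.group(1)
        else:
            # Not UniProt-style: fall back to the first token of the header
            parts = line[1:].split()
            accession = parts[0] if parts else ""
        # Strip an isoform suffix, e.g. P12345-2 -> P12345
        base_accessions.add(re.sub(r"-\d+$", "", accession))
    return total, len(base_accessions)


if __name__ == "__main__":
    # Usage: python count_entries.py file1.fasta file2.fasta ...
    for path in sys.argv[1:]:
        with open(path) as handle:
            total, distinct = count_entries(handle)
        print(f"{path}\t{total} sequences\t{distinct} distinct accessions")
```

Running this on the canonical file and the "additional" file separately should show whether the extra sequences you are counting come from the additional set.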
Do you have examples of organisms where you observe significant discrepancies?
See this post, in particular the third answer there, which quotes part of the FAQ:
Retrieving Uniprot Protein Isoform Sequences Programmatically?
A recent helpful post is this one:
isoforms and the definition of a protein
Have you looked at this page at UniProt?
This may also be useful.