Question

How can I avoid overcounting protein sequences in a database?

0


7.4 years ago

yayuciara • 0

I downloaded protein FASTA files from UniProt's database: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Bacteria/

When I count the number of protein sequences in each file, I get an overcount because isoforms are included. Is there a way to exclude protein isoforms?

Thanks.

protein uniprot fasta • 1.4k views

updated 7.4 years ago by Elisabeth Gasteiger ★ 2.4k • written 7.4 years ago by yayuciara • 0

1


See this post, in particular the third answer there (it covers part of the FAQ).

Retrieving Uniprot Protein Isoform Sequences Programmatically?

A recent helpful post is this one:

isoforms and the definition of a protein

7.4 years ago by natasha.sernova ★ 4.0k

1


Have you looked at this page at UniProt?
This may also be useful.

7.4 years ago by GenoMax 151k

score 0 · Answer 1 · 2018-01-08

Have you read the README file in the top-level directory? ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/README

In the reference_proteomes directory, we have attempted to automatically split all reference proteomes into two sets: the so-called "canonical" sequences ("one entry per gene") and the "additional" sequences. Since this gene-centric procedure is fully automatic, the term "canonical" is used slightly differently than in the context of manual curation.
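To check how the split affects your counts, something like the following may help. It is only a minimal sketch: it counts FASTA header lines and, separately, headers whose accession carries an isoform suffix such as `P12345-2`. The function names `count_entries` and `count_fasta_file` are made up for this example, and the header pattern assumes the usual UniProt `>sp|ACCESSION|NAME` / `>tr|ACCESSION|NAME` layout.

```python
import gzip
import re
from typing import Iterable, Tuple

# Assumption: UniProt-style headers, e.g. ">sp|P12345|NAME_ORG ..." or
# ">tr|A0A000|NAME_ORG ...". Isoform entries carry a "-<number>" suffix
# on the accession, e.g. ">sp|P12345-2|NAME_ORG Isoform 2 of ...".
ISOFORM_HEADER = re.compile(r"^>(?:sp|tr)\|[^|]+-\d+\|")


def count_entries(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_entries, isoform_entries) for an iterable of FASTA lines."""
    total = 0
    isoforms = 0
    for line in lines:
        if line.startswith(">"):  # every FASTA record starts with one header line
            total += 1
            if ISOFORM_HEADER.match(line):
                isoforms += 1
    return total, isoforms


def count_fasta_file(path: str) -> Tuple[int, int]:
    """Count entries in a plain or gzip-compressed FASTA file (hypothetical helper)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_entries(handle)
```

Running `count_fasta_file` on a proteome's main FASTA file and on its "additional" file (use the paths of your own downloads) should show whether the extra sequences come from the additional set. Subtracting the isoform count from the total gives the number of entries without an isoform suffix.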

Do you have examples of organisms where you observe significant discrepancies?