How can I avoid overcounting protein sequences in a database?
6.9 years ago
yayuciara • 0

I downloaded FASTA files of proteins from the UniProt database: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Bacteria/

When I count the number of protein sequences in each file, I get an overcount because protein isoforms are included. Is there a way to exclude the isoforms?

Thanks.

protein uniprot fasta • 1.3k views

See this post and the third answer there (about part of FAQ).

Retrieving Uniprot Protein Isoform Sequences Programmatically?

A recent helpful post is this one:

isoforms and the definition of a protein


Have you looked at this page at UniProt?
This also may be useful.

6.9 years ago

Have you read the README file in the top-level directory? ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/README

In the reference_proteome directory, we have attempted to automatically split all reference proteomes into 2 sets, the so-called "canonical" sequences ("one entry per gene") and "additional" sequences. Since this gene-centric procedure is fully automatic, the term "canonical" is used slightly differently than in the context of manual curation.

Do you have examples of organisms where you observe significant discrepancies?
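
To check the counts yourself, here is a minimal Python sketch that counts FASTA records per file and skips the "additional" set. It assumes the file naming I recall from the README (canonical sets as `*_<taxid>.fasta.gz`, extra sequences as `*_<taxid>_additional.fasta.gz`, plus `*_DNA.fasta.gz` files), so please verify that against the README for your release. It also assumes that isoform entries carry a `-<number>` suffix on the accession in the header (e.g. `sp|P12345-2|...`). The function names are made up for illustration.

```python
import gzip
import re
from pathlib import Path

# Assumption: isoform accessions look like "P12345-2" in the FASTA header,
# e.g. ">sp|P12345-2|NAME_ORGANISM Isoform 2 of ...".
ISOFORM_RE = re.compile(r"^>\w+\|[A-Z0-9]+-\d+\|")


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fasta(path):
    """Return (total records, records whose accession looks like an isoform)."""
    total = isoforms = 0
    with open_fasta(path) as handle:
        for line in handle:
            if line.startswith(">"):
                total += 1
                if ISOFORM_RE.match(line):
                    isoforms += 1
    return total, isoforms


def count_canonical(directory):
    """Count records per file, skipping the '_additional' and '_DNA' files.

    Assumption: those suffixes mark the non-canonical files, as described above.
    """
    counts = {}
    for path in sorted(Path(directory).rglob("*.fasta*")):
        if "_additional" in path.name or "_DNA" in path.name:
            continue
        counts[path.name] = count_fasta(path)[0]
    return counts
```

Calling `count_canonical("Bacteria")` on the downloaded directory returns a dict of file name to record count for the canonical files only, and `count_fasta(path)` on a single file shows how many of its headers look like isoforms.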
