I want to know about the species distribution of UniProt. How many human proteins does UniProt have? How many from other species? Is there any way to get this information for the whole UniProt protein database?
$ curl -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz" |\
gunzip -c | grep -E '^OS ' | cut -c6- | sort | uniq -c | sort -n
(...)
4127 Dictyostelium discoideum (Slime mold).
4185 Bacillus subtilis (strain 168).
4431 Escherichia coli (strain K12).
5097 Schizosaccharomyces pombe (strain 972 / ATCC 24843) (Fission yeast).
5983 Bos taurus (Bovine).
6621 Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast).
7875 Rattus norvegicus (Rat).
12545 Arabidopsis thaliana (Mouse-ear cress).
16642 Mus musculus (Mouse).
20273 Homo sapiens (Human).
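One caveat with the pipeline above: it counts OS lines rather than entries, so an organism name that wraps onto a second OS line would be split into two separate rows. A minimal sketch of a per-entry count follows. It assumes the flat-file layout in which OS lines start with `OS` plus three spaces and each entry ends with a `//` line. The function name `count_species` is made up for illustration.

```shell
# Hypothetical helper: read a UniProtKB flat file on stdin and count
# entries per organism, joining OS lines that wrap within one entry.
count_species() {
  awk '
    # Append each OS line (text starts at column 6) to the current name.
    /^OS   / { name = (name == "" ? "" : name " ") substr($0, 6) }
    # A "//" line closes the entry: record the name and reset it.
    /^\/\//  { if (name != "") count[name]++; name = "" }
    END      { for (n in count) printf "%d\t%s\n", count[n], n }
  ' | sort -k1,1n
}
```

It can be dropped into the same pipeline, e.g. `gunzip -c uniprot_sprot.dat.gz | count_species | tail`.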
A place to start is the UniProt statistics pages, which include details of the taxonomic distribution of the current UniProtKB entries.
UniProt's browse-by-taxonomy view is a way to explore the taxonomic distribution for all of UniProtKB. However, as UniProt uses the NCBI taxonomy, there are things in there that can surprise the unaware biologist. For example, Homo sapiens has two subspecies, neanderthalensis and ssp. Denisova (don't ask me why, it just is...). Another is that up to now there was basically a 1-to-1 correspondence between taxid and genome project for bacterial species/strains/subspecies, which is going to change soon.
UniProt uses a modified version of the NCBI Taxonomy (see UniProt Taxonomy).
The taxonomy identifiers (e.g. 9606 for Homo sapiens) should be consistent between the two taxonomies, so mapping between them should be simple.
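Because the identifiers are shared, counting entries by taxid instead of by organism name may be more robust than matching on the free-text name. A minimal sketch, assuming the flat file's OX lines have the form `OX   NCBI_TaxID=9606;` (possibly followed by evidence tags). The function name `count_taxids` is hypothetical.

```shell
# Hypothetical helper: read a UniProtKB flat file on stdin and count
# entries per NCBI taxonomy identifier taken from the OX line.
count_taxids() {
  # Keep only the numeric taxid from each OX line, then tally.
  sed -n 's/^OX   NCBI_TaxID=\([0-9][0-9]*\).*/\1/p' | sort | uniq -c | sort -n
}
```

Used as `gunzip -c uniprot_sprot.dat.gz | count_taxids | tail`, the row for 9606 would give the number of human entries.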
The handling of archaeological taxa is always a matter of conjecture, since any classification is based on limited information and is subject to change as more examples are discovered and examined. In the case of early humans, it is unclear what the evolutionary relationships are, since few examples are known (see Homo (genus)). For the moment, NCBI Taxonomy has placed Denisova and Neanderthal as subspecies, presumably because this makes certain types of searches and analysis easier (e.g. using Homo sapiens to provide a reference genome). As the sequence data for these species improves, this positioning will likely change to incorporate the new information.