Number of protein-coding genes per organism given the TaxID
8.2 years ago
bordin89 • 0

Hi,

I have a list of 16K proteomes retrieved from UniProtKB, and I am trying to find the number of protein-coding genes for each of them. What I would like to obtain is a table like this:

UniProt TaxID    Organism                       Protein numbers    Protein-coding genes
83333            Escherichia coli strain K12    -------            ---------

I can fetch this information for some of the proteomes using EFetch, but it only works if the TaxID is the same in UniProt and GenBank (like 9606 for human). E. coli, for example, is problematic: in the NCBI Taxonomy, TaxID 83333 groups all the E. coli K-12 strains, and the EFetch output has no genes associated with that TaxID. Parsing the EFetch output by organism name is painful, because UniProt and GenBank also differ slightly in the organism name (E. coli strain K12 in UniProt, E. coli str. K-12 in GenBank), and almost every proteome name has a slight, and each time different, variation.

Do you have any suggestions on how to achieve this?

Thank you.

Taxonomy NCBI Gene UniProt • 2.2k views
8.2 years ago
EagleEye 7.6k

Example to get all genes for Drosophila melanogaster:

http://rest.kegg.jp/list/dme

'dme' is the organism code. You can get the organism codes from the following link (second column):

http://rest.kegg.jp/list/organism

T00007  eco Escherichia coli K-12 MG1655    Prokaryotes;Bacteria;Gammaproteobacteria - Enterobacteria;Escherichia
T00068  ecj Escherichia coli K-12 W3110 Prokaryotes;Bacteria;Gammaproteobacteria - Enterobacteria;Escherichia
T00666  ecd Escherichia coli K-12 DH10B Prokaryotes;Bacteria;Gammaproteobacteria - Enterobacteria;Escherichia
T00913  ebw Escherichia coli BW2952 Prokaryotes;Bacteria;Gammaproteobacteria - Enterobacteria;Escherichia
T02541  ecok    Escherichia coli K-12 MDS42

Note: the gene list above contains only the genes that have KEGG functions.
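A minimal Python sketch of how the two KEGG endpoints above could be used together. It assumes the `/list/organism` output is tab-separated with the organism code in the second column and the name in the third (as in the excerpt above), and that `/list/<code>` returns one entry per line. The function names are illustrative, not part of any KEGG client library.

```python
import urllib.request

KEGG_BASE = "http://rest.kegg.jp"


def fetch(path):
    """Download one KEGG REST resource as text, e.g. fetch('list/dme')."""
    with urllib.request.urlopen(f"{KEGG_BASE}/{path}") as resp:
        return resp.read().decode("utf-8")


def parse_organism_list(text):
    """Map organism code -> organism name from /list/organism output.

    Assumes tab-separated columns: T number, code, name, (optional) lineage.
    """
    codes = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        codes[fields[1].strip()] = fields[2].strip()
    return codes


def codes_matching(codes, needle):
    """Return the sorted organism codes whose name contains `needle`."""
    needle = needle.lower()
    return sorted(c for c, name in codes.items() if needle in name.lower())


def count_entries(text):
    """Count the non-empty lines of a /list/<code> response."""
    return sum(1 for line in text.splitlines() if line.strip())
```

For example, `count_entries(fetch("list/dme"))` would give the number of KEGG gene entries for 'dme', and `codes_matching(parse_organism_list(fetch("list/organism")), "K-12")` would list the E. coli K-12 codes. The count covers every entry KEGG lists for the organism, so it is not guaranteed to equal the number of protein-coding genes.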

8.2 years ago
bordin89 • 0

Thanks for the reply, but that doesn't suit what I was looking for, since a lot of organisms are not present in KEGG. I guess the main issue is that UniProt usually groups an entire subgroup of organisms under one TaxID (like 83333 for E. coli K12) with one non-redundant proteome, whereas if you dump the NCBI Gene DB using a query like

"all[Filter] AND ("source_genomic"[properties] AND (gene_nucleotide_pos[filter] AND "genetype protein coding"[Properties]) AND alive[prop])"

in an EFetch script, it will recover all the E. coli genes under the TaxID of their individual strain or version, unfortunately not under 83333.

Can I modify the query somehow? Or is there a better way around this?

Thanks!
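One possible modification, sketched in Python under stated assumptions: Entrez accepts a `txid<N>[Organism:exp]` term that expands a taxonomy node to all of its descendants, so a query scoped to `txid83333[Organism:exp]` should also match genes filed under the individual strain TaxIDs. The sketch assumes ESearch with `rettype=count` and `retmode=json` returns the count under `esearchresult`; the function names are illustrative. Note that the count would aggregate every strain below the node, so it may well overcount relative to a single non-redundant proteome.

```python
import json
import urllib.parse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Filter taken from the query quoted above, minus the all[Filter] term.
PROTEIN_CODING = (
    '"source_genomic"[properties] AND gene_nucleotide_pos[filter] '
    'AND "genetype protein coding"[Properties] AND alive[prop]'
)


def gene_query(taxid, include_descendants=True):
    """Build a Gene query scoped to one TaxID.

    With include_descendants=True the taxonomy node is expanded to its
    whole subtree (strains, substrains); otherwise only the node itself.
    """
    scope = "exp" if include_descendants else "noexp"
    return f"txid{taxid}[Organism:{scope}] AND {PROTEIN_CODING}"


def esearch_count_url(taxid, include_descendants=True):
    """Return an ESearch URL that asks only for the number of hits."""
    params = {
        "db": "gene",
        "term": gene_query(taxid, include_descendants),
        "rettype": "count",
        "retmode": "json",
    }
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def parse_count(payload):
    """Extract the hit count from an ESearch JSON response (assumed layout)."""
    return int(json.loads(payload)["esearchresult"]["count"])
```

Opening `esearch_count_url(83333)` with any HTTP client and passing the body to `parse_count` would give one number per UniProt TaxID. Comparing it with the `include_descendants=False` count shows whether the genes sit on the node itself or on its children.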
