Question

How can I get sequencing data from NCBI with UniProt taxonomy identifiers? Automating with an API

10.7 years ago

wewolf • 0

Hello,

I am interested in downloading complete genomes to create a phylogenetic tree. NCBI has a whole toolkit called the Entrez Programming Utilities, or eutils for short. I found an excellent resource that walks through everything I would need to know, complete with a Python script to automate downloading these genomes from NCBI:

http://angus.readthedocs.org/en/2014/howe-ncbi.html#comment-1660809538

I have an "interesting-genomes.txt" file listing organisms I'd like to find complete genomes for. However, this list of IDs contains taxonomy identifiers from UniProt (e.g., http://www.uniprot.org/taxonomy/1000588).

For example, Streptococcus mitis bv. 2 str. SK95 has the taxonomy identifier 1000588 in UniProt. In NCBI, its ID is NC_013853.

My file contains a long list of taxonomy identifiers like 1000588, not NCBI IDs like NC_013853. Any ideas on how I can get around this?

Thank you!

sequencing genome biopython NCBI enterez • 4.4k views

updated 3.5 years ago by Ram • written 10.7 years ago by wewolf

The NCBI ID you provided looks like a contig accession number rather than a taxID. The taxid for your organism of interest is Streptococcus mitis bv. 2 str. SK95 (taxid:1000588), which is the same as in the UniProt database.

theobroma22 · Comment · 8.3 years ago

Ram · Answer 1 · 2014-10-30

This is tricky because there are usually many assemblies or genomes available for a given taxon. When you try to map a taxon ID back to genomes using, say, Batch Entrez, you end up retrieving a huge number of sequences associated with that taxon ID.
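To see how many records a single taxon ID pulls in, here is a minimal sketch that queries the E-utilities ESearch endpoint directly. The function names `build_esearch_url` and `count_records` are made up for illustration, and the choice of the `nuccore` database is an assumption; `count_records` needs network access and is not called here.

```python
import json
import urllib.parse
import urllib.request

# Public E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(taxid, db="nuccore", retmax=20):
    """Build an ESearch URL for records in `db` under the given taxon ID.

    Hypothetical helper: the `txid...[Organism:exp]` term also matches
    child taxa of the given ID.
    """
    params = {
        "db": db,
        "term": f"txid{taxid}[Organism:exp]",
        "retmode": "json",
        "retmax": retmax,
    }
    return EUTILS_ESEARCH + "?" + urllib.parse.urlencode(params)


def count_records(taxid, db="nuccore"):
    """Ask NCBI how many records match the taxon (requires network access)."""
    url = build_esearch_url(taxid, db=db, retmax=0)
    with urllib.request.urlopen(url) as response:
        result = json.load(response)
    return int(result["esearchresult"]["count"])
```

Calling `count_records(1000588)` would return the number of nucleotide records under that taxon; for a draft assembly this can be many contigs, which is the problem described above.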

A possible way to get around this is to stick to representative genomes / assemblies, which gives you a one-to-one correspondence between taxon ID and genome. In principle, this should work for almost all cases in your list, excluding those that were sequenced very recently or that have some strain-specific complications.

Download the following file, which includes species names and RefSeq complete genome IDs: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prok_representative_genomes.txt

Then you can write a simple script to parse this file and extract the corresponding accession number (e.g., NC_013853) for a given species (e.g., Streptococcus mitis).
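Such a script could look roughly like the sketch below. It assumes the file is tab-delimited with a header line starting with `#`, and that it has columns named `TaxID` and `Chromosome RefSeq`; check the header of the file you actually download and adjust the two column-name parameters if they differ. The function names are made up for illustration.

```python
import csv


def load_taxid_to_accessions(handle, taxid_col="TaxID",
                             acc_col="Chromosome RefSeq"):
    """Map each taxon ID to the list of accessions found in the report.

    The column names are assumptions about the report layout.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#").strip() for name in next(reader)]
    taxid_idx = header.index(taxid_col)
    acc_idx = header.index(acc_col)

    mapping = {}
    for row in reader:
        if len(row) <= max(taxid_idx, acc_idx):
            continue  # skip short or malformed rows
        # A cell may hold several comma-separated accessions, or "-" for none.
        accessions = [a.strip() for a in row[acc_idx].split(",")
                      if a.strip() and a.strip() != "-"]
        if accessions:
            mapping.setdefault(row[taxid_idx].strip(), []).extend(accessions)
    return mapping


def lookup_taxids(taxids, mapping):
    """Split taxon IDs into those found in the mapping and those missing."""
    found = {t: mapping[t] for t in taxids if t in mapping}
    missing = [t for t in taxids if t not in mapping]
    return found, missing
```

To use it, open the downloaded report with `open("prok_representative_genomes.txt", newline="")`, pass the handle to `load_taxid_to_accessions`, then call `lookup_taxids` with the IDs read from "interesting-genomes.txt" (one per line). Strain-level IDs may end up in `missing` if the representative genome is listed under a different taxon ID, so those would need to be resolved by hand or by species name.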