From speaking with a few other pros, this was the solution in the end (though only very rough at the mo):
Use the ete3 toolkit to get a list of IDs:
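from ete3 import NCBITaxa
import sys

# Taxon name (or ID) to expand, given as the first command-line argument
taxon_name = sys.argv[1]

ncbi = NCBITaxa()
# Refreshes the local copy of the NCBI taxonomy (this downloads it on every run)
ncbi.update_taxonomy_database()

# All descendant taxa of the requested taxon
ebact = ncbi.get_descendant_taxa(taxon_name)

# Write one TaxID per line
with open('./taxids', 'w') as ofh:
    for i in ebact:
        ofh.write("%s\n" % i)
# At this point, one could import ncbi-genome-download as a python method and continue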
This gave me a list of IDs (though it includes ALL descendant taxa, even ones without complete genomes etc.).
I passed these to the latest version of ncbi-genome-download, which accepts a --taxid 12345,65890 format for specifying the IDs.
So I just ran:
for file in *; do
    python ~/bin/ncbi-genome-download/ncbi-genome-download-runner.py -l complete -v -p 10 --taxid "$(paste -s -d ',' "$file")" bacteria
done
I had to split my taxids file into many smaller files and run this over each of them, as there is a limit to how many args can be passed to --taxid at once.
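The splitting step could be sketched in Python along these lines. This is only a rough illustration, not what I actually ran: the helper name chunk_taxids and the chunk size of 500 are my own assumptions, and the real limit for --taxid depends on your system, so adjust accordingly.

```python
def chunk_taxids(taxids, chunk_size):
    """Yield comma-joined strings holding at most chunk_size IDs each.

    Hypothetical helper: each yielded string is in the 12345,65890
    format described above for --taxid.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(taxids), chunk_size):
        yield ",".join(str(t) for t in taxids[start:start + chunk_size])


def read_taxids(path):
    """Read one TaxID per line, skipping blank lines."""
    with open(path) as ifh:
        return [line.strip() for line in ifh if line.strip()]


if __name__ == "__main__":
    import os

    # './taxids' is the file written by the ete3 snippet above;
    # 500 is an arbitrary, assumed chunk size.
    if os.path.exists("./taxids"):
        for taxid_arg in chunk_taxids(read_taxids("./taxids"), 500):
            print(taxid_arg)
```

Each printed line can then be handed to --taxid in a separate ncbi-genome-download run, instead of splitting the file by hand.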
EDIT Sept 2018:
I contributed a script to the ncbi-genome-download repo to make getting the TaxIDs nice and easy. It uses the approach above, but there’s no need to rewrite it for oneself now.
Wow, thanks for this answer. I've just learned about this useful ncbi.get_descendant_taxa() function. Funny, I use the same variable name ofh for an output file and I always read it as output file handle.

That’s exactly the way I intend it! It’s quite possible I’ve picked up the habit from some of your answers!