Good morning, folks. I hope you're all well. In short, I wanted to know whether there is a package to download 2333 genomes from the NCBI database.
Thank you all.
For Clostridium difficile, you can use either the NCBI Datasets command-line application or the API. There is also a Python library to parse the assembly descriptions and navigate the directory hierarchy, described in more detail here, as well as Jupyter Notebooks that can be run on Binder.
For the purpose of this post, I will use the command-line application. Assuming you have followed the instructions from this page and downloaded the application, run the commands shown below:
## download assembly descriptors and make a list of assembly accessions
## NCBI Taxonomy ID for Clostridium difficile is 1496
$ datasets assembly_descriptors tax_id 1496 -l 'ALL' | python -m json.tool > cdiff.json
## make a list of GCA/GCF accessions
$ grep -o 'GC[AF]_[0-9]*\.[0-9]*' cdiff.json | sort -u > cdiff.accs
## download data
$ datasets download assembly -i cdiff.accs
This will download a file, ncbi_dataset.zip, which will have the genome sequences for more than 3000 Clostridium difficile assemblies in FASTA format. There are additional options in the datasets assembly_descriptors command to restrict the list to RefSeq assemblies only, and additional file type options in the datasets download command, that may be of interest to you. I suggest you take a quick look at the documentation and the help files.
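To make the accession-extraction step concrete, here is a small sketch of the same grep pattern run on a made-up snippet instead of the real cdiff.json. The JSON field names and accessions are invented for illustration only. The last step filters on the GCF_ prefix (RefSeq accessions start with GCF_, GenBank ones with GCA_), which is one simple way to trim the list and not necessarily what the built-in RefSeq option does.

```shell
# Hypothetical excerpt standing in for cdiff.json; field names and accessions are made up.
sample_json='{"assembly_accession": "GCF_000000001.1", "paired": "GCA_000000001.1"}
{"assembly_accession": "GCF_000000001.1"}
{"assembly_accession": "GCA_000000002.2"}'

# Same pattern as in the post: GCA_ or GCF_, digits, a dot, then the version digits.
# sort -u removes the duplicate accession.
accs=$(printf '%s\n' "$sample_json" | grep -o 'GC[AF]_[0-9]*\.[0-9]*' | sort -u)
printf '%s\n' "$accs"

# Optional: keep only RefSeq (GCF_) accessions.
refseq_accs=$(printf '%s\n' "$accs" | grep '^GCF_')
printf '%s\n' "$refseq_accs"
```

On the real data you would redirect the filtered list to a file and pass that file to the download command in place of cdiff.accs.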
Thanks a lot, vkkodali. How can I change the .accs extension to .zip to find the different genomes in .fasta format? After that, I will run prokka for the annotation in order to retrieve the different 16S sequences and build the phylogenetic tree. If you have any pointers, please explain them to me. Thanks again.
The file cdiff.accs is just a simple text file with a list of the NCBI assembly accessions whose genomes you'd like to download. You can view it in any text editor such as Notepad. The downloaded data are in the ncbi_dataset.zip file. The query returns nearly 4000 assemblies, so that is the number of FASTA files you have in the ncbi_dataset.zip archive. For prokka, do you need individual FASTA files or a single multi-FASTA file with all of the genomes? On a Unix machine, if you want the former, you can use something like unzip -d cdiff_fasta/ ncbi_dataset.zip 'ncbi_dataset/data/GC*/GC*.fna' to extract individual FASTA files to the cdiff_fasta directory (you may have to create the directory first). If you want the latter, then you can use unzip -p ncbi_dataset.zip 'ncbi_dataset/data/GC*/GC*.fna' > cdiff.fasta
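If you go the single-file route, a quick sanity check is to count the FASTA records in the concatenated file. The sketch below uses a tiny made-up file in place of cdiff.fasta; on the real file the count will be the total number of contigs across all assemblies, not the number of assemblies.

```shell
# Tiny stand-in for cdiff.fasta; the records are made up.
tmp_fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n' > "$tmp_fasta"

# Every FASTA record starts with a '>' header line, so counting those lines counts records.
n_seqs=$(grep -c '^>' "$tmp_fasta")
echo "$n_seqs"
```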
I don't know how prokka works, so I cannot go into much detail. First, unzip the ncbi_dataset.zip archive. This will create a few files and a directory called ncbi_dataset. You can loop through all of the FASTA files by doing something like for fa_file in ncbi_dataset/data/GC*/*.fna ; do ...
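As a rough sketch of what such a loop could look like, the example below builds a small made-up directory tree in place of the unzipped archive and writes one prokka command per genome to a file instead of running it. It assumes each directory under data/ is named after its assembly accession, as the GC* glob suggests, and the annot/ output location is just an illustrative choice. Please check prokka's own help for the options that suit your analysis.

```shell
# Stand-in directory tree mimicking the unzipped archive; accessions and sequences are made up.
workdir=$(mktemp -d)
mkdir -p "$workdir/ncbi_dataset/data/GCF_000000001.1" "$workdir/ncbi_dataset/data/GCA_000000002.2"
printf '>contig1\nACGT\n' > "$workdir/ncbi_dataset/data/GCF_000000001.1/GCF_000000001.1_genomic.fna"
printf '>contig1\nTTGA\n' > "$workdir/ncbi_dataset/data/GCA_000000002.2/GCA_000000002.2_genomic.fna"

cmds="$workdir/prokka_cmds.txt"
: > "$cmds"   # start with an empty command list

for fa_file in "$workdir"/ncbi_dataset/data/GC*/*.fna; do
    # Assumption: the parent directory name is the assembly accession.
    acc=$(basename "$(dirname "$fa_file")")
    # Dry run: record the command instead of executing it.
    echo "prokka --outdir annot/$acc --prefix $acc $fa_file" >> "$cmds"
done

cat "$cmds"
```

Once the listed commands look right, you could replace the echo line with the command itself, or run the generated file with sh.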
You can use the ncbi-genome-download tool.
NCBI Datasets is a new NCBI resource designed specifically to address tasks like these. If you can describe a little what it is you are trying to download, I'd be able to help you more. What kind of genomes are these? Which file types are you interested in? And, what is your starting point -- a list of NCBI assembly accessions, species names, etc?
First, thank you for the help. I'm working on the genome of a bacterium (Clostridium difficile). I need all the genomes that have been deposited in NCBI to compare them with the ones we have. I want to download the assembled genomes.