Hi all,
I have a list of accession numbers (GCF/GCA) and I want to download their complete genomes from NCBI in FASTA format.
I have seen many recommendations to use the NCBI datasets and dataformat tools. Is that really the best option?
As far as I understand, I need to use datasets with:
datasets download genome accession {acc} --exclude-gff3 --exclude-protein --exclude-rna --filename outdir/{acc}.zip
to get a zipped folder with a lot of irrelevant data inside. Is there another tool, maybe in Python, that I can use to download the FASTA directly from an accession number?
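For reference, if the zip from datasets is kept as the source, pulling out just the sequence file takes only a few lines of Python. This is a minimal sketch, not an official NCBI method: extract_fasta is a hypothetical helper name, and it assumes the genome sequences sit in the archive as files ending in .fna.

```python
import zipfile
from pathlib import Path


def extract_fasta(zip_path, out_dir):
    """Copy only the genomic FASTA (*.fna) members of a datasets zip into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Assumption: genome sequences are stored as .fna files in the package.
            if member.endswith(".fna"):
                # Drop the nested folder structure and keep only the file name.
                target = out_dir / Path(member).name
                target.write_bytes(archive.read(member))
                written.append(target)
    return written
```

Calling extract_fasta("outdir/{acc}.zip", "outdir/fasta") for each accession would leave only the .fna files, after which the zip can be deleted.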
I also want to download the metadata. If I use:
datasets summary genome accession {acc} > outdir/{acc}.json
I will also need to convert it with:
dataformat tsv genome --input-file outdir/{acc}.json > outdir/{acc}.tsv
Am I correct in thinking that there should be a way to do this with less conversion and less deleting of useless data (like with the sratoolkit)?
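For reference, the JSON-to-TSV step could also be done in plain Python instead of a second tool. The sketch below is only an illustration: flatten and summary_to_tsv are hypothetical helper names, and it assumes the summary JSON lists its records under a top-level "reports" key. Nested fields become dotted column names.

```python
import csv
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # Scalar value: drop the trailing dot from the accumulated prefix.
        flat[prefix[:-1]] = obj
    return flat


def summary_to_tsv(json_path, tsv_path):
    """Write one TSV row per record found in the summary JSON."""
    with open(json_path) as handle:
        summary = json.load(handle)
    # Assumption: records are listed under a top-level "reports" key.
    rows = [flatten(report) for report in summary.get("reports", [])]
    # Use the union of all keys so records with missing fields still fit.
    columns = sorted({key for row in rows for key in row})
    with open(tsv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t", restval="")
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

This keeps every field, so the resulting table can be wide; selecting a few columns afterwards is probably easier than deciding up front which fields to keep.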
Any help will be much appreciated!
You can try the Bio.Entrez package, which gives you access to the Entrez utilities that are traditionally invoked from the command line. Given an accession number $acc, the command to retrieve the corresponding FASTA file is:
efetch -db nuccore -format fasta -id $acc
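The same request can be sketched in Python using only the standard library, by calling the E-utilities efetch endpoint that the efetch command and Bio.Entrez both wrap. Treat this as a rough illustration rather than a definitive recipe: build_efetch_url and fetch_fasta are hypothetical helper names, and NC_000913.3 is just an example nucleotide accession. As far as I know, nuccore expects nucleotide accessions such as NC_... rather than GCF/GCA assembly accessions, so an assembly accession would probably have to be resolved to its sequence accessions first.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(acc, db="nuccore"):
    """Build the efetch URL that requests a record as plain-text FASTA."""
    params = {"db": db, "id": acc, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(acc, out_path, db="nuccore"):
    """Download the FASTA record for acc and save it to out_path (needs network)."""
    with urlopen(build_efetch_url(acc, db)) as response:
        data = response.read()
    with open(out_path, "wb") as handle:
        handle.write(data)
    return out_path


if __name__ == "__main__":
    # Example accession only; replace with your own nucleotide accession.
    fetch_fasta("NC_000913.3", "NC_000913.3.fasta")
```

For a long list of accessions, NCBI asks clients to limit their request rate, so adding a short pause between calls (or an API key) is advisable.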