Hello guys,
I need to create a local BLASTN database containing the RefSeq FASTA sequences of viruses. I tried downloading the FASTA sequences from https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus in the browser, but it takes too long and the connection fails (I am in China). I also tried the ncbi-genome-download application (ncbi-genome-download --formats fasta viral), but it is still running after 24 hours. Can someone please help me? Any suggestions? Thank you very much.
You may be tempted to use more than 12 threads. I caution you against it, even if you have more threads available: NCBI will throttle your connections if your IP address tries to download too many files in a short period of time. Sometimes 10 threads will get the job done sooner than 20.
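As a small illustration of the advice above, here is a sketch that caps a requested thread count before it is handed to a downloader. The function name clamp_threads and the MAX_THREADS variable are made up for this example; the ceiling of 12 comes from the advice above, and the genome_updater.sh command in the comment is the one used later in this thread.

```shell
# Ceiling taken from the advice above; adjust if NCBI throttles you anyway.
MAX_THREADS=12

# Print the requested thread count, capped at MAX_THREADS.
clamp_threads() {
    local requested=$1
    if [ "$requested" -gt "$MAX_THREADS" ]; then
        echo "$MAX_THREADS"
    else
        echo "$requested"
    fi
}

THREADS=$(clamp_threads 20)
echo "Using $THREADS threads"

# Example use (not run here; needs genome_updater.sh and network access):
# ./genome_updater.sh -d "refseq" -g "viral" -c "all" -f "genomic.fna.gz" -o "all_virus_genomes" -t "$THREADS"
```

If downloads still stall, lowering the ceiling further (for example to 4, as in the command used later in this thread) may be worth trying.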
I recommend genome_updater too; it can re-download failed files. I've used it to download virus genomes.

Here are some suggestions that may help. Yes, we have run into unstable network connections many times.

unexpected EOF error

Because some files can be corrupted during download, we recommend checking sequence file integrity using seqkit (gzip -t failed for some files in my tests).

List corrupted files.

Delete these files.

Redownload these files. URLs can be found in files like 2021-09-30_13-32-30_url_downloaded.txt; you can extract them using grep -f failed.txt *url_downloaded.txt or something similar, and batch redownload them using parallel.
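The three steps above (list, delete, redownload) could be sketched roughly as follows. This is not the exact set of commands from the answer: the function names list_corrupted and failed_urls are invented here, the fallback to gzip -t when seqkit is not installed is an assumption of this sketch (the answer found gzip -t less reliable), and the URL log is assumed to hold one URL per line ending in the downloaded file name.

```shell
# Print the path of every *.gz file under DIR that fails an integrity check.
# Uses seqkit when it is installed; otherwise falls back to gzip -t
# (an assumption of this sketch; the answer found gzip -t less reliable).
list_corrupted() {
    local dir=$1 f
    find "$dir" -type f -name '*.gz' | sort | while IFS= read -r f; do
        if command -v seqkit >/dev/null 2>&1; then
            # Reading the whole file forces a full parse; errors mean a bad file.
            seqkit seq -w 0 "$f" >/dev/null 2>&1 || echo "$f"
        else
            gzip -t "$f" 2>/dev/null || echo "$f"
        fi
    done
}

# Print the lines of a *url_downloaded.txt log that mention a failed file.
# Matching is done on base names as fixed strings, because the log holds
# URLs while the failed list holds local paths.
failed_urls() {
    local failed_list=$1 url_log=$2
    sed 's|.*/||' "$failed_list" | grep -F -f - "$url_log"
}

# Example use (not run here; paths and the wget call are illustrative):
# list_corrupted all_virus_genomes > failed.txt
# while IFS= read -r f; do rm -f -- "$f"; done < failed.txt
# failed_urls failed.txt 2021-09-30_13-32-30_url_downloaded.txt > redo_urls.txt
# parallel -j 4 wget -q {} < redo_urls.txt
```

Files fetched this way land in the current directory, so they would still need to be moved back to wherever genome_updater keeps its downloads.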
Thank you very much for your help. I am trying to use genome_updater.sh (as suggested), but I got this esoteric message:
(base) emastriani@bilbo-06:/storage/Genome-Downloader$ ./genome_updater.sh -d "refseq" -g "viral" -c "all" -f "genomic.fna.gz" -o "all_virus_genomes" -t 4
parallel not found
(base) emastriani@bilbo-06:/storage/Genome-Downloader$
Do you know what it means? It looks strange to me.
It was my fault. genome_updater.sh requires the "parallel" package to be installed. On Ubuntu, running "apt install parallel" solved the problem, and genome_updater.sh is now working:
(base) emastriani@bilbo-06:/storage/NCBI-ViralRefSeq$ ../Genome-Downloader/genome_updater.sh -d "refseq" -g "viral" -c "all" -f "genomic.fna.gz" -o "all_virus_genomes" -t 4
┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐    ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐
│ ┬├┤ ││││ ││││├┤     │ │├─┘ ││├─┤ │ ├┤ ├┬┘
└─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴  ─┴┘┴ ┴ ┴ └─┘┴└─
Mode: NEW - DOWNLOAD
Working directory: /remote-storage/NCBI-ViralRefSeq/all_virus_genomes/
Downloading assembly summary [2021-12-02_11-37-59]
Thank you
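For anyone else hitting the "parallel not found" message, a quick pre-flight check before launching a long download might look like the sketch below. The function name missing_tools is invented for this example, and the tool list (parallel, wget, gzip) is only a guess for illustration; the only requirement confirmed in this thread is parallel.

```shell
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The tool list is an assumption; only "parallel" is confirmed in this thread.
missing=$(missing_tools parallel wget gzip)
if [ -n "$missing" ]; then
    echo "Missing tools, install them before running genome_updater.sh:" >&2
    echo "$missing" >&2
fi
```

On Ubuntu, the missing parallel tool was installed with apt install parallel, as described above.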