Question

Creating BLASTN database for viruses

3.0 years ago

emiliomastriani ▴ 40

Hello guys, I need to create a local BLASTN database containing the RefSeq FASTA sequences of viruses. I would like to download the FASTA sequences from https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus using the browser, but the download takes too long and the connection fails (I am in China). I also tried ncbi-genome-download (ncbi-genome-download --formats fasta viral), but it is still running after 24 hours. Can someone help me, please? Any suggestions? Thank you very much.

VSSI BLASTN database NCBI virus • 1.5k views

ADD COMMENT • link 3.0 years ago by emiliomastriani ▴ 40

score 1 · Answer 1 · 2021-12-01

3.0 years ago

Mensur Dlakic ★ 28k

What you are trying to do can't be done quickly, so your ncbi-genome-download job might still finish as intended.

I suggest you try genome_updater. This command will download all viral genomes in RefSeq:

genome_updater.sh -d "refseq" -g "viral" -c "all" -f "genomic.fna.gz" -o "all_virus_genomes" -t 12

You may be tempted to use more than 12 threads. I caution against it, even if you have more threads available: NCBI will throttle your connections if your IP address tries to download too many files in a short period of time. Sometimes using 10 threads will get the job done sooner than using 20.
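Once the download finishes, the genomes still have to be turned into a BLAST database. The following is only a minimal sketch of that step, not part of genome_updater: the function names `merge_genomes` and `build_blastn_db` and all paths are made up for illustration, and it assumes GNU find/sort/xargs and that BLAST+ (`makeblastdb`) is installed.

```shell
#!/usr/bin/env bash
# Sketch: merge downloaded genomes and build a nucleotide BLAST database.
# Function names and paths are illustrative assumptions.

# Concatenate every *genomic.fna.gz found under a directory into one FASTA file.
merge_genomes() {
    local genome_dir=$1 merged_fasta=$2
    find "$genome_dir" -type f -name '*genomic.fna.gz' -print0 \
        | sort -z \
        | xargs -0 -r gzip -dc > "$merged_fasta"
}

# Build the database; -dbtype nucl makes it searchable with blastn.
build_blastn_db() {
    local merged_fasta=$1 db_prefix=$2
    makeblastdb -in "$merged_fasta" -dbtype nucl \
        -title "RefSeq viral genomes" -out "$db_prefix"
}

# Example (placeholder paths):
#   merge_genomes all_virus_genomes viral_refseq.fna
#   build_blastn_db viral_refseq.fna viral_refseq
#   blastn -db viral_refseq -query my_query.fna -outfmt 6
```

Merging into a single FASTA first keeps the `makeblastdb` call simple; run the integrity checks discussed elsewhere in this thread before merging, since a truncated .gz file would make `gzip -dc` fail partway through.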

ADD COMMENT • link 3.0 years ago by Mensur Dlakic ★ 28k

I recommend `genome_updater` too; it can re-download failed files. I've used it to download virus genomes.

Here are some suggestions that may help. We have run into unstable network connections many times.

unexpected EOF error

Because some files can become corrupted during download, we recommend checking sequence file integrity with seqkit (gzip -t failed for some files in my tests).

List corrupted files:

    # corrupted files
    find "$genomes" -name "*.gz" \
        | rush 'seqkit seq -w 0 {} > /dev/null; if [ $? -ne 0 ]; then echo {}; fi' \
        > failed.txt

    # empty files
    find "$genomes" -name "*.gz" -size 0 >> failed.txt

Delete these files:
```
 cat failed.txt | rush '/bin/rm {}'
```
Re-download these files. The URLs are recorded in files like 2021-09-30_13-32-30_url_downloaded.txt; you can extract them with grep -f failed.txt *url_downloaded.txt or something similar, and batch re-download them with parallel.
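A rough sketch of that last step is below. It is an assumption-laden illustration, not something shipped with genome_updater: the function names `failed_urls` and `redownload` are hypothetical, and it assumes each line of the *url_downloaded.txt files contains a full URL ending in the downloaded file's name, and that GNU grep, GNU parallel, and wget are available.

```shell
#!/usr/bin/env bash
# Sketch: map the paths in failed.txt back to URLs and fetch them again.
# Function names and the URL-file layout are assumptions.

# Print the URLs whose file name matches a path listed in the failed list.
# Usage: failed_urls failed.txt *url_downloaded.txt
failed_urls() {
    local failed_list=$1
    shift
    # Reduce each failed path to its file name, then match it as a fixed string.
    sed 's#.*/##' "$failed_list" \
        | grep -h -F -f - "$@" \
        | grep -o -E '(https?|ftp)://[^[:space:]]+' \
        | sort -u
}

# Re-download a list of URLs with a few parallel jobs and retries.
# Usage: redownload urls.txt output_dir
redownload() {
    local url_list=$1 out_dir=$2
    parallel -j 4 wget -q --tries=3 -P "$out_dir" {} < "$url_list"
}

# Example (placeholder paths):
#   failed_urls failed.txt *url_downloaded.txt > retry_urls.txt
#   redownload retry_urls.txt "$genomes"
```

Matching on the file name rather than the full path is a deliberate choice here, because failed.txt holds local paths while the URL files hold remote locations. The job count is kept low for the throttling reasons mentioned earlier in the thread.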

ADD REPLY • link 3.0 years ago by shenwei356 8.7k

Thank you very much for your help. I am trying to use genome_updater.sh (as suggested), but I got this cryptic message:

    (base) emastriani@bilbo-06:/storage/Genome-Downloader$ ./genome_updater.sh -d "refseq" -g "viral" -c "all" -f "genomic.fna.gz" -o "all_virus_genomes" -t 4
    parallel not found
    (base) emastriani@bilbo-06:/storage/Genome-Downloader$

Do you know what it means? It looks strange to me.

ADD REPLY • link 3.0 years ago by emiliomastriani ▴ 40

It was my fault: genome_updater.sh requires the "parallel" package to be installed. On Ubuntu, "apt install parallel" solved the problem. genome_updater.sh is now working:

    (base) emastriani@bilbo-06:/storage/NCBI-ViralRefSeq$ ../Genome-Downloader/genome_updater.sh -d "refseq" -g "viral" -c "all" -f "genomic.fna.gz" -o "all_virus_genomes" -t 4

    ┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐    ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐
    │ ┬├┤ ││││ ││││├┤     │ │├─┘ ││├─┤ │ ├┤ ├┬┘
    └─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴  ─┴┘┴ ┴ ┴ └─┘┴└─
                                     v0.2.5

    Mode: NEW - DOWNLOAD

    Working directory: /remote-storage/NCBI-ViralRefSeq/all_virus_genomes/

    Downloading assembly summary [2021-12-02_11-37-59]

    12856 entries available
    6 entries removed with filters: RefSeq category=all, Assembly level=all, Version status=latest, valid URLs
    12850 entries to be downloaded
    Downloading 12850 files with 4 threads
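
For anyone else who hits the "parallel not found" message, a small pre-flight check along these lines may help before starting a long download. It is just a sketch: `missing_commands` is a hypothetical helper name, and the list of tools to check is an assumption rather than genome_updater's documented dependency list.

```shell
#!/usr/bin/env bash
# Sketch: report required tools that are not on PATH.
# The helper name and the tool list in the example are assumptions.

# Print each missing command; return non-zero if at least one is missing.
missing_commands() {
    local status=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" > /dev/null 2>&1; then
            echo "$cmd"
            status=1
        fi
    done
    return "$status"
}

# Example:
#   missing_commands parallel wget || echo "install the tools listed above first"
```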

Thank you

ADD REPLY • link 3.0 years ago by emiliomastriani ▴ 40