Creating BLASTN database for viruses
3.0 years ago

Hello everyone, I need to create a local BLASTN database containing the RefSeq FASTA sequences of viruses. I would like to download the FASTA sequences from https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus using the browser, but it takes too long and the connection fails (I am in China). I also tried the ncbi-genome-download application (ncbi-genome-download --formats fasta viral), but it is still running after 24 hours. Can someone help me? Any suggestions? Thank you very much.

VSSI BLASTN database NCBI virus • 1.5k views
3.0 years ago
Mensur Dlakic ★ 28k

What you are trying to do can't be done quickly, so your ncbi-genome-download job might still finish as intended.

I suggest you try genome_updater. This command will download all viral genomes in RefSeq:

genome_updater.sh -d "refseq" -g "viral" -c "all" -f "genomic.fna.gz" -o "all_virus_genomes" -t 12

You may be tempted to use more than 12 threads. I caution against it, even if you have more threads available: NCBI will throttle your connections if your IP address tries to download too many files in a short period of time. Sometimes 10 threads will get the job done sooner than 20.
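Once the download finishes, the remaining step toward the BLASTN database asked about in the question could look roughly like the sketch below. It is only an illustration under assumptions: the function names merge_genomes and build_blast_db, the output names, and the database title are made up here, and the script assumes the downloaded files end in .fna.gz somewhere under the output directory and that makeblastdb from NCBI BLAST+ is installed.

```shell
# Sketch: merge the downloaded genomes into one FASTA file and build a
# nucleotide BLAST database from it. Function and file names are hypothetical.

# merge_genomes DIR OUT.fna
# Decompress every *.fna.gz found anywhere under DIR into a single FASTA file.
merge_genomes() {
    find "$1" -type f -name '*.fna.gz' -exec gzip -dc {} + > "$2"
}

# build_blast_db FASTA DBNAME
# Thin wrapper around makeblastdb (NCBI BLAST+); fails early if it is missing.
build_blast_db() {
    if ! command -v makeblastdb >/dev/null 2>&1; then
        echo "makeblastdb not found; install NCBI BLAST+ first" >&2
        return 1
    fi
    makeblastdb -in "$1" -dbtype nucl -out "$2" -title "RefSeq viral genomes"
}

# Example usage (not executed here):
#   merge_genomes all_virus_genomes viral_refseq.fna
#   build_blast_db viral_refseq.fna viral_refseq
#   blastn -query query.fa -db viral_refseq -outfmt 6
```

Merging first keeps the database build to a single makeblastdb call; it is worth running the integrity checks on the .gz files before this step, because a truncated archive would make the merge incomplete.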


I recommend genome_updater too; it can re-download failed files. I've used it to download virus genomes.

Here are some suggestions that may help. We have run into unstable network connections many times ourselves.

unexpected EOF error

Because some files can be corrupted during download, we recommend checking sequence file integrity using seqkit (gzip -t failed for some files in my tests).

  1. List corrupted files

     # corrupted files ($genomes is the directory holding the downloaded genomes)
     find "$genomes" -name "*.gz" \
         | rush 'seqkit seq -w 0 {} > /dev/null; if [ $? -ne 0 ]; then echo {}; fi' \
         > failed.txt
    
     # empty files
     find "$genomes" -name "*.gz" -size 0 >> failed.txt
    
  2. Delete these files:

     cat failed.txt | rush '/bin/rm {}'
    
  3. Redownload these files. The URLs are recorded in files such as 2021-09-30_13-32-30_url_downloaded.txt; you can extract them with grep -f failed.txt *url_downloaded.txt or something similar, and batch redownload them using parallel.
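Step 3 can be sketched as follows. This is a rough illustration, not part of genome_updater: the function names failed_urls and redownload are hypothetical, and the script assumes that failed.txt holds local paths (as produced in step 1) and that each matching line of the *url_downloaded.txt logs contains a URL ending in the same file name. Because failed.txt holds local paths rather than URLs, the paths are reduced to file names before matching.

```shell
# Sketch: recover the download URLs of corrupted files and fetch them again.
# Function names are hypothetical; the log format is an assumption.

# failed_urls FAILED_LIST URL_LOG...
# Print the URLs from the logs whose file name appears in FAILED_LIST.
failed_urls() {
    names=$(mktemp)
    # Keep only the file name of each failed path, prefixed with "/" so that
    # it can only match a complete final path component of a URL.
    awk -F/ '$NF != "" { print "/" $NF }' "$1" | sort -u > "$names"
    shift
    # Fixed-string match against the logs, then keep just the URL itself.
    grep -h -F -f "$names" "$@" | grep -Eo '(https?|ftp)://[^[:space:]]+'
    rm -f "$names"
}

# redownload URL_FILE TARGET_DIR [JOBS]
# Fetch every URL listed in URL_FILE with GNU parallel and wget.
redownload() {
    parallel -j "${3:-4}" wget -q -c -P "$2" {} < "$1"
}

# Example usage (not executed here):
#   failed_urls failed.txt *url_downloaded.txt > redo_urls.txt
#   redownload redo_urls.txt "$genomes" 4
```

Keeping the number of jobs low here follows the same reasoning as the thread limit discussed above. After the second pass, rerunning the seqkit check confirms whether any files are still broken.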


Thank you very much for your help. I am trying to use genome_updater.sh (as suggested), but I got this esoteric message:

(base) emastriani@bilbo-06:/storage/Genome-Downloader$ ./genome_updater.sh -d "refseq" -g "viral" -c "all" -f "genomic.fna.gz" -o "all_virus_genomes" -t 4
parallel not found
(base) emastriani@bilbo-06:/storage/Genome-Downloader$

Do you know what it means? It looks strange to me.


It was my mistake. genome_updater.sh requires the "parallel" package to be installed. On Ubuntu, running "apt install parallel" solved the problem. genome_updater.sh is now working:

(base) emastriani@bilbo-06:/storage/NCBI-ViralRefSeq$ ../Genome-Downloader/genome_updater.sh -d "refseq" -g "viral" -c "all" -f "genomic.fna.gz" -o "all_virus_genomes" -t 4

┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐ ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐
│ ┬├┤ ││││ ││││├┤ │ │├─┘ ││├─┤ │ ├┤ ├┬┘
└─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴ ─┴┘┴ ┴ ┴ └─┘┴└─

                                 v0.2.5 

Mode: NEW - DOWNLOAD

Working directory: /remote-storage/NCBI-ViralRefSeq/all_virus_genomes/

Downloading assembly summary [2021-12-02_11-37-59]

  • 12856 entries available
  • 6 entries removed with filters: RefSeq category=all, Assembly level=all, Version status=latest, valid URLs
  • 12850 entries to be downloaded
  • Downloading 12850 files with 4 threads
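The "parallel not found" message earlier came from a missing dependency, so a quick pre-flight check before a long download can save a round trip. A minimal sketch, assuming nothing about genome_updater itself: the function name missing_tools is made up, and apart from parallel (confirmed in this thread) the tool names in the usage comment are only examples.

```shell
# Sketch: report which of the given command-line tools are not on the PATH.
# The function name is hypothetical.

# missing_tools TOOL...
# Print the name of every TOOL that cannot be found, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example usage (not executed here); only "parallel" is known from this
# thread to be required, the other names are illustrative:
#   missing=$(missing_tools parallel wget gzip)
#   [ -z "$missing" ] || { echo "install first: $missing" >&2; exit 1; }
```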

Thank you
