Hello all, I'm trying to download the nr.gz FASTA file from NCBI and build a DIAMOND database from it. I originally used
$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
to download the nr.gz file. It seemed to work, but when I try to run
$ diamond makedb --in nr.gz -d nr
I get the following error:
#CPU threads: 64
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /global/scratch/users/*****/*****/nr.gz
Opening the database file... [0.028s]
Loading sequences... [1.93s]
Error: Inflate error.
I then tried
$ fixgz nr.gz nr.fixed.gz
and ran diamond makedb again, and got the same error:
#CPU threads: 64
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1);
Database input file: /global/scratch/users/*****/*****/nr.fixed.gz
Opening the database file... [0.031s]
Loading sequences... [0.118s]
Error: Inflate error.
I've also tried to gunzip nr.gz and nr.fixed.gz, and I get:
gzip: nr.fixed.gz: invalid compressed data--format violated
How do I successfully download the nr.gz file? It's huge, and it sounds like FTP is often unstable, so maybe the file is getting corrupted in transit? I've tried downloading it multiple times with the same result. Is there an older version of nr.gz I could use?
Before you do anything, the integrity of your file can be tested using the -t switch of gzip. Beware that it will take a long time. If the file is corrupted, I suggest you try downloading it with aria2. In my hands it is much faster than wget because it uses multiple connections, and it can also resume an interrupted download, so there should be no issues with corruption.
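For reference, a minimal sketch of that integrity check, wrapped in a small shell function. The name check_gz is made up for illustration; gzip -t itself is the standard test switch and only reads the archive without writing any output.

```shell
# check_gz is a hypothetical helper name, not part of gzip.
# gzip -t reads and decompresses the whole archive, discards the result,
# and exits non-zero if the compressed stream is damaged.
check_gz() {
    if gzip -t "$1" 2>/dev/null; then
        echo "OK: $1"
    else
        echo "CORRUPT: $1"
        return 1
    fi
}

# Usage (not run here): check_gz nr.gz
```

For a file the size of nr.gz, expect this to take a while, since the entire archive has to be read and decompressed.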
Thanks! I'm downloading aria2 now. What options have you used in the past? I'm checking out the documentation, but if you already have a command line you use, that would be helpful :)
That error suggests that your file is likely corrupt. Try @mensur's suggestion to confirm file integrity. It looks like you are using a central compute resource, so the download should not run into problems. NCBI FTP is not unstable; if anything, it may be your local firewall that is causing the problem.
I just tested a fresh download of nr.gz and diamond started making the indexes without any error, so the file at NCBI seems to be fine. Be sure to allocate enough RAM for this task if you are using a cluster.
How long did it take you to download the gz file? I'm trying aria2 now and it's running, but I'm curious how long I should expect (it was nearly 10 hours using wget).
It was under 30 min using wget.
So I'm running
$ aria2c -x16 -k1M "ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz"
and it's been running for over an hour. The size of nr.gz is 117317190541. The output says it's only at about 30%. Am I doing something wrong?
Probably. Again this could be due to multiple factors.
If you are at 30% now then another 2.5h should see the download complete.
The command I use with aria2:
It took ~26 minutes (~66 MB/s). You may be tempted to use more than 4 connections, but NCBI may not like that and could throttle your IP address.
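Since the exact command did not survive in this post, here is a sketch of what a 4-connection aria2 download could look like. This is an assumption, not necessarily the command used above: the helper name download_nr and the ARIA2C override are hypothetical, while -c, -x and -s are standard aria2c options and the URL is the one from the question.

```shell
# URL taken from the question; everything else is an illustrative assumption.
NR_URL="ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz"

# download_nr is a hypothetical helper name.
#   -c    resume a partially downloaded file instead of starting over
#   -x 4  at most 4 connections to the server
#   -s 4  split the download into 4 parallel pieces
# ARIA2C can be set to point at a different aria2c binary.
download_nr() {
    "${ARIA2C:-aria2c}" -c -x 4 -s 4 "${1:-$NR_URL}"
}

# Usage (not run here): download_nr
```

If the transfer is interrupted, calling download_nr again in the same directory should pick up where it left off rather than restarting.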
Hm, it says that my speed is 14MiB/s on average. Is there a way to speed it up? I tried using a 60M minimum speed and it crashed.
I have already answered both of your last two questions - please read what I wrote. Sometimes less is more, so your 16 connections are probably causing NCBI to slow down your download. Also see what GenoMax added, which explains other potential factors that may be unique to your internet connection.
Now, if it took you 10+ hours last time and it will likely be less than 5 with aria2, that's still a significant speed-up. Sometimes we just need to accept things as they are.
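The time estimates in this thread are simple proportions, and can be reproduced with a rough sketch like the one below. It assumes the download speed stays constant, which it may not; remaining_minutes is a hypothetical helper name.

```shell
# remaining_minutes ELAPSED_MIN PERCENT_DONE
# Hypothetical helper: estimates the minutes left, assuming constant speed.
remaining_minutes() {
    elapsed=$1
    pct=$2
    # time left = elapsed * (share still to download) / (share already done)
    echo $(( elapsed * (100 - pct) / pct ))
}

remaining_minutes 60 30   # prints 140
```

With 30% done after 60 minutes this gives 140 minutes, so a bit more than an hour elapsed lands close to the 2.5 h mentioned above.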
I did read your answers. I'm not asking because I'm concerned about time; I'm concerned that the slow speed has something to do with why my file is being corrupted. After the download completed, the file was still corrupted: gunzip -t nr.gz returned a format error.
With all due respect, you specifically asked "Is there a way to speed it up?", which seems more of a concern about speed than file corruption.
Both GenoMax and I downloaded today's copy of nr.gz without any issues, so I don't think anything is wrong with the file. That leaves software on your side (what is your gzip version? mine is 1.6), the integrity of your hard disk, or something with your internet connection, as has already been pointed out. Maybe consulting your local admins will help you troubleshoot it.
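One way to tell a bad download apart from a gzip or disk problem is to compare the file's MD5 checksum with the one published by NCBI. A rough sketch, assuming a checksum file named nr.gz.md5 sits next to nr.gz on the NCBI server and that GNU md5sum is available on your system; verify_md5 is a hypothetical helper name.

```shell
# verify_md5 FILE MD5FILE
# Hypothetical helper. MD5FILE is assumed to be in md5sum format:
#   <hash>  <filename>
verify_md5() {
    # Expected hash: first field of the first line of the checksum file
    expected=$(head -n 1 "$2" | cut -d ' ' -f 1)
    # Actual hash of the local file
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    if [ "$expected" = "$actual" ]; then
        echo "checksum matches"
    else
        echo "checksum MISMATCH"
        return 1
    fi
}

# Usage (not run here), after downloading nr.gz.md5 next to nr.gz:
#   verify_md5 nr.gz nr.gz.md5
#   gzip --version    # to report the gzip version asked about above
```

A mismatch would point to the file being damaged in transit or on disk. A match combined with a failing gzip -t would instead point to a local software or hardware problem.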