Hey folks,
I am trying to download FASTQ files of the Human Genome Diversity Project (HGDP) with fastq-dump.
This is the command I am using (in a for loop, over a small number of runs):
fastq-dump-orig.3.0.9 --split-3 --gzip ERR1343843
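A loop of this kind could be sketched as follows. This is an illustrative sketch rather than the exact script: the file names `runs.txt` and `failed_runs.txt`, the function name `dump_runs`, and the `DUMPER` variable are hypothetical. It assumes one accession per line, keeps each run's stderr in its own log, and flags a run as failed if the tool exits non-zero or its log contains an ` err: ` line like the ones shown further down.

```shell
# Dumper binary; override DUMPER to test with another version (assumption: it is on PATH).
DUMPER="${DUMPER:-fastq-dump-orig.3.0.9}"

# dump_runs LIST [FAILED_LIST]
# Runs the dumper on every accession in LIST (one per line) and
# records accessions that failed in FAILED_LIST.
dump_runs() {
    list="$1"
    failed="${2:-failed_runs.txt}"
    : > "$failed"                      # start with an empty failure list
    while read -r run; do
        [ -n "$run" ] || continue      # skip blank lines
        # Keep stderr per run so warnings and errors can be inspected later.
        if ! "$DUMPER" --split-3 --gzip "$run" < /dev/null 2> "$run.log" \
            || grep -q ' err: ' "$run.log"; then
            echo "$run" >> "$failed"
        fi
    done < "$list"
}
```

Usage would be something like `dump_runs runs.txt failed_runs.txt`, after which `failed_runs.txt` can be fed back in as the next input list.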
The majority of downloaded files show errors or warnings (listed below). I tried both the older version 3.0.0 and the newer 3.0.9, but the errors persist for about 80% of the FASTQ files.
Example 1 (an error that resulted in a truncated file; note that the truncated files are still reported as OK by gzip -t):
2024-01-16T22:13:51 fastq-dump-orig.3.0.0 warn: database incorrect while opening manager within database module - can't open NC_000014.9 as a RefSeq or as a WGS
2024-01-16T22:13:51 fastq-dump-orig.3.0.0 err: database incorrect while opening manager within database module - failed ERR1344635
Example 2 (just a warning; the file seems OK based on its size):
2024-01-17T01:37:03 fastq-dump-orig.3.0.0 warn: database incorrect while opening manager within database module - can't open NT_187495.1 as a RefSeq or as a WGS
Read 51000044 spots for ERR1344382
Written 51000044 spots for ERR1344382
These errors seem random: the same run can download properly, with no warning or error, in version 3.0.0, yet produce the warning in 3.0.9. Some of the files downloaded with errors are complete, but others are not. So the behavior looks random to me.
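Since gzip -t passes even on the truncated files, a content-level check may be more informative. Below is a minimal sketch, not a definitive validator: the function names `count_fastq_records` and `check_pair` are made up for illustration, and it assumes the `<run>_1.fastq.gz` / `<run>_2.fastq.gz` naming that --split-3 produces for paired runs. It checks that the line count is a multiple of 4, that both mates have the same number of records, and optionally that the count matches an expected number supplied by the user (for example from the run's metadata).

```shell
# count_fastq_records FILE
# Prints the number of FASTQ records (4 lines each) in a gzipped file.
# Fails if the file is missing or the line count is not a multiple of 4.
count_fastq_records() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "$file: missing" >&2
        return 1
    fi
    lines=$(gzip -dc "$file" | wc -l | tr -d ' ')
    if [ $((lines % 4)) -ne 0 ]; then
        echo "$file: $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

# check_pair RUN [EXPECTED_PAIRS]
# Compares the record counts of RUN_1.fastq.gz and RUN_2.fastq.gz,
# and against EXPECTED_PAIRS if it is given.
check_pair() {
    run="$1"
    expected="${2:-}"
    n1=$(count_fastq_records "${run}_1.fastq.gz") || return 1
    n2=$(count_fastq_records "${run}_2.fastq.gz") || return 1
    if [ "$n1" -ne "$n2" ]; then
        echo "$run: mate counts differ ($n1 vs $n2)" >&2
        return 1
    fi
    if [ -n "$expected" ] && [ "$n1" -ne "$expected" ]; then
        echo "$run: $n1 pairs, expected $expected" >&2
        return 1
    fi
    echo "$run OK: $n1 read pairs"
}
```

For example, `check_pair ERR1344382 51000044` would compare against the spot count from the log above. Note that a truncation hitting both mate files at the same record would only be caught by the expected-count comparison, and that with --split-3 unpaired reads go to a separate file, so the paired count can be lower than the total spot count.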
It is becoming a headache since I need to download ~2000 RUNS.
I would be glad for any suggestions.
Storage space-wise, I have ~20 TB. I am working on a supercomputer cluster.
I also tried fasterq-dump, but this tool needs about 10x more space than the final output, which comes to ~60 TB. Downloading each file is very time-consuming, even though the tool itself performs faster. Basically, I am trying to pick the lesser of the evils.
Another option I tried is downloading from ENA via FTP with wget.
I usually check the md5sum at the end and re-download the files that fail. Usually that is just a few files, but not in this case.
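The checksum step can be sketched roughly like this. It is a sketch under assumptions: the function name `list_md5_failures` and the manifest file are hypothetical, the manifest is assumed to hold one `<md5> <filename>` pair per line (the same layout md5sum itself prints), and GNU md5sum is assumed to be available. It prints the name of every file that is missing or whose checksum does not match, so the output can serve as a re-download list.

```shell
# list_md5_failures MANIFEST
# MANIFEST lines look like: "<md5>  <filename>".
# Prints the names of files that are missing or fail the checksum.
list_md5_failures() {
    while read -r expected name; do
        [ -n "$name" ] || continue          # skip blank or malformed lines
        if [ ! -f "$name" ]; then
            echo "$name"                    # missing counts as a failure
            continue
        fi
        actual=$(md5sum "$name" | cut -d ' ' -f 1)
        [ "$actual" = "$expected" ] || echo "$name"
    done < "$1"
}
```

For instance, `list_md5_failures manifest.md5 > redownload.txt` would collect the files to fetch again. `md5sum -c --quiet manifest.md5` reports the same mismatches, but in a format that is less convenient to feed back into a download loop.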
I just want to understand the nature of this issue. Obviously, I could just spend the whole semester working through it manually.
Thanks in advance for any suggestions!