FASTQ-DUMP error while downloading HGDP dataset
11 months ago
Qboy ▴ 10

Hey folks,

I am trying to download FASTQ files of the Human Genome Diversity Project (HGDP) with fastq-dump.

This is the command I am using (in a for loop, over a small number of runs):

fastq-dump-orig.3.0.9 --split-3 --gzip ERR1343843

Most of the downloaded files show errors or warnings (listed below). I tried both the older version 3.0.0 and the newer 3.0.9, but the errors persist for about 80% of the FASTQ files.

Example 1 (an error that resulted in a truncated file; the truncated files are still reported as OK by gzip -t):

2024-01-16T22:13:51 fastq-dump-orig.3.0.0 warn: database incorrect while opening manager within database module - can't open NC_000014.9 as a RefSeq or as a WGS

2024-01-16T22:13:51 fastq-dump-orig.3.0.0 err: database incorrect while opening manager within database module - failed ERR1344635

Example 2 (just a warning; the file seems OK based on file size):

2024-01-17T01:37:03 fastq-dump-orig.3.0.0 warn: database incorrect while opening manager within database module - can't open NT_187495.1 as a RefSeq or as a WGS

Read 51000044 spots for ERR1344382

Written 51000044 spots for ERR1344382

These errors seem random: the same run downloads properly with no warning or error in version 3.0.0, but gives the warning in 3.0.9. Some of the files downloaded with errors are correct, but some are not.
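Since gzip -t passes even on the truncated files, one way to catch them is to count the FASTQ records in each file and compare that with the spot count. The following is only a sketch: the function names are made up for illustration, and the expected count is assumed to come from a trusted source such as the "Read N spots" log line or the run metadata. With --split-3, each of the paired files should hold that many records, assuming every spot is paired.

```shell
# Count FASTQ records in a gzip-compressed file (4 lines per record).
fastq_read_count() {
    gzip -cd "$1" | awk 'END { print int(NR / 4) }'
}

# Succeed only if the record count equals the expected spot count.
# Hypothetical helper: the expected value must be supplied by the caller.
check_spots() {
    local file="$1"
    local expected="$2"
    local actual
    actual=$(fastq_read_count "$file")
    if [ "$actual" -eq "$expected" ]; then
        echo "OK $file $actual"
    else
        echo "MISMATCH $file expected=$expected actual=$actual"
        return 1
    fi
}

# Example call, using the spot count from the log above:
#   check_spots ERR1344382_1.fastq.gz 51000044
```

This decompresses the whole file, so it is slow on large runs, but it does not rely on the gzip container alone.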

It is becoming a headache since I need to download ~2000 RUNS.

I would be glad for any suggestions.

Storage space-wise, I have ~20 TB. I am working in a supercomputer cluster.

I also tried fasterq-dump, but this tool needs about 10x more space than the final output, which would be ~60 TB. Each file still takes a long time to download, although the tool itself performs faster. Basically, I am trying to pick the lesser of the evils.

Another option I tried is downloading from ENA via FTP with wget.

I usually check md5sum at the end and re-download the files that fail. Usually that is just a few files, but not in this case.

I just want to understand the nature of this issue. Obviously, I could spend the whole semester working through it manually.

Thanks in advance for any suggestions!

11 months ago
ATpoint 86k

Ignore fastq-dump. Enter the accession at sra-explorer.info and get a direct fastq download link.

11 months ago
GenoMax 148k

I assume this data is openly available, since I see ftp links on the HGDP page here (scroll down): https://www.internationalgenome.org/data-portal/data-collection/hgdp

If that is correct, then you should get the files from the link above. You may need to change ftp to https in the links.

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR129/009/ERR1295999/ERR1295999_2.fastq.gz

to

https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR129/009/ERR1295999/ERR1295999_2.fastq.gz
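For a long list of links, the ftp to https change can be scripted. A minimal sketch, where ftp_to_https and the file names in the comment are hypothetical:

```shell
# Rewrite a leading ftp:// to https://; any other URL is passed through unchanged.
ftp_to_https() {
    printf '%s\n' "$1" | sed 's#^ftp://#https://#'
}

# For a whole list (one URL per line; urls.txt is a hypothetical file name):
#   sed 's#^ftp://#https://#' urls.txt > urls_https.txt
```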

You can also use sra-explorer (sra-explorer.info) to get the direct links for your samples; it finds SRA and FASTQ download URLs in a couple of clicks.

For hundreds of samples you should look into using Aspera, if that is a possibility.

10 months ago
Qboy ▴ 10

For those who are interested or have the same issue:

Just download directly from the URL with wget: it is faster, less of a headache, has no issues, and is easily verified with an md5sum check and the file date.
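A rough sketch of that wget plus md5sum routine is below. It assumes a Linux system with wget and md5sum, and that you already have the expected MD5 for each file (for example from the run metadata that comes with the download links). The function names and the runs.tsv file are hypothetical.

```shell
# Succeed if the file's MD5 equals the expected hex digest.
md5_matches() {
    local file="$1"
    local expected="$2"
    local actual
    actual=$(md5sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Download a URL into the current directory and verify it, trying up to 3 times.
# wget -c resumes a partial file instead of starting over.
fetch_verified() {
    local url="$1"
    local expected="$2"
    local file
    local attempt
    file=$(basename "$url")
    for attempt in 1 2 3; do
        if wget -c -q "$url"; then
            md5_matches "$file" "$expected" && return 0
            rm -f "$file"   # finished but checksum is wrong: start from scratch
        fi
    done
    echo "FAILED $url" >&2
    return 1
}

# Example loop over a tab-separated list of "url<TAB>md5" (bash syntax):
#   while IFS=$'\t' read -r url md5; do fetch_verified "$url" "$md5"; done < runs.tsv
```

Failed URLs are printed to stderr, so they can be collected and retried later instead of being checked by hand.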

Peace!
