Question

Efficient Bulk Data Retrieval from NCBI BioProject

18 months ago

George ▴ 10

Hello,

A month ago, I used the SRA Toolkit to download FASTQ files for a BioProject accession. Following the recommended steps, I generated a list of SRR accessions, ran prefetch, and then ran fasterq-dump (using parallel-fastq-dump) to obtain the data locally, which gave me fq.gz files named after the corresponding SRR accessions.
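For readers following along, the workflow described above might look roughly like the sketch below. It is not the poster's actual script: the function names, the `DRY_RUN` switch, and the list/output paths are all illustrative assumptions, and it calls plain fasterq-dump followed by gzip (fasterq-dump itself writes uncompressed FASTQ). parallel-fastq-dump, a separate wrapper tool, is not shown.

```shell
# Sketch: SRR list -> prefetch -> fasterq-dump -> gzip.
# run, fetch_run, fetch_all and DRY_RUN are illustrative names,
# not part of the SRA Toolkit.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download one run, convert it to FASTQ, then compress the result.
fetch_run() {
    acc=$1
    outdir=$2
    run prefetch "$acc"
    run fasterq-dump --split-files --outdir "$outdir" "$acc"
    # fasterq-dump writes uncompressed FASTQ, so gzip is a separate step.
    # The glob expands only once the FASTQ files exist.
    run gzip "$outdir"/"$acc"*.fastq
}

# Process every non-empty line of an accession list file.
# The list is read on file descriptor 3 so the tools cannot consume it via stdin.
fetch_all() {
    list=$1
    dest=$2
    while IFS= read -r line <&3 || [ -n "$line" ]; do
        [ -n "$line" ] || continue
        fetch_run "$line" "$dest"
    done 3< "$list"
}
```

Calling `fetch_all runs.txt fastq_out` with `DRY_RUN=1` set prints the commands without running them, which is a cheap way to check the accession list before starting a large download.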

Recently, while writing a review for my project, I tried running prefetch with the BioProject accession itself. Surprisingly, it not only worked but also downloaded the files in fq.gz format, which prefetch supposedly cannot do. Furthermore, the files were named with the original project IDs (as used in the paper) rather than the SRR accessions. I am puzzled by this behavior and would appreciate any insight into why it happened.

ncbi SRAtoolkit prefetch • 1.3k views

ADD COMMENT • link 18 months ago by George ▴ 10

Anecdotal evidence is hard to comment on. Give a precise code example for reproduction.

ADD REPLY • link 18 months ago by ATpoint 88k

Hey, sorry if my post was not adequate. To retrieve all the fq.gz data I just ran:

prefetch PRJNA393611

ADD REPLY • link 18 months ago by George ▴ 10

Were you using different versions of sratoolkit on the two occasions? Functionality is routinely added in newer versions, and additional command-line options may have been added that change the default behavior. There can be many explanations.

ADD REPLY • link 18 months ago by GenoMax 151k

I guess it is something they added recently and there is no documentation for it, although I find it peculiar that the downloaded data lacks the dataset name and instead carries the names assigned by the authors. Regardless, I hope that in the future someone finds this post and runs prefetch with the BioProject accession directly, avoiding the need for a script to retrieve all the SRRs as I did:

for i in {344..819}; do prefetch "SRX3057${i}"; done

To answer your question: no, the version was the same, but I didn't know prefetch could download all the FASTQ files with just the BioProject accession. Anyway, thank you for your insights!
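A loop over a numeric range like the one above only works when the accessions happen to be contiguous. A more general option, sketched here under the assumption that NCBI Entrez Direct (`esearch`, `efetch`) is installed, is to ask SRA for the project's RunInfo table and pull the run accessions from its first column. The function names are hypothetical, and the sketch has not been run against this particular project.

```shell
# Keep only run accessions (SRR/ERR/DRR) from the first column
# of a RunInfo CSV read from stdin; the header and blank lines are dropped.
runs_from_runinfo() {
    cut -d',' -f1 | grep -E '^[SED]RR[0-9]+$' || true
}

# Query SRA for a BioProject and list its runs.
# Requires Entrez Direct and network access.
list_runs() {
    esearch -db sra -query "$1" | efetch -format runinfo | runs_from_runinfo
}

# Example (not executed here):
#   list_runs PRJNA393611 > runs.txt
#   prefetch --option-file runs.txt
```

Keeping the CSV parsing in its own function makes it possible to check the filter against a saved RunInfo file before touching the network.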

ADD REPLY • link 18 months ago by George ▴ 10