Question

SRA: is it possible to download files in the original format?

1

Entering edit mode

17 months ago

markgodek ▴ 50

Data management duties have lapsed in my lab. I'm trying to identify files in our systems that have been published to SRA.

I have a hash value for every file in our system, if I can download the exact file that was upload to SRA, I can get the hash value and cross check for duplicates. However, when files are uploaded to SRA, they are transformed into SRA objects from which you get the sequence using SRA toolkits fastq-dump.

Downloading the fastq using this tool yields a file of a different size from the originally uploaded file.

Are there certain command line options for fastq-dump I can specify to regenerate the exact file that was uploaded?

Thanks.

FASTQ SRA • 1.5k views

ADD COMMENT • link updated 4 months ago by GenoMax 147k • written 17 months ago by markgodek ▴ 50

score 3 · Accepted Answer · 2023-06-12

3

Entering edit mode

17 months ago

GenoMax 147k

If you look in Data access tab of a sample from your lab the original fastq files (as uploaded) will likely be available. Look at the Original Format section.

This is a random example to illustrate: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR19053315&display=data-access

ADD COMMENT • link 17 months ago by GenoMax 147k

0

Entering edit mode

I saw those (was using them for the size estimates mentioned). I didn't think they would be available for download via the s3 address, but I'll try it.

Thanks again.

Edit: following up

You have to use the SRA's cloud delivery service. You cannot download the original files from SRA's S3 bucket via AWS CLI.

ADD REPLY • link 17 months ago by markgodek ▴ 50

0

Entering edit mode

Did you have to pay?

ADD REPLY • link 4 months ago by Christina • 0

0

Entering edit mode

While technically you don't have to pay NCBI to get the data, you need to use the cloudy delivery as noted above. This requires you to create your own storage bucket in google cloud or amazon S3. Which mean you will end up with some cost since you have to pay for storage where the data will be delivered. This cost can vary where you are but is about 10 to 20 US cents per GB of data.

ADD REPLY • link 4 months ago by GenoMax 147k