NCBI SRA AWS AMI
5.3 years ago
noodle ▴ 590

I came across the NCBI page below with instructions for accessing the SRA dataset through AWS, but the AMI it references no longer seems to exist. Does anyone have a lead on how the SRA dataset can be accessed via AWS or, more specifically, how one might access data held in S3 buckets?

https://www.ncbi.nlm.nih.gov/sra/docs/sra-aws-download/

Update: Great news!

https://www.amazon.science/latest-news/aws-democratizes-access-to-the-largest-genomic-sequences-repository-nihs-sequence-read-archive

https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/

SRA NCBI AWS S3 • 7.0k views

While someone here may respond, I suggest that you submit an official ticket to NCBI using this form. Use the "Write to the Help Desk" button on the right. Please update this thread when you hear back from them.

I sent in a ticket to ask whether the SRA Google/Amazon bucket links are available for public use. I will post an update in your other thread.


NCBI SRA support indicated that the cloud services are not ready for public use (as of early August 2019). Some data downloads will require payment, some will not. A public announcement about the cloud services will be coming in the near future.


Thanks, I'm still bouncing emails back and forth with NLM. They were confused about why their AMI wasn't available, but it seems like it's just a matter of days/weeks until everything is up and running.


Maybe I'm doing something wrong, but I've configured the AWS CLI on my local computer with the appropriate us-east-1 region and tried to download an SRA file:

aws s3 cp s3://sra-pub-run-3/SRR292241/SRR292241.3 SRR292241.sra

It says it's forbidden (403). From this thread it seems that easy downloading from AWS is something they want to provide, so I should be able to do this without spinning up or logging into an EC2 instance. I just want a better/faster way to get .sra files. Using prefetch can be problematic: download failures, long pauses in the download, and very slow/erratic download speeds.
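
One way to check whether an object is readable without credentials is to retry the copy as an unsigned request. The sketch below only builds and prints the command (a dry run). The bucket name and key layout are copied from the command above; `--no-sign-request` and `--region` are standard AWS CLI global options, but whether this bucket allows anonymous reads is an assumption that this thread does not confirm. The function names and the version-suffix parameter are hypothetical.

```shell
# Sketch: build an unsigned S3 copy command for an SRA run (dry run only).
# Bucket and key layout come from the post above; anonymous read access
# to this bucket is an assumption, not something the thread confirms.

sra_s3_uri() {
    # $1 = run accession, $2 = version suffix (hypothetical parameter; "3" in the post)
    printf 's3://sra-pub-run-3/%s/%s.%s\n' "$1" "$1" "$2"
}

sra_cp_cmd() {
    # Print the command instead of running it, so it can be inspected first.
    printf 'aws s3 cp %s %s.sra --no-sign-request --region us-east-1\n' \
        "$(sra_s3_uri "$1" "$2")" "$1"
}

sra_cp_cmd SRR292241 3
```

If the printed command still returns 403 when run, the object is most likely not open to the public yet, regardless of local CLI configuration.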


Do yourself a favor and download directly in fastq format from ENA: Fast download of FASTQ files from the European Nucleotide Archive (ENA)

(only works for non-restricted data, most datasets are non-restricted)


ATpoint do you have any advice on my post on that thread? A: Fast download of FASTQ files from the European Nucleotide Archive (ENA)


hermidalc: Per my conversation with NCBI support, the AWS/Google buckets are NOT yet available for public use (Sept 2019), even though the links have started appearing in SRA records.

They are supposed to go live later this year after an announcement is made by NCBI.


This is a great outcome!

https://www.amazon.science/latest-news/aws-democratizes-access-to-the-largest-genomic-sequences-repository-nihs-sequence-read-archive


If someone uses this can they confirm the following?

  • Is the data accessible only from within AWS, or can it also be downloaded?
  • Is access free of charge outside AWS (within AWS it likely is)?
5.3 years ago

Here is what I got from SRA about a week ago. In short, you'll likely need an AWS or GCP account and you may need to pay download costs unless you are using the data in the cloud. Full documentation is apparently being prepared.

The SRA Toolkit is needed for ETL data and the default toolkit configuration enables it to find and retrieve SRA runs by accession.

You can use the SRA Run Selector with Study, Sample, or Experiment accessions or an Entrez search to select a list of interesting SRA runs.

Many files are also available in either the Google Cloud Platform (GCP) or Amazon Web Services (AWS) but may require the user to have an account with that provider to access the files and pay egress charges to access the data outside of that cloud provider's platform.

The Free Egress column describes where the data can be accessed without an egress charge.
  • Worldwide: This data can be downloaded from anywhere without paying an egress fee.
  • s3.us-east-1: This data is free to access for machines running in Amazon's us-east-1 region. All other regions, or transport outside of Amazon, will require paying egress charges.
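
The Free Egress rule above can be sketched as a small shell check. This is only an illustration of the two values quoted from SRA: the function name and the location labels are hypothetical, and any value not mentioned above is treated as chargeable by assumption.

```shell
# Sketch: decide whether a download avoids egress charges, following the
# two Free Egress values described above. Location labels are hypothetical:
# "s3.us-east-1" stands for a machine running in Amazon's us-east-1 region;
# any other label stands for another region or a machine outside Amazon.
egress_is_free() {
    free_egress="$1"   # value shown in the Free Egress column
    location="$2"      # where the download runs (hypothetical label)
    case "$free_egress" in
        Worldwide)    return 0 ;;
        s3.us-east-1) [ "$location" = "s3.us-east-1" ] ;;
        *)            return 1 ;;   # unknown value: assume charges apply
    esac
}

if egress_is_free "s3.us-east-1" "local-laptop"; then
    echo "free"
else
    echo "egress charges apply"
fi
```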

Access Type describes whether a user account is necessary for data access or if the data can be accessed anonymously.

Primary ETL: The file format that has been traditionally distributed from SRA and used by the SRA Toolkit to read or output into formats like FASTQ, SAM, etc. This data is normalized during the extract, transform, and load (ETL) process at SRA.

Original: The source data that was submitted to SRA and has not gone through the ETL process. These files may require specific software to open and read.

Analysis (previously called Secondary ETL): These files are a further analysis of the data available in the run, but may not be present for all runs. May include items like realignments, wgMLST, VCF, etc.


you'll likely need an AWS or GCP account and you may need to pay download costs unless you are using the data in the cloud

That will lock out a number of users. I assume some form of free access will remain (e.g. sratoolkit); otherwise we will need to depend on ENA to provide fastq data.


I believe that SRA toolkit will remain as an option, at least for the foreseeable future. I know of no SRA datasets that are available in the cloud only as of now.


I'm having an issue with getting some fastqs that are only available in the cloud. If you go to https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10620024 and click on the Data access tab, there are some AWS S3 links, in a section titled Original format, to the files that I want.

There are problems with the files downloadable by fastq-dump. SRP235541 is the accession and it is annotated as being paired end, but using fasterq-dump results in a single fastq file per run.


You can download this data file and then save it with a .sra extension. Then you can dump the reads out using

fastq-dump -F --split-files SRR10620024.sra

I got the expected 3 files with 8, 26, and 98 bp reads.

::::::::::::::
SRR10620024_1.fastq
::::::::::::::
@K00335:298:HYCLMBBXX:7:1101:1550:1543
TCAGCCGT
+K00335:298:HYCLMBBXX:7:1101:1550:1543
A--AF-AA

::::::::::::::
SRR10620024_2.fastq
::::::::::::::
@K00335:298:HYCLMBBXX:7:1101:1550:1543
CAACCTCCAAAGGCGTAACTTTACAA
+K00335:298:HYCLMBBXX:7:1101:1550:1543
AAAFFJJJJJ<7<A<F7JJFJJJA<A

::::::::::::::
SRR10620024_3.fastq
::::::::::::::
@K00335:298:HYCLMBBXX:7:1101:1550:1543
GATNGCAGAATATGGAGTCATTATTAGAGACTAAGACGCTATGTATAGATGCACAAAGGATGGAGTCGCTCTGGTCTACACAAAGGTAAGAATTTTCC
+K00335:298:HYCLMBBXX:7:1101:1550:1543
A7-#7FA-AF--77-<7-----7A77----<AAJ-FJ----7-77--7----<-77---7<-AJ7-----7A----7-77AA----7---7---77--
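
A quick way to confirm the read length in each split file is to measure the first sequence line. This is a minimal sketch; the function name `first_read_length` is hypothetical, and it assumes standard four-line FASTQ records like those shown above.

```shell
# Sketch: print the length of the first read in each FASTQ file given.
# Assumes four-line FASTQ records, so line 2 holds the first sequence.
first_read_length() {
    for fq in "$@"; do
        awk -v name="$fq" 'NR == 2 { print name "\t" length($0); exit }' "$fq"
    done
}

# Example call (files produced by the fastq-dump command above):
#   first_read_length SRR10620024_1.fastq SRR10620024_2.fastq SRR10620024_3.fastq
```

This only inspects the first record of each file, so it is a sanity check on the split, not a full length distribution.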

Thanks! I guess I'll need some patience. Here are two interesting documents/webpages I found while searching for an answer.

https://www.nlm.nih.gov/news/NLM_Moves_SRA_Cloud.html

https://datascience.nih.gov/sites/default/files/NIH_Strategic_Plan_for_Data_Science_Final_508.pdf
