Here is what I got from SRA about a week ago. In short, you'll likely need an AWS or GCP account and you may need to pay download costs unless you are using the data in the cloud. Full documentation is apparently being prepared.
The SRA Toolkit is needed to read ETL data, and its default configuration enables it to find and retrieve SRA runs by accession.
You can use the SRA Run Selector with Study, Sample, or Experiment accessions or an Entrez search to select a list of interesting SRA runs.
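For anyone who wants to script the "retrieve runs by accession" step, here is a minimal sketch, assuming the SRA Toolkit's `prefetch` and `fasterq-dump` are installed and on your PATH. The helper names (`toolkit_commands`, `fetch_run`) and the accession check are my own illustration, not part of the toolkit.

```python
import re
import subprocess

# Run accessions look like SRR/ERR/DRR followed by digits (assumed pattern).
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def toolkit_commands(accession, outdir="."):
    """Build the prefetch and fasterq-dump command lines for one run.

    Hypothetical helper: it only builds the argument lists and runs nothing.
    """
    if not RUN_ACCESSION.match(accession):
        raise ValueError(f"not a run accession: {accession!r}")
    return [
        ["prefetch", accession],                        # download the .sra file
        ["fasterq-dump", accession, "--outdir", outdir],  # convert it to FASTQ
    ]


def fetch_run(accession, outdir="."):
    """Run both commands in order, stopping at the first failure."""
    for cmd in toolkit_commands(accession, outdir):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands for a list of runs, e.g. exported from the Run Selector.
    for acc in ["SRR000001"]:
        for cmd in toolkit_commands(acc, outdir="fastq"):
            print(" ".join(cmd))
```

Swap the `print` for `fetch_run(acc, "fastq")` once the printed commands look right for your setup.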
Many files are also available on Google Cloud Platform (GCP) or Amazon Web Services (AWS), but accessing them may require an account with that provider, and you may have to pay egress charges to move the data outside of that provider's platform.
The Free Egress column describes where the data can be accessed without an egress charge.
Worldwide - This data can be downloaded from anywhere without paying an egress fee.
s3.us-east-1 - This data is free to access for machines running in Amazon's us-east-1 region; all other regions, or transfer outside of Amazon, will require paying egress charges.
Access Type describes whether a user account is necessary for data access or if the data can be accessed anonymously.
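The Free Egress rule described above can be sketched as a small check. This is only my reading of the two example values in the text ("Worldwide" and "s3.us-east-1"); the function name and the idea of describing your own location in the same notation are assumptions, and other column values may behave differently.

```python
def is_egress_free(free_egress, client_location):
    """Return True if a download should incur no egress charge.

    free_egress: value from the Free Egress column, e.g. "Worldwide"
                 or "s3.us-east-1".
    client_location: where the download runs, written in the same
                 notation (e.g. "s3.us-east-1"), or "local" for a
                 machine outside any cloud. This notation is an
                 assumption for illustration.
    """
    value = free_egress.strip().lower()
    if value == "worldwide":
        # Downloadable from anywhere without an egress fee.
        return True
    # Otherwise free only from inside the named cloud region.
    return value == client_location.strip().lower()


if __name__ == "__main__":
    print(is_egress_free("Worldwide", "local"))
    print(is_egress_free("s3.us-east-1", "s3.us-east-1"))
    print(is_egress_free("s3.us-east-1", "local"))
```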
Primary ETL - The file format that has been traditionally distributed from SRA and used by the SRA Toolkit to read or output into formats like FASTQ, SAM, etc. This data is normalized during the extract, transform, and load (ETL) process at SRA.
Original - The source data that was submitted to SRA and has not gone through the ETL process. These files may require specific software to open and read.
Analysis (previously called Secondary ETL) - These files are a further analysis of the data available in the run, but may not be present for all runs. May include items like realignments, wgMLST, VCF, etc.
While someone may respond, I suggest that you send an official ticket in to NCBI using this form. Use the "Write to the Help Desk" button on the right. Please update this thread when you hear back from them.
I sent a ticket in to see if the SRA Google/Amazon bucket links were available for public use. Will post an update in the other thread you have.
NCBI SRA support indicated that the cloud services are not ready for public use (as of early August 2019). Some data downloads will require payment, some not. A public announcement about the cloud services will be coming in the near future.
Thanks, I'm still bouncing emails back and forth with NLM. They were confused why their AMI wasn't available, but it seems like it's just a matter of days/weeks until everything is up and running.
Maybe I'm doing something wrong, but I've configured the AWS CLI on my local computer with the appropriate us-east-1 region and tried to download an SRA file. It says it's forbidden (403). From this thread it seems to me that easy downloading from AWS is something they want to provide. I should be allowed without needing to spin up or log into an EC2 instance. I just want a better/faster way to get .sra files; using prefetch can be problematic, i.e. download failures, long pauses in the download, and very slow/erratic download speeds.
Do yourself a favor and download directly in fastq format from ENA: Fast download of FASTQ files from the European Nucleotide Archive (ENA)
(only works for non-restricted data, most datasets are non-restricted)
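To make the ENA route a bit more concrete, here is a rough sketch of looking up FASTQ locations for a run through the ENA Portal API file report. The helper names are mine, and I'm assuming the `fastq_ftp` field comes back as a semicolon-separated list of host/path strings without a scheme; check a real response before relying on this.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the file report URL for one run/study accession."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    }
    return ENA_FILEREPORT + "?" + urlencode(params)


def fastq_urls(tsv_text):
    """Extract FASTQ URLs from the TSV text returned by the file report.

    Assumes fastq_ftp holds 'host/path;host/path' entries (no scheme),
    so 'ftp://' is prepended here.
    """
    urls = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for path in (row.get("fastq_ftp") or "").split(";"):
            if path:
                urls.append("ftp://" + path)
    return urls


if __name__ == "__main__":
    # Fetch the report yourself (browser, curl, urllib) and pass the text in.
    print(filereport_url("SRR000001"))
```

Feed the resulting URLs to whatever downloader you prefer; the linked ENA post covers faster transfer options.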
ATpoint, do you have any advice on my post on that thread? A: Fast download of FASTQ files from the European Nucleotide Archive (ENA)
hermidalc: Per conversation with NCBI support, the AWS/Google buckets are NOT yet available for public use (Sept 2019), even though the links have started appearing in SRA records.
They are supposed to go live later this year after an announcement is made by NCBI.
This is a great outcome!
https://www.amazon.science/latest-news/aws-democratizes-access-to-the-largest-genomic-sequences-repository-nihs-sequence-read-archive
If someone uses this can they confirm the following?