PacBio file type from NCBI run download
5.3 years ago
noodle ▴ 590

Let's say I download a PacBio run file from the NCBI SRA dataset shown here: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR9849809

Download link: https://sra-download.ncbi.nlm.nih.gov/traces/sra2/SRR/009618/SRR9849809

Can someone tell me the format of the downloaded SRR9849809 file? It doesn't seem to be a standard compressed format, unless I missed something.

Thanks!

pacbio ncbi sra • 3.5k views
ADD COMMENT

See Tutorial: How to download raw sequence data from GEO/SRA. In this case, though, you don't need to split files.

Or directly download the FASTQ (sequences) from ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR984/009/SRR9849809/SRR9849809_subreads.fastq.gz
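
For reference, a minimal shell sketch of fetching the ENA file and sanity-checking it. The URL is the one given above; the helper names `fetch` and `count_fastq_records` and the local filename are made up for illustration, and the read count assumes plain four-line FASTQ records.

```shell
# URL copied from the reply above; OUT is an arbitrary local filename.
URL="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR984/009/SRR9849809/SRR9849809_subreads.fastq.gz"
OUT="SRR9849809_subreads.fastq.gz"

# Hypothetical helper: download URL ($1) to a local file ($2).
# -f fails on server errors, -L follows redirects, -C - resumes a partial file.
fetch() {
  curl -f -L -C - -o "$2" "$1"
}

# Hypothetical helper: check gzip integrity, then print the number of reads,
# assuming every FASTQ record is exactly four lines.
count_fastq_records() {
  gzip -t "$1" || return 1
  lines=$(gzip -dc "$1" | wc -l | tr -d ' ')
  echo $(( lines / 4 ))
}
```

Usage would be along the lines of: fetch "$URL" "$OUT" && count_fastq_records "$OUT"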

ADD REPLY

Thank you!

A related but different question: I noticed there are s3 and gs bucket listings. Do you know if SRA has a public bucket, or if there is a way to request access? Thanks again :)

s3://sra-pub-run-4/SRR9849809/SRR9849809.1
gs://sra-pub-run-4/SRR9849809/SRR9849809.1

ADD REPLY

You can install the gcloud utilities (part of the Cloud SDK) on your server. You can then copy data directly from the Google bucket:

$ gsutil cp gs://sra-pub-run-4/SRR9849809/SRR9849809.1 your_local_disk

Update: Even though the Google storage bucket is public, it appears that you have to pay to download the data (the bucket is a requester-pays bucket).

The AWS command-line utility provides similar functionality for Amazon buckets.

ADD REPLY

Thanks. Unfortunately, it seems these are not public buckets.

$ aws s3 cp s3://sra-pub-run-4/SRR9849809/SRR9849809.1 ./
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

$ gsutil cp gs://sra-pub-run-4/SRR9849809/SRR9849809.1 ./
BadRequestException: 400 Bucket is requester pays bucket but no user project provided.
ADD REPLY

I think the buckets are public, but you need to pay for data egress. So you will have to provide a valid Google Compute/Cloud project name for billing.
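
A sketch of what supplying a billing project can look like with gsutil, whose top-level `-u` option names the project to bill when reading from a requester-pays bucket. `BILLING_PROJECT`, `my-billing-project`, `DRY_RUN`, and the function name are placeholders invented here, not anything NCBI prescribes.

```shell
# Placeholder: replace with a project you are allowed to bill.
BILLING_PROJECT="${BILLING_PROJECT:-my-billing-project}"

# Hypothetical helper: copy one object ($1) from a requester-pays bucket
# to a destination ($2, default: current directory).
# With DRY_RUN=1 the command is printed instead of executed.
gs_requester_pays_cp() {
  src="$1"
  dest="${2:-.}"
  set -- gsutil -u "$BILLING_PROJECT" cp "$src" "$dest"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Running it with DRY_RUN=1 first shows the exact command before any egress charges are incurred. The AWS CLI has an analogous `--request-payer requester` option for `aws s3 cp`, though nothing in this thread confirms that it resolves the 403 shown above.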

ADD REPLY

Any idea where I can initiate this? It's not so obvious from clicking around the NCBI/SRA website... I'll start a new thread.

ADD REPLY

Any idea where I can initiate this?

Initiate? You need a valid Google Compute account (which can be set up using directions here). Generally you would have access via your institution (since they will pay for your account). Unless you intend to use Google Compute for analysis, you may be best off getting the data via the ENA link provided by @Jean above.

ADD REPLY

Yes, of course. I regularly use AWS and have gsutil installed as well. It seems we're right at the transition period: https://www.nlm.nih.gov/news/NLM_Moves_SRA_Cloud.html

ADD REPLY

As long as NCBI keeps free access available via sratoolkit (and ENA keeps FASTQ files available, which I believe they have committed to doing), all should be well. Not everyone is able to have Google/AWS accounts and pay for data downloads.
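
For the sratoolkit route, a rough sketch: `prefetch` downloads the run in SRA's native archive format and a dump tool converts it to FASTQ. The function names are illustrative only, and the sketch does not check whether the installed sra-tools version handles PacBio runs with `fasterq-dump`.

```shell
# Hypothetical helper: print the first converter found on PATH.
# Both tools ship with sra-tools; fasterq-dump is tried first.
pick_dump_tool() {
  for tool in fasterq-dump fastq-dump; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  echo "sra-tools not found on PATH" >&2
  return 1
}

# Hypothetical helper: fetch a run by accession ($1), then convert it to
# FASTQ in an output directory ($2, default: current directory).
sra_to_fastq() {
  acc="$1"
  outdir="${2:-.}"
  tool=$(pick_dump_tool) || return 1
  prefetch "$acc" && "$tool" --outdir "$outdir" "$acc"
}
```

For the run discussed here that would be: sra_to_fastq SRR9849809 ./fastq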

ADD REPLY

Sure, fasterq-dump is great.

I started a new thread here if you want to follow.

NCBI SRA AWS AMI

ADD REPLY

joe: NCBI SRA support indicated that the cloud services are not ready for public use (August 2019). Some data downloads will require payment, some will not. A public announcement about cloud services will be coming in the near future.

ADD REPLY

I have no idea. I know there is a public FTP.

ADD REPLY
