Question

PacBio file type from NCBI run download

0

5.4 years ago

noodle ▴ 600

Let's say I download a PacBio run file from the NCBI SRA dataset shown here: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR9849809

Download link here: https://sra-download.ncbi.nlm.nih.gov/traces/sra2/SRR/009618/SRR9849809

Can someone tell me the format of the downloaded SRR9849809 file? It doesn't seem to be a standard compressed format, unless I missed something.

Thanks!

pacbio ncbi sra • 3.5k views

ADD COMMENT • link 5.4 years ago by noodle ▴ 600

1

See Tutorial: How to download raw sequence data from GEO/SRA. In this case, though, you don't need to split files.

Or directly download the FASTQ (sequences) from ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR984/009/SRR9849809/SRR9849809_subreads.fastq.gz
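For other runs, the ENA directory can be derived from the accession. The sketch below follows ENA's published FTP layout (first six characters, then a zero-padded subdirectory taken from the trailing digits); `ena_fastq_dir` is a hypothetical helper name, and the file names inside the directory vary by run, so list the directory rather than guessing them.

```shell
# Print the ENA FASTQ directory for a run accession such as SRR9849809.
# The body runs in a subshell "( ... )" so its variables stay local.
ena_fastq_dir() (
    acc=$1
    base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
    prefix=${acc%"${acc#??????}"}   # first six characters, e.g. SRR984
    digits=${acc#???}                # numeric part, e.g. 9849809
    case ${#digits} in
        6) sub= ;;                                  # no subdirectory
        7) sub=00${digits#"${digits%?}"}/ ;;        # 00 + last digit
        8) sub=0${digits#"${digits%??}"}/ ;;        # 0 + last two digits
        9) sub=${digits#"${digits%???}"}/ ;;        # last three digits
        *) echo "unexpected accession: $acc" >&2; exit 1 ;;
    esac
    printf '%s/%s/%s%s\n' "$base" "$prefix" "$sub" "$acc"
)
```

For this run, `wget "$(ena_fastq_dir SRR9849809)/SRR9849809_subreads.fastq.gz"` should resolve to the link above.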

ADD REPLY • link 5.4 years ago by jean.elbers ★ 1.7k

0

Thank you!

A related but different question - I noticed there are s3 and gs bucket listings. Do you know if SRA has a public bucket? Or is there a way to request access? Thanks again :)

s3://sra-pub-run-4/SRR9849809/SRR9849809.1
gs://sra-pub-run-4/SRR9849809/SRR9849809.1

ADD REPLY • link 5.4 years ago by noodle ▴ 600

1

You can install the gcloud utilities (part of the Cloud SDK) on your server. You can then copy data directly from the Google bucket:

gsutil cp gs://sra-pub-run-4/SRR9849809/SRR9849809.1 your_local_disk

Update: Even though the Google storage bucket is public, it appears that you have to pay to download the data (the bucket is a requester-pays bucket).

AWS command line utility provides similar functionality for Amazon buckets.

ADD REPLY • link 5.4 years ago by GenoMax 148k

0

Thanks, unfortunately it seems like these are not public buckets.

$ aws s3 cp s3://sra-pub-run-4/SRR9849809/SRR9849809.1 ./
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

$ gsutil cp gs://sra-pub-run-4/SRR9849809/SRR9849809.1 ./
BadRequestException: 400 Bucket is requester pays bucket but no user project provided.

ADD REPLY • link 5.4 years ago by noodle ▴ 600

0

I think the buckets are public, but you need to pay for data egress. So you will have to provide a valid Google Cloud project name for billing.
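A rough sketch of what passing the billing information could look like. gsutil takes the project through its global `-u` option, and `aws s3 cp` has a `--request-payer requester` option. Whether the latter clears the 403 shown above is not confirmed in this thread. `requester_pays_cp`, `BILLING_PROJECT` and `DRY_RUN` are hypothetical names used only for this example.

```shell
# Copy one object from a requester-pays bucket, billing the caller.
# Set DRY_RUN=1 to print the command instead of running it.
# The body runs in a subshell "( ... )" so its variables stay local.
requester_pays_cp() (
    uri=$1
    dest=$2
    case $uri in
        gs://*)
            if [ -z "${BILLING_PROJECT:-}" ]; then
                echo "set BILLING_PROJECT to a project you can bill" >&2
                exit 1
            fi
            # gsutil's global -u option names the project to bill.
            set -- gsutil -u "$BILLING_PROJECT" cp "$uri" "$dest" ;;
        s3://*)
            # Acknowledge that the requester pays for the transfer.
            set -- aws s3 cp --request-payer requester "$uri" "$dest" ;;
        *)
            echo "unsupported URI: $uri" >&2
            exit 1 ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
)
```

Example: `BILLING_PROJECT=my-project requester_pays_cp gs://sra-pub-run-4/SRR9849809/SRR9849809.1 ./` (here `my-project` is a placeholder). Egress charges go to that project.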

ADD REPLY • link 5.4 years ago by GenoMax 148k

0

Any idea where I can initiate this? It's not so obvious from clicking around the NCBI/SRA website... I'll start a new thread.

ADD REPLY • link 5.4 years ago by noodle ▴ 600

0

Any idea where I can initiate this?

Initiate? You need a valid google compute account (which can be set up using directions here). Generally you would have access via your institution (since they will pay for your account). Unless you intend to use google compute for analysis you may be best off getting the data via ENA link provided by @Jean above.

ADD REPLY • link 5.4 years ago by GenoMax 148k

0

Yes, of course. I regularly use AWS and have the gs utils installed as well. It seems we're right at the transition period. https://www.nlm.nih.gov/news/NLM_Moves_SRA_Cloud.html

ADD REPLY • link 5.4 years ago by noodle ▴ 600

1

As long as NCBI keeps free access available via sratoolkit (and ENA keeps fastq files available, which I believe they have committed to doing) all should be well. Not everyone would be able to have google/AWS accounts and pay for data downloads.
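For reference, a minimal sketch of the free sratoolkit route, assuming `prefetch` and `fastq-dump` from sra-tools are on your PATH. `fastq-dump` is used here as an assumption, since `fasterq-dump` may not accept PacBio runs in every sra-tools release. `run`, `sra_to_fastq` and `DRY_RUN` are hypothetical names for this example.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download a run with prefetch, then convert it to gzipped FASTQ.
sra_to_fastq() {
    run prefetch "$1" && run fastq-dump --gzip "$1"
}
```

Try `DRY_RUN=1 sra_to_fastq SRR9849809` first to see the two commands, then drop `DRY_RUN=1` to run them.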

ADD REPLY • link 5.4 years ago by GenoMax 148k

0

Sure, fasterq-dump is great.

I started a new thread here if you want to follow.

NCBI SRA AWS AMI

ADD REPLY • link 5.4 years ago by noodle ▴ 600

1

joe: NCBI SRA support indicated that the cloud services are not ready for public use (August 2019). Some data downloads will require payment, some will not. A public announcement about the cloud services will be coming in the near future.