Question

SRA File Content

1

9.2 years ago

me ▴ 10

Once an SRA file has been downloaded from NCBI using prefetch -c <SRA number>, how do I see its contents? Specifically, I want to see what columns/tables it has. I don't want the fasta/fastq files from fastq-dump; I want the actual movie metadata and .bax/.bas files from the PacBio sequencing run.

How do I get these pieces of information?
How do I see the columns/tables of the .sra file?

sequencing sra pacbio NGS • 4.6k views

ADD COMMENT • link updated 9.2 years ago by GenoMax 151k • written 9.2 years ago by me ▴ 10

score 1 · Answer 1 · 2016-03-09

1

9.2 years ago

GenoMax 151k

AFAIK, for PacBio data the information you want (the *.h5 files) is only available under the "Download" tab of a PacBio SRA record, and ONLY if it was provided by the submitter. See this example (click on the "Download" tab).

ADD COMMENT • link 9.2 years ago by GenoMax 151k

0

I know, but this whole thing just seems fragmented. NCBI provides the SRA toolkit, but what it gives you is different from the download file they provide in their link, and IMO there isn't enough documentation on this or on the toolkit.

I think I must be missing something as I don't see how/why the SRA file would not contain the data which is clearly already on the record (in the "Download" tab). I assume this is in the SRA file under a table where we can retrieve it.
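
One way to check which tables and columns a downloaded run actually contains is the vdb-dump tool that ships with the SRA toolkit. The sketch below only lists tables and columns; it does not recreate .bax/.bas files. The function name show_sra_layout and the VDB_DUMP override variable are hypothetical, and the column listing is assumed to cover the default table unless a table is named with -T.

```shell
#!/usr/bin/env bash
# Sketch only: inspects the layout of a run, nothing is converted or extracted.
# VDB_DUMP is a hypothetical override variable; by default vdb-dump is used.

show_sra_layout() {
  local acc="$1" tool="${VDB_DUMP:-vdb-dump}"
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "error: $tool not found (it ships with the SRA toolkit)" >&2
    return 127
  fi
  echo "== tables in $acc =="
  "$tool" --table_enum "$acc"          # -E: enumerate tables
  echo "== columns in $acc =="
  # -o: enumerate columns in short form (assumed: default table; add -T <table> for another)
  "$tool" --column_enum_short "$acc"
}
```

Usage would be along the lines of `show_sra_layout SRR1772703`, or the same call with a path to a local .sra file.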

ADD REPLY • link 9.2 years ago by me ▴ 10

1

I contacted SRA tech support about a year back, and this was their (paraphrased) response.

In cases where the data submitters provide the original PacBio files (metadata.xml, *.bax.h5 and *.bas.h5), they can be found under the "Download" tab as a "tgz" archive.

Even though the SRA archive contains all of the PacBio hdf5 data, it is not possible to reconstitute the original files, since the data formatting schema for PacBio hdf5 files is proprietary. As a result, there is no "pacbio-dump" utility.

I am aware that the PacBio hdf5 format is "not proprietary", but the above is what I got from SRA support. You can ask them again in case anything has changed since.

ADD REPLY • link 9.2 years ago by GenoMax 151k

0

Yet if we run fastq-dump --fasta on the SRA file, the resulting fasta file would basically have all the sequenced reads from that run, correct?

ADD REPLY • link 9.2 years ago by me ▴ 10

0

That file should have all the reads that were generated from the raw data (*.h5) after processing with PacBio software (e.g. SMRT Portal). Whether those are "reads_of_insert" or "subreads" would depend on the analysis protocol used.

ADD REPLY • link 9.2 years ago by GenoMax 151k

1

The data under the download tab is simply a link to the original compressed *.h5 files as provided by the submitters.

ADD REPLY • link 9.2 years ago by GenoMax 151k

0

Given a list of SRRs and a little bit of scripting, one can build an automatic downloader based on this sort of command:

curl 'https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1772703' 2>/dev/null | grep hdf5
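
The "little bit of scripting" might look like the sketch below. It assumes the run page still embeds direct links containing the string hdf5, which may no longer be the case. The function names and the one-accession-per-line input file are hypothetical. It only prints the links it finds, so they can be reviewed before anything is downloaded.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the SRA run page still embeds links containing "hdf5".

# Build the run-page URL for one accession (same URL as the curl command above).
run_url() {
  printf 'https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=%s\n' "$1"
}

# Read HTML on stdin and print each distinct http(s)/ftp link that mentions hdf5.
extract_hdf5_links() {
  grep -Eo "(https?|ftp)://[^\"'<> ]*hdf5[^\"'<> ]*" | sort -u
}

# For every accession in a file (one per line), print "<SRR><TAB><link>".
list_hdf5_links() {
  local srr link
  while IFS= read -r srr; do
    [ -n "$srr" ] || continue            # skip blank lines
    curl -s "$(run_url "$srr")" </dev/null | extract_hdf5_links |
      while IFS= read -r link; do
        printf '%s\t%s\n' "$srr" "$link"
      done
  done < "$1"
}
```

With the accessions in a file such as srr_list.txt (a made-up name), `list_hdf5_links srr_list.txt` would print one tab-separated line per link, which could then be passed to curl or wget.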

ADD REPLY • link 8.0 years ago by dmathog ▴ 40

score 0 · Answer 2 · 2016-03-09

0

9.2 years ago

curious ▴ 50

Hi,

I'm in the same boat as you are. I'm currently using the SRAdb package (R) to retrieve SRA metadata by accession ID. Maybe it'll be useful to you.

I was interested in getting the platform information and was able to find it.

ADD COMMENT • link 9.2 years ago by curious ▴ 50

0

Thanks @curious. Were you able to extract the .bax/.bas and movie metadata files using this?