SRA File Content
8.7 years ago
me ▴ 10

Once the SRA file has been downloaded from NCBI using prefetch -c <SRA number>, how do I see its contents? Specifically, I want to see what columns/tables it has. No, I don't want the fasta/fastq files from fastq-dump; I want the actual movie metadata and .bax/.bas files from the PacBio sequencing run.

  1. How do I get these pieces of information?
  2. How do I see the columns/tables of the .sra file?
sequencing sra pacbio NGS • 4.3k views
8.7 years ago
GenoMax 147k

AFAIK, for PacBio data the information you want (the *.h5 files) is only available in the "Download" tab of a PacBio SRA record, and ONLY if it was provided by the submitter. See this example (click on the "Download" tab).


I know, but it just seems that this whole thing is fragmented. NCBI provides the SRA toolkit, but what it produces is different from the download file they provide in their link, and IMO there isn't enough documentation on this or on the toolkit.

I think I must be missing something as I don't see how/why the SRA file would not contain the data which is clearly already on the record (in the "Download" tab). I assume this is in the SRA file under a table where we can retrieve it.


I contacted SRA tech support about a year back, and this was their (paraphrased) response.

In cases where the data submitters provide original PacBio files (metadata.xml, *.bax.h5 and *.bas.h5 ) they can be found under the "Download" tab as a "tgz" archive.

Even though the SRA archive contains all of the PacBio hdf5 data, it is not possible to reconstitute the original files, since the data formatting schema for PacBio hdf5 files is proprietary. As a result there is no "pacbio-dump" utility.

I am aware that the PacBio hdf5 format is "not proprietary", but the above is what I got from SRA support. You can ask them again in case anything has changed since.


Yet if we run fastq-dump --fasta on the SRA file, the resulting fasta file would basically have all the sequenced reads from that run, correct?


That file should have all the reads that were generated from the raw data (*.h5) after processing with PacBio software (e.g. SMRT Portal). Whether those are "reads_of_insert" or "subreads" depends on the analysis protocol used.


The data under the download tab is simply a link to the original compressed *.h5 files as provided by the submitters.


Given a list of SRRs and a little bit of scripting, one can build an automatic downloader based on this sort of command:

curl -s "https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1772703" | grep hdf5
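
A rough sketch of how that one-liner could be wrapped for a list of accessions. The function names (run_url, extract_hdf5_links, list_hdf5_links) are made up for illustration, and the link extraction assumes the page source contains the hdf5 archive links as plain absolute URLs. That may not hold if the page layout changes or the links are filled in by JavaScript, so check the output for one known accession first.

```shell
#!/usr/bin/env bash
# Sketch only: list candidate hdf5 download links for a set of SRR accessions.

# Build the run-browser URL for one accession (same URL pattern as the curl command above).
run_url() {
    printf 'https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=%s\n' "$1"
}

# Read HTML on stdin and print each distinct absolute URL that mentions "hdf5".
# Assumption: the links appear as plain http(s)/ftp URLs in the page source.
extract_hdf5_links() {
    grep -Eo "(https?|ftp)://[^\"'<> ]*hdf5[^\"'<> ]*" | sort -u
}

# For every accession in the given file (one per line), fetch the page and print its links.
list_hdf5_links() {
    local acc
    while IFS= read -r acc; do
        [ -z "$acc" ] && continue    # skip blank lines
        curl -s "$(run_url "$acc")" | extract_hdf5_links
    done < "$1"
}
```

After sourcing the file, list_hdf5_links srr_list.txt > links.txt would collect the links for every accession in srr_list.txt (a hypothetical file name). The links can then be reviewed by hand before downloading anything, which is worth doing since the archives can be large.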
8.7 years ago
curious ▴ 50

Hi,

I'm in the same boat as you are. I'm currently using the SRAdb package (R) to retrieve SRA metadata by accession ID. Maybe it'll be useful to you.

I was interested in getting platform information and could find it.


Thanks @curious. Were you able to extract the .bax/.bas and movie metadata files using this?


Sorry, I'm not familiar with those formats.
