Once the SRA file has been downloaded from NCBI using prefetch -c <SRA number>, how do I see its contents? Specifically, I want to see what columns/tables it has. No, I don't want the FASTA/FASTQ files from fastq-dump; I want to actually get the movie metadata and the .bax/.bas files from the PacBio sequencing run.
AFAIK, for PacBio data the information you want (the *.h5 files) is only available in the "Download" tab of a PacBio SRA record, and ONLY if it was provided by the submitter. See this example (click on the "Download" tab).
I know, but it just seems that this whole thing is fragmented. NCBI provides the SRA toolkit, but what it produces is different from the download file they provide in their link...and imo, there isn't enough documentation on this or on the toolkit.
I think I must be missing something as I don't see how/why the SRA file would not contain the data which is clearly already on the record (in the "Download" tab). I assume this is in the SRA file under a table where we can retrieve it.
I contacted SRA tech support about a year back and this was their (paraphrased) response.
In cases where the data submitters provide original PacBio files (metadata.xml, *.bax.h5 and *.bas.h5 ) they can be found under the
"Download" tab as a "tgz" archive.
Even though the SRA archive contains all of the PacBio hdf5 data, it is not
possible to reconstitute the original files since the data formatting
schema for PacBio hdf5 files is proprietary. As a result there is no
"pacbio-dump" utility.
I am aware that the PacBio hdf5 format is "not proprietary", but the above is what I got from SRA support. You can ask them again in case anything has changed since.
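On the original question of seeing which tables and columns an .sra file holds, the toolkit's vdb-dump can enumerate them. Below is a minimal Python sketch that only builds and runs such commands. It assumes vdb-dump from the SRA Toolkit is on your PATH and that your installed version supports the --table_enum, --column_enum_short and --table options (check vdb-dump --help). The helper names vdb_dump_command and run are made up for illustration, and the table names you get back depend on how the run was loaded.

```python
import shutil
import subprocess


def vdb_dump_command(sra_path, table=None, list_tables=False, list_columns=False):
    """Build an argv list for vdb-dump (hypothetical helper, not part of the toolkit)."""
    cmd = ["vdb-dump"]
    if list_tables:
        # Enumerate the tables stored in the archive.
        cmd.append("--table_enum")
    if list_columns:
        # Enumerate the columns (short form) of the selected table.
        cmd.append("--column_enum_short")
    if table is not None:
        # Restrict the operation to a single named table.
        cmd += ["--table", table]
    cmd.append(sra_path)
    return cmd


def run(cmd):
    """Run a command and return its stdout; fails early if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} not found on PATH (is the SRA Toolkit installed?)")
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, run(vdb_dump_command("SRRXXXXXXX.sra", list_tables=True)) should print the table names, and passing table="<one of those names>" with list_columns=True should list that table's columns. SRRXXXXXXX.sra is a placeholder path. This only shows what is stored in the archive; it does not rebuild the original .bax.h5/.bas.h5 files.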
Yet if we run
fastq-dump --fasta
on the SRA file, the resulting FASTA file would basically have all the sequenced reads from that run, correct?
That file should have all the reads that were generated from the raw data (*.h5) after processing with PacBio software (e.g. SMRT Portal; whether you get "reads_of_insert" or "subreads" would depend on the analysis protocol used).
The data under the download tab is simply a link to the original compressed *.h5 files as provided by the submitters.
Given a list of SRR accessions and a little bit of scripting, one can build an automatic downloader based on this sort of command: