Hi,
I would like to ask whether it is possible to determine if the reads in an SRA object are CLR or HiFi using vdb-dump. I'm unsure which field or column contains this information. Any guidance would be greatly appreciated.
SRA typically mangles PacBio metadata, especially if the data has been uploaded as FASTA/FASTQ and the read names have been stripped. In this case, because the instrument metadata is "Instrument: PacBio RS II", we can tell that this is CLR data. The "HiFi" data type was not available on the RS II and didn't show up until the Sequel II instrument, so this information couldn't have been entered as metadata. There's a little more information about how the data was generated in the methods section of the publication.
If you have access to the BAM (which you can find by following the links to the Schatz lab website from the Data Access section in the publication), you can inspect the BAM header and read names.
In this case, the header doesn't give us much information, but the first read name can tell us a lot if we compare it to the PacBio BAM spec:
$ samtools view http://labshare.cshl.edu/shares/schatzlab/www-data/skbr3/reads_lr_skbr3.fa_ngmlr-0.2.3_mapped.bam | head -n1 | cut -f1
m141202_135223_42137_c100730962550000001823142605141547_s1_p0/136344/0_12909
HiFi reads, which are a filtered subset of CCS reads, have the naming pattern {movieName}/{holeNumber}/ccs.
CLR libraries have subreads for output, and the pattern is {movieName}/{holeNumber}/{qStart}_{qEnd}. (Since PacBio instruments don't output subreads anymore, you have to go back to older BAM specs to find this pattern.)
The read above matches the pattern for a subread.
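As a rough illustration of the comparison above, here is a minimal bash sketch that checks a read name against the two patterns quoted in this post. The function name classify_read_name is made up for this example, and the regular expressions are my own reading of the patterns (they assume the movie name contains no "/" and that hole numbers and coordinates are plain integers). A result of "ccs" only says the name has the CCS form; it does not prove the read passed the HiFi accuracy filter.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a PacBio read name by its naming pattern.
# Patterns follow the two forms quoted above:
#   {movieName}/{holeNumber}/ccs              -> ccs
#   {movieName}/{holeNumber}/{qStart}_{qEnd}  -> subread
classify_read_name() {
  local name="$1"
  local ccs_re='^[^/]+/[0-9]+/ccs$'
  local subread_re='^[^/]+/[0-9]+/[0-9]+_[0-9]+$'
  if [[ "$name" =~ $ccs_re ]]; then
    echo "ccs"
  elif [[ "$name" =~ $subread_re ]]; then
    echo "subread"
  else
    # e.g. names replaced by SRA with sequential identifiers
    echo "unknown"
  fi
}

# The first read name from the BAM above
classify_read_name "m141202_135223_42137_c100730962550000001823142605141547_s1_p0/136344/0_12909"
```

For the read name shown earlier this prints "subread". Names that SRA has replaced with sequential identifiers fall through to "unknown", which is the case where only the instrument metadata can help.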
Billy Rowell: Is there a help page on the PacBio site that lists the type of data produced by each sequencer model?
Thanks Billy Rowell. I also examined that BAM, but it was aligned against the hg19 genome. It's also true that the methods section doesn't provide much information on filtering. I also have another FASTA file where reads are labeled with the SRA accession followed by a number; I was unaware of the significance of these identifiers. As mentioned by GenoMax, where can I access this information? I'm also seeking guidance on nanopore sequencing data, given that minimap2 makes a distinction for nanopore data as well (-ax map-ont or lr:hq).
I'm not aware of a table that describes data type by instrument, but it's useful to have a few definitions up front.
CLR and HiFi are really describing _library_ types. The goal of CLR libraries is to generate templates that will provide continuous long reads, getting ~1 full read of the template per run. The goal of HiFi libraries is to generate templates that can be sequenced multiple times per template, and these templates tend to be shorter. The multiple passes from HiFi libraries are used to generate a single-molecule computational consensus for the template sequence (CCS, circular consensus sequence). The CCS process existed before HiFi libraries. The big change with "HiFi" libraries is that the output is filtered such that the predicted error rate per read is <1% (>99%/Q20 accuracy). CCS consensus reads have been possible to generate since at least the RS II, but people typically weren't making WGS libraries intended for CCS at the time.
SRA renames reads from uploaded FASTA/FASTQ files with sequential identifiers and strips the original names. Sometimes, especially with datasets generated more recently, people upload the unaligned BAM output directly from PacBio instruments and provide links from SRA. If available, I'd always recommend using these uBAMs instead of the data that has been processed by SRA. For your use case with this CLR dataset, using the processed reads from SRA will be fine.
For CLR (subreads), use minimap2 -ax map-pb ... (link) or pbmm2 align --preset SUBREAD ... (link).
pbmm2 is a PacBio-developed frontend for minimap2 with some convenience functions. In general, I would always recommend pbmm2 for newer datasets and compatibility with downstream PacBio tools, but minimap2 will also be fine for your use case.
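To make the preset choice concrete, here is a small bash sketch that maps a read type to a minimap2 preset. The helper name minimap2_preset and the file names ref.fa, reads.fq and aln.sam are placeholders invented for this example. The map-pb and map-ont presets are the ones mentioned in this thread; map-hifi is the minimap2 preset for HiFi/CCS reads in recent releases, so check minimap2 --help for your installed version. The script only prints the command so you can review it before running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick a minimap2 preset for a given read type.
minimap2_preset() {
  case "$1" in
    clr|subread) echo "map-pb" ;;    # PacBio CLR / subreads
    hifi|ccs)    echo "map-hifi" ;;  # PacBio HiFi / CCS (recent minimap2 releases)
    ont)         echo "map-ont" ;;   # Oxford Nanopore reads
    *)           echo "unknown read type: $1" >&2; return 1 ;;
  esac
}

# Build (but do not run) the alignment command for a CLR dataset.
# ref.fa, reads.fq and aln.sam are placeholder file names.
preset=$(minimap2_preset clr)
echo "minimap2 -ax ${preset} ref.fa reads.fq > aln.sam"
```

For the CLR dataset discussed here this prints a command using -ax map-pb, matching the recommendation above.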
where reads are labeled with accession SRA followed by a number.
There is an option -F for fastq-dump which should remove that SRA accession and report the FASTQ headers in their original format (unless they were stripped by the submitters). It works for Illumina data; I don't know if PacBio headers are always stripped by SRA.
I doubt there is going to be a specific field in SRA metadata that will track this type of information. Do you have any example accessions you are looking at?
Hi GenoMax, it is SRR7346978. Could you indicate in which columns this type of information is typically located?