Question

Determine type of PacBio read

1

8 months ago

María José ▴ 10

Hi,

Is it possible to determine whether the reads in an SRA object are CLR or HiFi using vdb-dump? I'm unsure which field or column contains this information. Any guidance would be greatly appreciated.

sratoolkit CLR Pacbio Hifi long-reads • 2.5k views

ADD COMMENT • link updated 8 months ago by GenoMax 153k • written 8 months ago by María José ▴ 10

0

I doubt there is going to be a specific field in SRA metadata that will track this type of information. Do you have any example accessions you are looking at?

ADD REPLY • link 8 months ago by GenoMax 153k

0

Hi GenoMax, it is SRR7346978. Could you indicate in which columns this type of information is typically located?

ADD REPLY • link 8 months ago by María José ▴ 10

score 1 · Answer 1 · 2024-12-17

1

8 months ago

Billy Rowell ▴ 510

SRA typically mangles PacBio metadata, especially if the data has been uploaded as FASTA/FASTQ and the read names have been stripped. In this case, because the instrument metadata is Instrument: PacBio RS II, we can tell that this is CLR data. "HiFi" data type was not available on the RS II and didn't show up until the Sequel II instrument, so this information wouldn't be available to be entered as metadata. There's a little more information about how the data was generated in the methods section of the publication.

If you have access to the BAM (which you can find by following the links to the Schatz lab website from the Data Access section in the publication), you can inspect the BAM header and read names.

In this case, the header doesn't give us much information, but the first read name can tell us a lot if we compare it to the PacBio BAM spec:

$ samtools view http://labshare.cshl.edu/shares/schatzlab/www-data/skbr3/reads_lr_skbr3.fa_ngmlr-0.2.3_mapped.bam | head -n1 | cut -f1
m141202_135223_42137_c100730962550000001823142605141547_s1_p0/136344/0_12909

HiFi reads, which are a filtered subset of CCS reads, have the naming pattern: {movieName}/{holeNumber}/ccs.

CLR libraries have subreads for output, and the pattern is {movieName}/{holeNumber}/{qStart}_{qEnd}. (Since PacBio instruments don't output subreads anymore, you have to go back to older BAM specs to find this pattern.)

The read above matches the pattern for a subread.
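As a rough illustration of the comparison above, here is a minimal Python sketch that checks a read name against the two patterns. The function name and the regular expressions are my own assumptions based on the patterns quoted here, not part of any PacBio tool, and anything that matches neither pattern (for example reads renamed by SRA) is reported as unknown. A "ccs" match only says the read is a consensus read; it cannot confirm the HiFi accuracy filter by itself.

```python
import re

# {movieName}/{holeNumber}/ccs
CCS_NAME = re.compile(r"^[^/]+/\d+/ccs$")
# {movieName}/{holeNumber}/{qStart}_{qEnd}
SUBREAD_NAME = re.compile(r"^[^/]+/\d+/\d+_\d+$")


def classify_read_name(name: str) -> str:
    """Return 'ccs', 'subread', or 'unknown' for a PacBio-style read name."""
    if CCS_NAME.match(name):
        return "ccs"
    if SUBREAD_NAME.match(name):
        return "subread"
    return "unknown"


# The first read name from the BAM shown above.
first_read = (
    "m141202_135223_42137_c100730962550000001823142605141547_s1_p0"
    "/136344/0_12909"
)
print(classify_read_name(first_read))  # prints: subread
```

In practice you could feed this the first few names from `samtools view ... | cut -f1` rather than a hard-coded string.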

ADD COMMENT • link 8 months ago by Billy Rowell ▴ 510

0

Billy Rowell : Is there a help page on the PacBio site that lists the type of data produced by each sequencer model?

ADD REPLY • link 8 months ago by GenoMax 153k

0

Thanks Billy Rowell. I also examined that BAM, but it was aligned against the hg19 genome. It's also true that the methods section doesn't provide much information on filtering. I also have another FASTA file where reads are labeled with accession SRA followed by a number. I was unaware of the significance of these identifiers. As GenoMax asked, where can I access this information? I'm also seeking guidance on nanopore sequencing data, given that minimap2 makes a distinction for nanopore data as well (-ax map-ont or lr:hq).

ADD REPLY • link 8 months ago by María José ▴ 10

2

I'm not aware of a table that describes data type by instrument, but it's useful to have a few definitions up front.

CLR and HiFi are really describing _library_ types. The goal of CLR libraries is to generate templates that will provide continuous long reads, getting ~1 full read of the template per run. The goal of HiFi libraries is to generate templates that can be sequenced multiple times per template, and these templates tend to be shorter. The multiple passes from HiFi libraries are used to generate a single-molecule computational consensus for the template sequence (CCS, circular consensus sequence). The CCS process existed before HiFi libraries. The big change with "HiFi" libraries is that the output is filtered such that the predicted error rate per read is <1% (>99%/Q20 accuracy). CCS consensus reads have been possible to generate since at least RS II, but people typically weren't making WGS libraries intended for CCS at the time.

RS II - CLR is primary data type, CCS possible for short libraries with compute off instrument
Sequel - CLR/CCS possible with compute off instrument
Sequel II - CLR/CCS possible, and the pbccs program can run directly on instrument to output "HiFi" data (CCS data >99% accuracy) directly from instrument
Revio - HiFi is the primary data type, it is no longer possible to get subreads from instrument
Vega - HiFi is the primary data type, it is no longer possible to get subreads from instrument
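If you want to apply the list above to SRA instrument metadata programmatically, a small lookup table is enough. This is only a sketch: the dictionary and function names are hypothetical, the keys follow the instrument names as written in this thread, and the exact strings that appear in SRA metadata may differ.

```python
# Primary data type per instrument, summarizing the list above.
# Keys are lower-cased instrument names as written in this thread (assumption:
# SRA metadata may spell them differently).
PRIMARY_DATA_TYPE = {
    "rs ii": "CLR",
    "sequel": "CLR or CCS",
    "sequel ii": "CLR or CCS/HiFi",
    "revio": "HiFi",
    "vega": "HiFi",
}


def expected_data_type(instrument_model: str) -> str:
    """Map an instrument string such as 'PacBio RS II' to its usual data type."""
    key = instrument_model.strip().lower()
    # Drop a leading vendor name if present, e.g. "pacbio rs ii" -> "rs ii".
    if key.startswith("pacbio"):
        key = key[len("pacbio"):].strip()
    return PRIMARY_DATA_TYPE.get(key, "unknown")


print(expected_data_type("PacBio RS II"))  # prints: CLR
```

For the instruments where both CLR and CCS are possible, the instrument name alone is not enough and you still need the read names or the methods section.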

SRA renames reads from uploaded FASTA/FASTQ files with sequential identifiers and strips the original names. Sometimes, especially with datasets generated more recently, people upload the unaligned BAM output directly from PacBio instruments and provide links from SRA. If available, I'd always recommend using these uBAMs instead of the data that has been processed by SRA. For your use case with this CLR dataset, using the processed reads from SRA will be fine.

For CLR (subreads), use minimap2 -ax map-pb ... (link) or pbmm2 align --preset SUBREAD ... (link).

pbmm2 is a PacBio-developed frontend for minimap2 with some convenience functions. In general, I would always recommend pbmm2 for newer datasets and compatibility with downstream PacBio tools, but minimap2 will also be fine for your use case.
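To make the preset choice explicit in a script, something like the following sketch could build the minimap2 command line from the read type. The helper name and the file names are placeholders of mine; map-pb and map-ont are the presets mentioned in this thread, and map-hifi is the HiFi preset in recent minimap2 releases, so check `minimap2 --help` for the version you have installed.

```python
import shlex

# Read type -> minimap2 preset. Verify these against your minimap2 version.
MINIMAP2_PRESETS = {
    "clr": "map-pb",     # PacBio CLR / subreads
    "hifi": "map-hifi",  # PacBio HiFi / CCS
    "ont": "map-ont",    # Oxford Nanopore
}


def minimap2_command(read_type: str, reference: str, reads: str) -> list:
    """Build (but do not run) a minimap2 SAM-output command for a read type."""
    preset = MINIMAP2_PRESETS[read_type.lower()]
    return ["minimap2", "-ax", preset, reference, reads]


# Placeholder file names for illustration only.
cmd = minimap2_command("clr", "ref.fa", "reads.fastq")
print(shlex.join(cmd))  # prints: minimap2 -ax map-pb ref.fa reads.fastq
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues with unusual file names.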

ADD REPLY • link 8 months ago by Billy Rowell ▴ 510

0

Thanks. I would like to ask how to list the specific columns of an SRA object using vdb-dump. I have tried the --columns option, but it only shows the usage message. Moreover, when I use the -E option alone, I get only one line: "tbl:Sequence."

ADD REPLY • link 8 months ago by María José ▴ 10

0

Curious as to why you want to do this?

ADD REPLY • link 8 months ago by GenoMax 153k

1

where reads are labeled with accession SRA followed by a number.

There is an option -F for fastq-dump which should remove that SRA accession and report the FASTQ headers in their original format (unless they were stripped by the submitters). It works for Illumina data; I don't know if PacBio headers are always stripped by SRA.

ADD REPLY • link 8 months ago by GenoMax 153k

0

Thanks GenoMax. I've used fasterq-dump, but I don't see any argument similar to the -F option of fastq-dump.

ADD REPLY • link 8 months ago by María José ▴ 10

0

It appears that the PacBio headers have been stripped from the submitted data, either by the submitters or by SRA.

ADD REPLY • link 8 months ago by GenoMax 153k