Question

What additional info is found in BAM files from PacBio HiFi compared to Fastq files

0

3 months ago

Eric Normandeau 11k

Background:

One of the sequencing facilities we work with used to provide both BAM and fastq.gz files for PacBio HiFi reads. Now they only provide BAM files. Since the BAM files are about 10 times bigger than the fastq.gz ones, I'm considering getting rid of them once I've created the fastq.gz files with samtools fastq.

For these projects, we're doing genome assembly.

Question:

What additional information do the BAM files contain compared to the Fastq ones, and for what kinds of analyses is it useful?

bam hifi fastq pacbio • 1.3k views

ADD COMMENT • link updated 3 months ago by Billy Rowell ▴ 510 • written 3 months ago by Eric Normandeau 11k

1

Moved this to the answer below

ADD REPLY • link 3 months ago by Billy Rowell ▴ 510

0

Thank you!

I'd like to accept this as the correct answer if you agree to cut/paste into an answer rather than a comment reply.

ADD REPLY • link 3 months ago by Eric Normandeau 11k

0

PacBio describes their extended BAM format spec here --> https://pacbiofileformats.readthedocs.io/en/13.0/BAM.html

ADD REPLY • link 3 months ago by GenoMax 150k

0

Thanks. I just re-read the document in case I had skimmed it too fast the first time, but it doesn't quite answer the first part of my question and does not provide answers for the second part.

ADD REPLY • link 3 months ago by Eric Normandeau 11k

1

Looks like you will lose information about kinetics/base modifications but if you only care about sequence then fastq sequence may be enough.

I am going to tag Billy Rowell from PacBio. Hopefully we will get an authoritative answer.

ADD REPLY • link 3 months ago by GenoMax 150k

1

3 months ago

Istvan Albert 102k

The proper answer to this is that you have to look at the tags in the BAM file; those show what additional information is stored for each read. For example, in one file I see:

RG:Z:797d41a8   ac:B:i,60,0,36,24   ec:f:55.0811    ma:i:0  np:i:60 rq:f:0.999997   sn:B:f,9.44502,13.8063,3.47653,6.21549  we:i:10754634   ws:i:98819  zm:i:46465666
MM:Z:C+m,21,55,4,3,12,1,2,0,0,4,4,1,2,0,2,1,0,0,3,0,0,6,3,7,0,0,4,5,2,2,3,2,18,4,6,18,9,4,3,8,8,9,6,3,5,3,17,9,1,5,7,6,2,13,3,13,6,3,1,0,15,2,4,7,0,6,5,3,4,5,1,1,7,6,10,19,10,0,1,3,1,2,1,6,2,40,9,4,4,0,6,0,2,9,6,1,5,2,6,2,30,3,9,8,5,0,0,0,0,0,0,0,0;   
ML:B:C,5,19,213,3,16,0,0,0,0,3,30,137,9,20,4,5,8,2,0,8,18,100,8,14,62,230,12,3,23,135,89,201,12,2,77,4,16,185,28,8,1,18,16,215,3,17,10,6,5,75,22,123,2,7,50,42,207,1,8,0,0,23,9,117,3,21,125,13,137,54,109,31,20,81,12,15,34,1,66,8,3,17,2,19,134,193,2,1,0,2,4,0,199,2,129,10,1,11,13,111,34,7,2,61,1,11,3,26,6,0,4,1,10

It seems the largest piece of information is the methylation tags.
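
To see which tags a given file actually carries, the records can be summarized rather than read by eye. The sketch below is only an illustration: `list_tags` is a hypothetical helper name, `in.bam` is a placeholder, and it assumes the input is SAM text (for example from `samtools view`) with optional fields starting at column 12.

```shell
#!/bin/sh
# list_tags: read SAM-formatted records on stdin and print each optional
# tag with its type code and the number of records that carry it.
# Optional fields start at column 12 and look like TAG:TYPE:VALUE.
list_tags() {
  awk -F'\t' '
    /^@/ { next }                      # skip header lines if present
    {
      for (i = 12; i <= NF; i++) {
        split($i, parts, ":")
        count[parts[1] ":" parts[2]]++
      }
    }
    END {
      for (tag in count) print tag "\t" count[tag]
    }
  ' | sort
}

# Hypothetical usage (in.bam is a placeholder; needs samtools on PATH):
#   samtools view in.bam | head -n 1000 | list_tags
```

Sampling the first thousand records is usually enough to see which tags are present, though it would miss a tag that only appears later in the file.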

I will say, though, that I was surprised that the size difference is 10x.

The additional information seems to be less than 2x more, but perhaps the answer here is that the FASTQ file compresses much better because its content (bases and quality scores) is so uniform.

ADD COMMENT • link 3 months ago by Istvan Albert 102k

1

Bases (N=4) and binned base quality scores (N=~7) are highly compressible. ML and MM tags have a lot more dynamic range and variability, and aren't as easily compressed.
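
The effect of alphabet size on compression can be checked with a rough experiment. This is a sketch under strong assumptions: both inputs are uniformly random, so the random bytes are only a crude stand-in for ML-like arrays (real ML values are not uniform), and plain gzip is used in place of the BGZF blocks inside a BAM.

```shell
#!/bin/sh
# Compare how well gzip compresses a 4-letter alphabet versus arbitrary bytes.
# Both inputs are random, so this isolates the effect of alphabet size only.
n=200000

# n random characters drawn from A, C, G, T (stand-in for bases):
# each random byte is mapped to one of the four letters.
bases_size=$(head -c "$n" /dev/urandom | od -An -v -tu1 |
  awk '{ for (i = 1; i <= NF; i++) printf "%s", substr("ACGT", ($i % 4) + 1, 1) }' |
  gzip -c | wc -c | tr -d ' ')

# n random bytes using all 256 values (crude stand-in for ML-like arrays)
bytes_size=$(head -c "$n" /dev/urandom | gzip -c | wc -c | tr -d ' ')

echo "4-letter alphabet: $bases_size bytes after gzip"
echo "full byte range:   $bytes_size bytes after gzip"
```

Real reads compress better than random bases because of repeats, so this understates the gap rather than exaggerating it.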

ADD REPLY • link 3 months ago by Billy Rowell ▴ 510

0

I do think compression ratios for the sequences and qualities are much greater than for the rest. I myself was surprised to get fastq.gz files of about 27 GB by two different approaches from a ~280 GB BAM file. I guess they set the sequencer to extra verbose.

For genome assembly, would you think all but the sequences and their quality can be thrown out? This is already what I believe but I want to make sure before I ditch a few TBs of raw data :D

ADD REPLY • link 3 months ago by Eric Normandeau 11k

2

A standard Revio output should be ~50-60GB max. Your service provider probably saved consensus kinetics to your BAM as well, which includes 4 more arrays and 2 more ints per read (fi, fp, fn, ri, rp, rn). Check to see if your BAM has both kinetics (the tags in the previous sentence) and basemods (MM, ML). If it has both, removing the kinetics tags will greatly reduce the size of your BAM. You can remove the tags using something like:

samtools view -@7 --bam \
  --remove-tag fi,fp,fn,ri,rp,rn \
  --output out.bam in.bam

ADD REPLY • link updated 3 months ago by Eric Normandeau 11k • written 3 months ago by Billy Rowell ▴ 510

score 5 · Accepted Answer · 2025-01-23

For a true "authoritative" answer, I'd email support@pacb.com. I work at PacBio, but I have no authority.

What additional information do the BAM files contain compared to the Fastq ones

As Istvan Albert and GenoMax suggested, this part of your question is answered by comparing the headers and tags in your BAM with the SAM spec, the SAM optional field spec (for more information on standard SAM tags), and the PacBio BAM spec (for PacBio-specific tags). You can cross-reference the tags in your BAM against their documented purpose in those specs. The RG header alone carries a lot of detailed information about how the data was generated.

A short, non-exhaustive list of the types of data in these SAM tags:

ZMW (zm)
indications of the first and last polymerase base that contributed to the consensus (ws, we)
number of subread passes that contributed to the consensus HiFi read (np, ec)
consensus read quality (rq)
indications for how the hairpin adapters were called and removed (ac, ma)
barcode information (bc, bq, etc.), which is required to re-demultiplex the dataset if it was demultiplexed incorrectly
consensus kinetic information, used to identify base modifications (fp, fi, fn, rp, ri, rn)
base modification types, positions, and qualities (MM, ML)

Compared to the BAM, the FASTQ has much less data. It contains only:

the read "query" name. This is actually pretty informative, as it can indicate the instrument type/serial, the movie name, and the ZMW
the HiFi bases
the HiFi base quality
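
As an illustration of what the read name carries, here is a minimal sketch that splits it into fields. It assumes the `movie/zmw/ccs` layout described in the PacBio BAM spec; `parse_hifi_name` is a hypothetical helper, and the movie name in the example is made up.

```shell
#!/bin/sh
# parse_hifi_name: split a read name of the form movie/zmw/suffix into fields.
# Runs in a subshell (parentheses body) so its variables do not leak.
parse_hifi_name() (
  movie=${1%%/*}          # everything before the first slash
  rest=${1#*/}            # everything after the first slash
  zmw=${rest%%/*}         # hole number: up to the next slash
  suffix=${rest#*/}       # remainder, e.g. "ccs"
  printf 'movie=%s zmw=%s suffix=%s\n' "$movie" "$zmw" "$suffix"
)

# The read name below is made up for illustration.
parse_hifi_name 'm84000_240101_000000_s1/46465666/ccs'
```

Names that do not follow this three-part layout would need different handling.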

for what kind of analyses [is extra BAM information] useful

tracing provenance of the data
binning or filtering the reads by estimated quality
correcting demultiplexing if it was done incorrectly the first time
identifying modified bases
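
Filtering by estimated quality, for instance, could be sketched as below. This assumes SAM text on stdin and an `rq:f:` tag holding the predicted read accuracy as a fraction; `filter_by_rq`, `in.bam`, and `q30.bam` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# filter_by_rq MIN: read SAM records on stdin and print those whose rq tag
# (predicted read accuracy, a float between 0 and 1) is >= MIN.
# Records without an rq tag are dropped.
filter_by_rq() {
  awk -F'\t' -v min="$1" '
    /^@/ { print; next }               # pass header lines through
    {
      for (i = 12; i <= NF; i++) {
        if ($i ~ /^rq:f:/) {
          # substr($i, 6) is the value after the 5-character "rq:f:" prefix
          if (substr($i, 6) + 0 >= min + 0) print
          next
        }
      }
    }
  '
}

# Hypothetical usage: keep reads with predicted accuracy >= 0.999 (Q30)
#   samtools view -h in.bam | filter_by_rq 0.999 | samtools view --bam --output q30.bam -
```

None of this is possible from the FASTQ alone, since the rq value is not carried over by default.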

And from the comment on @ialbert's answer:

A standard Revio output should be ~50-60GB max. Based on the file size you mention in the comment on the other answer, your service provider probably saved consensus kinetics to your BAM as well, which includes 4 more arrays and 2 more ints per read (fi, fp, fn, ri, rp, rn). Check to see if your BAM has both kinetics (the tags in the previous sentence) and basemods (MM, ML). If it has both, removing the kinetics tags will greatly reduce the size of your BAM. You can remove the tags using something like:

samtools view -@7 --bam \
  --remove-tag fi,fp,fn,ri,rp,rn \
  --output out.bam in.bam
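
Before stripping anything, the check suggested above (does the BAM carry kinetics, basemods, or both?) could be sketched as follows. It assumes SAM text on stdin; `check_pacbio_tags` is a hypothetical helper and `in.bam` a placeholder.

```shell
#!/bin/sh
# check_pacbio_tags: read SAM records on stdin and report whether any record
# carries consensus kinetics tags (fi fp fn ri rp rn) or base modification
# tags (MM ML).
check_pacbio_tags() {
  awk -F'\t' '
    /^@/ { next }                      # skip header lines if present
    {
      for (i = 12; i <= NF; i++) {
        tag = substr($i, 1, 2)
        if (tag ~ /^(fi|fp|fn|ri|rp|rn)$/) kinetics = 1
        if (tag ~ /^(MM|ML)$/) basemods = 1
      }
    }
    END {
      print "kinetics: " (kinetics ? "yes" : "no")
      print "basemods: " (basemods ? "yes" : "no")
    }
  '
}

# Hypothetical usage (needs samtools on PATH):
#   samtools view in.bam | head -n 1000 | check_pacbio_tags
```

If it reports kinetics without basemods, it may be worth confirming with the provider before deleting the kinetics, since they are what base modification calls are derived from.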