What additional info is found in BAM files from PacBio HiFi compared to FASTQ files?
3 months ago

Background:

One of the sequencing facilities we work with used to provide both BAM and fastq.gz files for PacBio HiFi reads. Now they only provide BAM files. Since the BAM files are about 10 times bigger than the fastq.gz ones, I'm considering getting rid of them once I've created the fastq.gz files with samtools fastq.

For these projects, we're doing genome assembly.

Question:

What additional information do the BAM files contain compared to the FASTQ ones, and for what kinds of analyses is it useful?

bam hifi fastq pacbio • 1.3k views

Moved this to the answer below


Thank you!

I'd like to accept this as the correct answer if you agree to cut/paste into an answer rather than a comment reply.


PacBio describes their extended BAM format spec here --> https://pacbiofileformats.readthedocs.io/en/13.0/BAM.html


Thanks. I just re-read the document in case I had skimmed it too fast the first time, but it doesn't quite answer the first part of my question and does not provide answers for the second part.


It looks like you will lose information about kinetics and base modifications, but if you only care about the sequence then FASTQ may be enough.

I am going to tag Billy Rowell from PacBio. Hopefully we will get an authoritative answer.

3 months ago
Billy Rowell ▴ 510

For a true "authoritative" answer, I'd email support@pacb.com. I work at PacBio, but I have no authority.

What additional information do the BAM files contain compared to the FASTQ ones

This part of your question is answered, as suggested by Istvan Albert and GenoMax, by comparing the BAM headers and tags from your BAM with the SAM Spec, SAM optional field spec (for more information on standard SAM tags), and the PacBio BAM spec (for PacBio-specific tags). You can cross-reference the tags from your BAM against their purpose in the specs above. The RG header alone will carry a lot of detailed information about how the data was generated.

A short, non-exhaustive list of the types of data in these SAM tags:

  • ZMW (zm)
  • indications of the first and last polymerase base that contributed to the consensus (ws, we)
  • number of subread passes that contributed to the consensus HiFi read (np, ec)
  • consensus read quality (rq)
  • indications for how the hairpin adapters were called and removed (ac, ma)
  • barcode information (bc, bq, etc.); this information is required to re-demux the dataset if it was demuxed incorrectly
  • consensus kinetic information, used to identify base modifications (fp, fi, fn, rp, ri, rn)
  • base modification types, positions, and qualities (MM, ML)
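
To see which of these tags are actually present in a given file, one option is to tally the tag names over the first few thousand records. This is a minimal sketch, assuming samtools is installed and in.bam is your file; count_tags is a hypothetical helper name, not a samtools command:

```shell
# count_tags: read SAM records on stdin and print "<tag>:<type> <count>"
# for every optional field (columns 12 and up).
# Hypothetical helper for illustration, not part of samtools.
count_tags() {
  awk -F'\t' '
    /^@/ { next }                      # skip header lines
    {
      for (i = 12; i <= NF; i++) {
        split($i, parts, ":")          # each field is TAG:TYPE:VALUE
        seen[parts[1] ":" parts[2]]++
      }
    }
    END { for (t in seen) print t, seen[t] }
  ' | sort
}

# Usage (assumes samtools is on PATH and in.bam exists):
#   samtools view in.bam | head -n 1000 | count_tags
```

If fi, fp, fn, ri, rp, or rn show up in the output, the file carries consensus kinetics; MM and ML indicate called base modifications.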

Compared to the BAM, the FASTQ has much less data. It contains only:

  • the read "query" name. This is actually pretty informative, as it can indicate the instrument type/serial, the movie name, and the ZMW
  • the HiFi bases
  • the HiFi base quality
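
As an illustration of what the read name carries: the PacBio BAM spec describes CCS read names as {movieName}/{holeNumber}/ccs, so the movie and ZMW can be recovered from a FASTQ header alone. A minimal sketch under that assumption; split_read_name is a hypothetical helper, the movie name below is made up, and names with extra suffixes (for example by-strand reads) would need more handling:

```shell
# split_read_name: print the movie name and ZMW hole number from a CCS read name.
# Assumes the {movieName}/{holeNumber}/ccs convention. Hypothetical helper.
split_read_name() {
  name=${1#@}          # drop the leading "@" of a FASTQ header, if present
  name=${name%% *}     # drop any description after the first space
  movie=${name%%/*}    # everything before the first "/"
  rest=${name#*/}      # everything after the first "/"
  zmw=${rest%%/*}      # hole number sits between the first and second "/"
  printf 'movie=%s zmw=%s\n' "$movie" "$zmw"
}

split_read_name "@m84011_230101_000000_s1/46465666/ccs"
# prints: movie=m84011_230101_000000_s1 zmw=46465666
```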

for what kind of analyses [is extra BAM information] useful

  • tracing provenance of the data
  • binning or filtering the reads by estimated quality
  • correcting demultiplexing if it was done incorrectly the first time
  • identifying modified bases
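
As one example of filtering by estimated quality, the rq tag can be used to keep only reads above a chosen predicted accuracy. This is a minimal sketch that works on SAM text from stdin; filter_by_rq is a hypothetical helper, and the 0.999 threshold is an arbitrary choice, not a recommendation:

```shell
# filter_by_rq MIN: pass through header lines and any record whose rq:f: value
# is >= MIN. Records without an rq tag are dropped. Hypothetical helper.
filter_by_rq() {
  awk -F'\t' -v min="$1" '
    /^@/ { print; next }               # keep header lines untouched
    {
      for (i = 12; i <= NF; i++)
        if ($i ~ /^rq:f:/ && substr($i, 6) + 0 >= min + 0) { print; next }
    }
  '
}

# Usage (assumes samtools is on PATH and in.bam exists):
#   samtools view -h in.bam | filter_by_rq 0.999 | samtools view --bam --output hq.bam -
# Recent samtools versions should also accept a filter expression directly,
# e.g. samtools view -e '[rq] >= 0.999' (check your version's manual).
```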

And from the comment on @ialbert's answer:

A standard Revio output should be ~50-60GB max. Based on the file size you mention in the comment on the other answer, your service provider probably saved consensus kinetics to your BAM as well, which includes 4 more arrays and 2 more ints per read (fi, fp, fn, ri, rp, rn). Check to see if your BAM has both kinetics (the tags in the previous sentence) and basemods (MM, ML). If it has both, removing the kinetics tags will greatly reduce the size of your BAM. You can remove the tags using something like:

samtools view -@7 --bam \
  --remove-tag fi,fp,fn,ri,rp,rn \
  --output out.bam in.bam

After removing these tags, the BAM file size went from 158 GB to 18 GB. Thanks again. Should I contact the sequencing service to let them know end users would be better served if they never received these tags? Is there any possible use for this information other than potentially helping to debug the sequencers?


I think the general thinking is that these kinetics tags will allow you to recall basemods if our current 5mC caller changes or if we add new basemod callers in the future. So it's basically for future-proofing.

3 months ago

The proper answer to this is that you have to look at the tags in the BAM file; those will tell you what additional information is stored for each read. For example, in one file I see:

RG:Z:797d41a8   ac:B:i,60,0,36,24   ec:f:55.0811    ma:i:0  np:i:60 rq:f:0.999997   sn:B:f,9.44502,13.8063,3.47653,6.21549  we:i:10754634   ws:i:98819  zm:i:46465666
MM:Z:C+m,21,55,4,3,12,1,2,0,0,4,4,1,2,0,2,1,0,0,3,0,0,6,3,7,0,0,4,5,2,2,3,2,18,4,6,18,9,4,3,8,8,9,6,3,5,3,17,9,1,5,7,6,2,13,3,13,6,3,1,0,15,2,4,7,0,6,5,3,4,5,1,1,7,6,10,19,10,0,1,3,1,2,1,6,2,40,9,4,4,0,6,0,2,9,6,1,5,2,6,2,30,3,9,8,5,0,0,0,0,0,0,0,0;   
ML:B:C,5,19,213,3,16,0,0,0,0,3,30,137,9,20,4,5,8,2,0,8,18,100,8,14,62,230,12,3,23,135,89,201,12,2,77,4,16,185,28,8,1,18,16,215,3,17,10,6,5,75,22,123,2,7,50,42,207,1,8,0,0,23,9,117,3,21,125,13,137,54,109,31,20,81,12,15,34,1,66,8,3,17,2,19,134,193,2,1,0,2,4,0,199,2,129,10,1,11,13,111,34,7,2,61,1,11,3,26,6,0,4,1,10

It seems the largest piece of information is the methylation tags.

I will say, though, that I was surprised that the size difference is 10x.

The additional information seems to amount to less than 2x more data, but perhaps the answer here is that the FASTQ content compresses much better because it is far more repetitive.


Bases (N=4) and binned base quality scores (N=~7) are highly compressible. ML and MM tags have a lot more dynamic range and variability, and aren't as easily compressed.
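
A rough way to see this effect is to gzip two synthetic streams with the same number of values: one drawn from a 4-letter alphabet, the other from the 0-255 range that ML values span. This sketch uses made-up random data, not real reads, and gz_size is a hypothetical helper. Real data are not uniformly random, so only the direction of the difference is meaningful:

```shell
# gz_size KIND N: print the gzip-compressed size in bytes of N random values.
#   KIND=bases -> letters drawn from ACGT
#   KIND=ml    -> comma-separated integers in 0-255
# Hypothetical helper working on synthetic data only.
gz_size() {
  awk -v kind="$1" -v n="$2" 'BEGIN {
    srand(1)
    for (i = 0; i < n; i++) {
      if (kind == "bases") printf "%s", substr("ACGT", int(rand() * 4) + 1, 1)
      else                 printf "%d,", int(rand() * 256)
    }
  }' | gzip -c | wc -c | tr -d ' '
}

echo "bases: $(gz_size bases 100000) bytes after gzip"
echo "ml:    $(gz_size ml 100000) bytes after gzip"
```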


I do think the compression ratios for the sequences and qualities are much greater than for the rest. I myself was surprised to get fastq.gz files of about 27 GB by two different approaches from a ~280 GB BAM file. I guess they set the sequencer to extra verbose.

For genome assembly, would you think all but the sequences and their quality can be thrown out? This is already what I believe but I want to make sure before I ditch a few TBs of raw data :D


A standard Revio output should be ~50-60GB max. Your service provider probably saved consensus kinetics to your BAM as well, which includes 4 more arrays and 2 more ints per read (fi, fp, fn, ri, rp, rn). Check to see if your BAM has both kinetics (the tags in the previous sentence) and basemods (MM, ML). If it has both, removing the kinetics tags will greatly reduce the size of your BAM. You can remove the tags using something like:

samtools view -@7 --bam \
  --remove-tag fi,fp,fn,ri,rp,rn \
  --output out.bam in.bam