Question

Base recalibration in normal vs. tumor somatic variant calling in WXS data?

4

3.5 years ago

rebeliscu ▴ 60

Hi there,

I have a tumor and a normal BAM file and am preparing to run base recalibration.

I was planning on calling variants on the normal and using that, in addition to dbSNP, as input for recalibration of tumor BAM(s), e.g.:

gatk BaseRecalibrator \
  -I tumor.bam \
  -R hg38.fasta \
  --known-sites normal.vcf  \
  --known-sites dbSNP_hg38.vcf  \
  -O tumor_recal.table

Before producing the normal VCF, however, it's not clear to me whether I should run base recalibration on the normal BAM. If this is advised, I had planned on using dbSNP as the known sites (for normal), e.g.:

gatk BaseRecalibrator \
  -I normal.bam \
  -R hg38.fasta \
  --known-sites dbSNP_hg38.vcf  \
  -O normal_recal.table

Alternatively, I could keep things simple and run base recalibration on both tumor and normal using dbSNP only.

Is one of these workflows preferable? Any clarity here would be much appreciated. Thanks!

WXS recalibration variant somatic • 4.7k views

ADD COMMENT • link updated 20 months ago by aldhairmedico ▴ 70 • written 3.5 years ago by rebeliscu ▴ 60

0

3.5 years ago

aldhairmedico ▴ 70

Hi, if you read the GATK Best Practices documentation and the forum posts about BaseRecalibrator, you will find that the purpose of recalibrating BAMs is to correct systematic errors in the base quality scores reported by the sequencer. In certain sequence contexts, such as AAG, the PHRED score of the third nucleotide after a repeat tends to be overestimated. If you don't recalibrate these scores, you can get variant calls that pass the hard filters at a position whose true PHRED score is low. So, whether you are processing normals or tumors, you should perform this step on all your samples.

Hope this is helpful.

ADD COMMENT • link 3.5 years ago by aldhairmedico ▴ 70

0

Thanks for your response.

I guess I intuited that the normal would need to be recalibrated. Additionally, it's not clear to me: should I recalibrate the tumor BAMs using the normal as a "known sites" input (i.e. recal on normal using dbSNP, call variants, use as input to recal tumor) or recalibrate them all the same way, i.e. just using dbSNP? Hopefully that makes sense.

ADD REPLY • link 3.5 years ago by rebeliscu ▴ 60

0

Yes, as Cyriac answered before, all samples should be recalibrated because no conditions or comparisons are defined at this point. The goal of --known-sites is to mask known polymorphic sites. Otherwise, GATK can penalize those positions by mistaking real variation for sequencing errors. That's why I try to use the most up-to-date version of it; you can check the Ensembl FTP for the 1000G and dbsnp+havana VCFs. Just be aware that BaseRecalibrator can fail if it finds symbolic alleles in the reference alleles, with an error such as:

Error: htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 5880: Duplicate allele added to VariantContext: GT
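
One way to locate such records before running BaseRecalibrator is to scan the known-sites VCF for lines where the same allele appears more than once across the REF and ALT columns, which is what the "Duplicate allele" message suggests. This is a minimal sketch, assuming an uncompressed, tab-delimited VCF; the function name find_duplicate_alleles is hypothetical and not part of GATK, and other kinds of malformed records would not be caught by it.

```shell
# Hypothetical helper (not a GATK tool): print every VCF record in which
# an allele is repeated across REF (column 4) and ALT (column 5).
# Output format: "<file line number>: <original record>"
find_duplicate_alleles() {
  awk -F'\t' '
    /^#/ { next }                       # skip header lines
    {
      split("", seen)                   # portable way to clear the array
      seen[toupper($4)] = 1             # remember the REF allele
      n = split($5, alt, ",")           # ALT may hold several alleles
      for (i = 1; i <= n; i++) {
        a = toupper(alt[i])
        if (a in seen) {                # allele already listed: report it
          print NR ": " $0
          break
        }
        seen[a] = 1
      }
    }' "$1"
}
```

For example, `find_duplicate_alleles dbSNP_hg38.vcf | head` would list the first offending records, if any. A gzipped VCF would need to be decompressed first.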

ADD REPLY • link 20 months ago by aldhairmedico ▴ 70

0

3.5 years ago

tomas4482 ▴ 430

No matter what kind of samples and variants you are dealing with, the preprocessing pipeline needs to be run first: MarkDuplicatesSpark, then base quality score recalibration (BaseRecalibrator), then applying the recalibration (ApplyBQSR). For RNA-seq, an additional SplitNCigarReads step is needed before BQSR as well.

Only after the recalibration has been applied to your BAM can it be used as input for detecting somatic or germline variants.
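
As a rough sketch of those steps for the tumor/normal pair in the question, the snippet below only prints the GATK commands for each sample so they can be reviewed before anything is run. The input file names follow the question; the .dedup.bam and .recal.bam output names and the bqsr_commands helper are made up for illustration, and using dbSNP as the only known-sites resource is an assumption.

```shell
# Assumed inputs, named as in the question; adjust to your own files.
REF=hg38.fasta
KNOWN=dbSNP_hg38.vcf

# Hypothetical helper: print the three preprocessing commands for one sample.
# Nothing is executed here; the commands are only written to stdout.
bqsr_commands() {
  local s=$1
  # 1. Mark duplicate reads
  echo "gatk MarkDuplicatesSpark -I ${s}.bam -O ${s}.dedup.bam"
  # 2. Build the recalibration table, masking known polymorphic sites
  echo "gatk BaseRecalibrator -I ${s}.dedup.bam -R ${REF} --known-sites ${KNOWN} -O ${s}_recal.table"
  # 3. Apply the table to produce the recalibrated BAM
  echo "gatk ApplyBQSR -I ${s}.dedup.bam -R ${REF} --bqsr-recal-file ${s}_recal.table -O ${s}.recal.bam"
}

# Same treatment for tumor and normal
for sample in tumor normal; do
  bqsr_commands "$sample"
done
```

Once the printed commands look right, they can be pasted into a terminal or piped to bash. The recalibrated BAMs would then be the inputs to the variant caller.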

ADD COMMENT • link 3.5 years ago by tomas4482 ▴ 430

score 6 · Accepted Answer · 2021-10-22

6

3.5 years ago

Cyriac Kandoth 6.1k

The current "best-practice" is to always do BQSR with the latest (and largest) dbSNP VCF on all samples - tumor or normal, FFPE or blood, etc. Per discussions in this post and this post, BQSR can benefit slightly if you provide a "bootstrap of known variants" unique to your samples, either somatic/germline variants found in your tumor/normal. However, you are effectively running your primary analysis pipeline twice (which is overkill), potentially amplifying false-positive variants (from your first-pass variant list), and potentially breaking compatibility of your BAMs with secondary analysis pipelines (e.g. downstream false-positive filters that use BQ).

There are also recent arguments against using BQSR at all, and instead flagging false-positives based on Base Quality drop-off (at the ends of reads, strand-bias, etc.). You can find an old Perl script here that implements such BQD filters. I also just found this tweet from Geraldine of GATK saying they're thinking of dropping BQSR from the best-practices - presumably because the high computational-expense of BQSR is not reasonable when the quality of DNA sequencing has improved so much. I would still recommend BQSR when re-analyzing old FASTQs, or when comparing FASTQs from a mix of different sequencers (e.g. HiSeq and NovaSeq).

ADD COMMENT • link 3.5 years ago by Cyriac Kandoth 6.1k

0

Hi Cyriac, thanks so much for your response, this is very helpful. To be clear, doing BQSR with, for example, dbSNP, would not hurt the output of your analyses so much as the computational aspect is cumbersome, yes?

ADD REPLY • link 3.5 years ago by rebeliscu ▴ 60

0

"hurt" is relative. :) Waiving the computational expense, doing BQSR gives you a decent balance between variant detection sensitivity and specificity. But if you care more about sensitivity than specificity, then BQSR will hurt your analysis. See more here.