Why is the optimal value of BQSR intervals 20?
7.2 years ago

I see a hint in SevenBridges "BQSR intervals optimal value is 20 or chr20 *".

I tried running GATK BaseRecalibrator with and without -L 20, and the resulting files are nearly the same. Why is that? Does it mean we don't need to run BQSR on chromosomes other than chr20?

wgs gatk • 2.9k views

Here is my command with -L 20

java -Xmx50000M -jar GenomeAnalysisTK-3.5-0-g36282e4/GenomeAnalysisTK.jar \
    --analysis_type BaseRecalibrator \
    -nct 48 \
    --out CCLE-HCC1143-DNA-10_Illumina.converted.sorted.deduped.recal_L20data.grp \
    --disable_indel_quals \
    --reference_sequence human_g1k_v37_decoy.fasta \
    --input_file CCLE-HCC1143-DNA-10_Illumina.converted.sorted.deduped.bam \
    --knownSites dbsnp_137.b37.vcf \
    --knownSites 1000G_phase1.indels.b37.vcf \
    --knownSites Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
    -L 20


Do yourself a favor and omit the BQSR step. Some papers from recent years, as well as comments here on Biostars, state that BQSR has little to no effect on variant calling. Just use the search function here to get some details. EDIT: The more I searched around, the more I also found others stating that BQSR is beneficial, so I have to qualify my comment above.


Thank you for your reply. As a newbie, I see that the GATK Best Practices recommend doing BQSR: https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS


Hi bluemonster0808,

I see a hint in SevenBridges "BQSR intervals optimal value is 20 or chr20 *".

Do you have a source for this quotation? A quick Google search suggests that the quote, as you've put it, doesn't exist (?), but maybe you got it from a colleague or a SevenBridges white paper?


https://ibb.co/cT0if6

You can see the screenshot at the link above. I saved it on a free image host.

7.0 years ago

Since this question has returned to the top, I'll take a shot at answering it.

Base quality recalibration as implemented in GATK consists of two steps. The first step collects statistics about systematic biases in the reported base qualities; the second step actually edits the BAM records to recalibrate them.
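To make the two steps concrete, here is a rough sketch of what the GATK 3.x invocations look like, written as a small Python helper that only builds the argument lists and prints them (it does not run GATK). The helper names build_recal_table_cmd and build_apply_cmd and all file names are placeholders I made up for illustration. The flags are the GATK 3.x ones; GATK 4 uses different tool names, so treat this as a sketch rather than a recipe.

```python
from typing import List, Optional

GATK_JAR = "GenomeAnalysisTK.jar"  # placeholder path to a GATK 3.x jar


def build_recal_table_cmd(reference: str, bam: str, known_sites: List[str],
                          out_table: str,
                          interval: Optional[str] = None) -> List[str]:
    """Step 1: collect recalibration statistics (optionally on one interval)."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "BaseRecalibrator",
           "-R", reference, "-I", bam, "-o", out_table]
    for vcf in known_sites:
        cmd += ["-knownSites", vcf]
    if interval is not None:
        cmd += ["-L", interval]  # restrict statistics collection only
    return cmd


def build_apply_cmd(reference: str, bam: str, recal_table: str,
                    out_bam: str) -> List[str]:
    """Step 2: rewrite base qualities for all reads, using the table."""
    return ["java", "-jar", GATK_JAR, "-T", "PrintReads",
            "-R", reference, "-I", bam, "-BQSR", recal_table, "-o", out_bam]


# Hypothetical file names, for illustration only.
step1 = build_recal_table_cmd("ref.fasta", "sample.bam", ["dbsnp.vcf"],
                              "recal.grp", interval="20")
step2 = build_apply_cmd("ref.fasta", "sample.bam", "recal.grp",
                        "sample.recal.bam")
print(" ".join(step1))
print(" ".join(step2))
```

Note that only the first command carries -L. The second one has no interval restriction, so the table is applied to reads from every chromosome.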

The first step (GenomeAnalysisTK.jar -T BaseRecalibrator ...) doesn't require the entire genome. In fact, one chromosome may suffice to collect enough data for accurate statistics (see also "Downsampling to reduce time"), hence the suggestion to use -L chr20. In turn, this explains why you get effectively the same results without -L, i.e. when using the entire genome. It doesn't mean the other chromosomes are skipped: the second step still recalibrates the reads on every chromosome, using the statistics collected from chr20.
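If you want to convince yourself that a subset is enough, here is a toy simulation. It is not GATK's actual model: the quality scores, the 3-unit bias, the number of bases and the smoothed estimator are all assumptions chosen for illustration. The 2% subset roughly matches the share of the human genome that is chr20. It estimates the empirical quality for each reported score from the subset and from the full data set and prints both.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

REPORTED_QUALS = [10, 20, 30]  # hypothetical reported Phred scores
TRUE_SHIFT = -3                # assumed bias: true quality is 3 units lower
BASES_PER_QUAL = 2_000_000     # simulated "whole genome" bases per score
SUBSET_FRACTION = 0.02         # chr20 is roughly 2% of the human genome


def empirical_quality(mismatches: int, observations: int) -> float:
    """Smoothed empirical Phred score from mismatch counts (simplified)."""
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)


def simulate_mismatches(n_bases: int, true_qual: float) -> int:
    """Count simulated mismatches among n_bases at the given true quality."""
    p_error = 10 ** (-true_qual / 10)
    return sum(random.random() < p_error for _ in range(n_bases))


results = {}
for q in REPORTED_QUALS:
    subset_n = int(BASES_PER_QUAL * SUBSET_FRACTION)
    rest_n = BASES_PER_QUAL - subset_n
    subset_mm = simulate_mismatches(subset_n, q + TRUE_SHIFT)
    rest_mm = simulate_mismatches(rest_n, q + TRUE_SHIFT)
    q_subset = empirical_quality(subset_mm, subset_n)
    q_full = empirical_quality(subset_mm + rest_mm, BASES_PER_QUAL)
    results[q] = (q_subset, q_full)
    print(f"reported Q{q}: subset -> {q_subset:.2f}, full -> {q_full:.2f}")
```

A real 30x WGS run has far more bases on chr20 than this toy subset, so the agreement between the two tables should be even tighter in practice.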


This sounds reasonable

Thank you!
