Question

Why is the "-knownSites" option required in gkno's "gatk-count-covariates" tool?

2

12.0 years ago

Carlos Borroto ★ 2.1k

Hi,

I recently found out about gkno and I'm getting familiar with it. I have limited experience using GATK, and I was wondering why gkno marks the "-knownSites" option as required in the "gatk-count-covariates" tool. As far as I can tell, while strongly advised, this option is marked as optional upstream [1]. I'm working with bacterial genomes for which there is no known SNP database to use. I guess I could skip quality recalibration altogether, but I feel this would be far from ideal.

[1] http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html

Thanks,
Carlos

gatk • 5.0k views

ADD COMMENT • link updated 11.0 years ago by Biostar 20 • written 12.0 years ago by Carlos Borroto ★ 2.1k

score 3 · Answer 1 · 2013-05-14

3

12.0 years ago

alistairnward ▴ 210

Unfortunately, the '-knownSites' option is required, not optional. As previously mentioned, when GATK marches through the BAM file, it assumes that any mismatch with the reference is an error. If the mismatch is a known variant (i.e. it is in dbSNP), GATK ignores the site and doesn't use it in generating covariates. Removing the recalibration step is definitely an option. Alternatively, you could generate a VCF file that contains a single SNP, just to fulfill the requirement. If you need assistance modifying a pipeline, please let us know. Unfortunately, we are not the authors of GATK and, as such, we cannot modify the requirements for that tool.
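For illustration only, the single-SNP workaround could be sketched as below. This builds the text of a minimal VCF with one record. The function name, contig name, position, and alleles are all hypothetical placeholders, and the sketch does not check anything GATK or gkno might additionally require (for example, contig names matching the reference).

```python
def minimal_vcf(chrom, pos, ref, alt):
    """Return the text of a VCF file holding a single SNP record."""
    header = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Missing ID, QUAL, FILTER and INFO values are written as "." in VCF.
    record = "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."])
    return "\n".join(header + [record]) + "\n"


# Hypothetical contig and site; replace with a real position in your reference.
vcf_text = minimal_vcf("chr1", 1000, "A", "G")
```

The returned string can then be written to a file and passed as the known-sites input.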

ADD COMMENT • link 12.0 years ago by alistairnward ▴ 210

0

Thanks for your answer. I was thinking about the workaround of using an empty VCF file (though from what you say, I need at least one entry).

One thing does confuse me, and it could be because of my limited experience with GATK. In the link I included above, in the table "BaseRecalibrator specific arguments", '-knownSites' is marked as optional. However, you say it is required. Are you saying the GATK documentation is mislabeling this option?

ADD REPLY • link 12.0 years ago by Carlos Borroto ★ 2.1k

0

I kept thinking about my options. What if I do a first pass without recalibration and generate a VCF file with a very stringent set of variants, and then use this set as my "knownSites"? Would I be introducing bias? If I understood covariates correctly, as long as there is a good representation of the sample I should be fine, right? Do you have recommendations on stringent variant filters to build the initial set of variants? Thanks.

ADD REPLY • link 12.0 years ago by Carlos Borroto ★ 2.1k

0

This is actually what we recommend doing in case you do not have known sites available for your organism. You can repeat this "loop" (generate high confidence variants, use them to recalibrate the original bam file, generate a new set of high confidence variants) several times to refine the set of high confidence variants, for best results.
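As a rough sketch of the "generate high confidence variants" step in that loop, one could keep only the calls above a quality cutoff and use the result as the known sites for the next round. The function name and the `min_qual=100.0` cutoff are assumptions made for illustration, not thresholds recommended in this thread or by GATK.

```python
def high_confidence_records(vcf_lines, min_qual=100.0):
    """Yield header lines plus records whose QUAL is at least min_qual."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # QUAL is the sixth VCF column
        if qual != "." and float(qual) >= min_qual:
            yield line


# Made-up first-pass calls: one confident site and one weak site.
calls = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t250.0\t.\t.",
    "chr1\t200\t.\tC\tT\t12.5\t.\t.",
]
known_sites = list(high_confidence_records(calls))
```

Each pass of the loop would recalibrate the original BAM against `known_sites`, call variants again, and re-filter.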

ADD REPLY • link 12.0 years ago by vdauwera ★ 1.2k

0

Great to get confirmation for this approach. Is there a good link where I could read how you recommend doing this?

ADD REPLY • link 12.0 years ago by Carlos Borroto ★ 2.1k

0

Hi Carlos,

This is an issue of interpretation of the documentation. What we indicate as required in the documentation is what is required for the program to run from a technical standpoint. It is technically possible to run BaseRecalibrator without known sites. However, it is extremely inadvisable to do so from an analytical standpoint, because of the assumptions that the algorithm relies on. We try to make this clear in the documentation that describes how to use the tools.

ADD REPLY • link 12.0 years ago by vdauwera ★ 1.2k

0

Thanks for the help; it is quite useful.

ADD REPLY • link 12.0 years ago by Carlos Borroto ★ 2.1k

score 0 · Answer 2 · 2013-05-13

0

12.0 years ago

Pierre Lindenbaum 166k

From the BaseRecalibrator documentation (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html):

It does a by-locus traversal operating only at sites that are not in dbSNP. We assume that all reference mismatches we see are therefore errors and indicative of poor base quality.
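A toy illustration of why this assumption makes known sites matter: if a true variant is not masked, its reads are counted as errors and the estimated base quality drops. This is not GATK's actual model; the function, the data, and the add-one smoothing are assumptions made purely for the sketch.

```python
import math


def empirical_quality(observations, known_sites=frozenset()):
    """Phred-scaled mismatch rate over sites not listed in known_sites.

    observations: iterable of (position, read_base, reference_base).
    """
    total = mismatches = 0
    for pos, read_base, ref_base in observations:
        if pos in known_sites:
            continue  # known variant: skipped, so it is not counted as an error
        total += 1
        if read_base != ref_base:
            mismatches += 1
    # Add-one smoothing (an assumption of this sketch) avoids log10(0).
    error_rate = (mismatches + 1) / (total + 2)
    return -10 * math.log10(error_rate)


# 98 matching bases, plus two reads carrying a true variant at position 1000.
obs = [(i, "A", "A") for i in range(98)] + [(1000, "G", "A"), (1000, "G", "A")]
q_without = empirical_quality(obs)                    # variant counted as error
q_with = empirical_quality(obs, known_sites={1000})   # variant masked
```

Here `q_with` comes out higher than `q_without`, because masking the variant site removes two mismatches that were never sequencing errors.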