Hi,
I recently found out about gkno and I'm getting familiar with it. I have limited experience with GATK, and I was wondering why gkno marks the "-knownSites" option as required in the "gatk-count-covariates" tool. As far as I can tell, while strongly advised, this option is marked as optional upstream[[1]]. I'm working with bacterial genomes, for which there is no known SNP database to use. I guess I could skip quality recalibration altogether, but I feel this would be far from ideal.
Thanks,
Carlos
Thanks for your answer. I was thinking about the workaround of using an empty VCF file (from what you say, it needs at least one entry).
One thing does confuse me, and it could be because of my limited experience with GATK. In the link I included above, in the table "BaseRecalibrator specific arguments", '-knownSites' is marked as optional. However, you say it is required. Are you saying the GATK documentation mislabels this option?
I kept thinking about my options. What if I do a first pass without recalibration and generate a VCF file with a very stringent set of variants, then use this set as my "knownSites"? Would I be introducing bias? If I understood covariates correctly, as long as there is a good representation of the sample I should be fine, right? Do you have recommendations on stringent filtering criteria to build the initial set of variants? Thanks.
This is actually what we recommend doing when you do not have known sites available for your organism. For best results, you can repeat this "loop" (generate high-confidence variants, use them to recalibrate the original BAM file, generate a new set of high-confidence variants) several times to refine the set of high-confidence variants.
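For illustration, one round of that loop could be sketched in Python roughly as below. This is only a sketch under assumptions, not an official recipe: the jar name, the file names, the `min_qual` threshold of 100 and both helper functions are hypothetical, and the command lines assume GATK 2.x/3.x-style walkers (BaseRecalibrator, PrintReads, UnifiedGenotyper), so check the flags against the documentation for your version. The code only builds the commands; it does not run them.

```python
# Sketch of one bootstrap round: filter a call set down to high-confidence
# variants, then plan the GATK commands that use it as known sites.
# All names, paths and thresholds here are illustrative assumptions.

GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed jar location


def keep_high_confidence(vcf_lines, min_qual=100.0):
    """Keep header lines plus records with QUAL >= min_qual that are unfiltered."""
    kept = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)  # header lines are always kept
            continue
        fields = line.split("\t")
        qual, filt = fields[5], fields[6]  # VCF columns 6 (QUAL) and 7 (FILTER)
        if qual == ".":
            continue  # no quality recorded, so not "high confidence"
        if float(qual) >= min_qual and filt in ("PASS", "."):
            kept.append(line)
    return kept


def plan_round(reference, original_bam, known_vcf, round_no):
    """Return the three command lines for one recalibrate-and-recall round."""
    table = f"round{round_no}.recal.table"
    recal_bam = f"round{round_no}.recal.bam"
    raw_vcf = f"round{round_no}.raw.vcf"
    return [
        # 1. Model the errors, masking the current high-confidence sites.
        GATK + ["-T", "BaseRecalibrator", "-R", reference, "-I", original_bam,
                "-knownSites", known_vcf, "-o", table],
        # 2. Apply the recalibration to the ORIGINAL bam file.
        GATK + ["-T", "PrintReads", "-R", reference, "-I", original_bam,
                "-BQSR", table, "-o", recal_bam],
        # 3. Call variants again on the recalibrated reads.
        GATK + ["-T", "UnifiedGenotyper", "-R", reference, "-I", recal_bam,
                "-o", raw_vcf],
    ]


if __name__ == "__main__":
    for cmd in plan_round("ref.fasta", "sample.bam", "round0.confident.vcf", 1):
        print(" ".join(cmd))
```

After each round, the new raw VCF would be passed through `keep_high_confidence` again and the result used as `known_vcf` for the next round, always starting from the original BAM file.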
Great to get confirmation for this approach. Is there a good link where I could read how you recommend doing this?
Hi Carlos,
This is an issue of interpretation of the documentation. What we indicate as required in the documentation is what is required for the program to run from a technical standpoint. It is technically possible to run BaseRecalibrator without known sites. However, it is extremely inadvisable to do so from an analytical standpoint, because of the assumptions that the algorithm relies on. We try to make this clear in the documentation that describes how to use the tools.
Thanks for the help; it is quite useful.