First of all, thank you for a very clear question and for the effort you put into formatting it nicely.
If you read the description of "known" on the VQSR page (https://gatk.broadinstitute.org/hc/en-us/articles/360036351392-VariantRecalibrator), it says:
--resource / -resource
Known - The program only uses known sites for reporting purposes (to indicate whether variants are already known or novel)
So, known sites are used only to flag whether a site has already been reported elsewhere (known) or not (novel). For this purpose it makes sense to take dbSNP as the most suitable resource, because it collates SNP information from the other sources. See these:
https://www.internationalgenome.org/faq/are-the-igsr-variants-available-in-dbsnp/
Which One Should I Use Hapmap Or 1000Genome Or Dbsnp?
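To make the "reporting only" role of known sites concrete, here is a toy sketch. It is not how GATK implements it: it assumes matching on chromosome and position only, and the coordinates are made up for illustration.

```python
def label_variants(callset, known_sites):
    """Label each (chrom, pos) call as 'known' or 'novel'.

    Toy illustration only: matches on chromosome and position,
    which is an assumption, not GATK's actual matching logic.
    """
    known = set(known_sites)
    return [
        (chrom, pos, "known" if (chrom, pos) in known else "novel")
        for chrom, pos in callset
    ]


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    dbsnp_sites = [("chr1", 1000), ("chr1", 2500)]
    my_calls = [("chr1", 1000), ("chr1", 1800)]
    for chrom, pos, status in label_variants(my_calls, dbsnp_sites):
        print(chrom, pos, status)
```

The labels do not change which variants pass or fail. They only split the report into known and novel calls.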
Other SNP resources can be used as 'truth' and 'training' sets, as explained in the second link that you posted:
A training set resource is a list of variants that is used by
machine-learning based algorithms to model the properties of true
variation vs. artifacts. This requires a higher standard of curation
and validation of the variants that are included in the resource.
Tools that take such a resource typically accept a parameter that
indicates your degree of confidence in the resource. This type of
resource is difficult to bootstrap, as it benefits greatly from
orthogonal validation (e.g. through a different technology such as
arrays or Sanger sequencing).
A truth set resource is a list of variants that is used to evaluate
the quality of a variant callset (e.g. sensitivity and specificity, or
recall). As such this requires the highest standard of validation, and
tools that take such a resource will assume all variant calls it
contains are true variation. This cannot be bootstrapped and must be
generated using orthogonal validation methods.
As you see, the training set requires a higher degree of confidence and the truth set requires the highest degree of confidence - so you can choose them according to your confidence level.
For example, when recalibrating exome SNP data, different resources are used for the training and truth sets (see the link above):
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.sites.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg38.sites.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf.gz \
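If it helps to see how those --resource flags fit into a full command, here is a rough sketch that assembles a VariantRecalibrator call in Python. The resource entries are the four quoted above. The reference file name, input VCF, output prefix and the list of -an annotations are my assumptions for illustration, so replace them with your own.

```python
import shlex

# (name, known, training, truth, prior, file), taken from the flags quoted above.
RESOURCES = [
    ("hapmap", False, True, True, 15.0, "hapmap_3.3.hg38.sites.vcf.gz"),
    ("omni", False, True, False, 12.0, "1000G_omni2.5.hg38.sites.vcf.gz"),
    ("1000G", False, True, False, 10.0, "1000G_phase1.snps.high_confidence.hg38.vcf.gz"),
    ("dbsnp", True, False, False, 2.0, "Homo_sapiens_assembly38.dbsnp138.vcf.gz"),
]


def resource_args(resources):
    """Turn the resource table into --resource:<tags> <file> argument pairs."""
    args = []
    for name, known, training, truth, prior, path in resources:
        tag = (
            f"--resource:{name},known={str(known).lower()},"
            f"training={str(training).lower()},truth={str(truth).lower()},"
            f"prior={prior}"
        )
        args += [tag, path]
    return args


def build_command(reference, input_vcf, output_prefix, annotations):
    """Assemble the argument list; nothing is executed here."""
    cmd = ["gatk", "VariantRecalibrator", "-R", reference, "-V", input_vcf]
    cmd += resource_args(RESOURCES)
    for annotation in annotations:
        cmd += ["-an", annotation]
    cmd += [
        "-mode", "SNP",
        "-O", f"{output_prefix}.recal",
        "--tranches-file", f"{output_prefix}.tranches",
    ]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names and annotation list; adjust to your data.
    cmd = build_command(
        "Homo_sapiens_assembly38.fasta",
        "cohort.vcf.gz",
        "cohort.snps",
        ["QD", "MQ", "MQRankSum", "ReadPosRankSum", "FS", "SOR"],
    )
    print(shlex.join(cmd))
```

Only dbsnp has known=true here, so it is the only resource used for the known/novel labelling. The other three feed the model as training (and, for hapmap, truth) sites.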
Using dbSNP has another advantage: it collects SNPs from other species as well, so you have a single 'known' database for SNPs in all species.
BQSR
The --known-sites argument is used a little bit differently in BQSR (https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator#--known-sites):
--known-sites / NA
One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis. This
algorithm treats every reference mismatch as an indication of error.
However, real genetic variation is expected to mismatch the reference,
so it is critical that a database of known polymorphic sites is given
to the tool in order to skip over those sites. This tool accepts any
number of Feature-containing files (VCF, BCF, BED, etc.) for use as
this database. For users wishing to exclude an interval list of known
variation simply use -XL my.interval.list to skip over processing
those sites. Please note however that the statistics reported by the
tool will not accurately be reflected those sites skipped by the -XL
argument.
So here you would give any/all sites that have the potential to be a real variant, so that BaseRecalibrator skips them instead of counting them as errors.
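As a rough sketch of what "give any/all sites" looks like in practice, the snippet below assembles a BaseRecalibrator call with --known-sites repeated once per file. The BAM, reference and output names are hypothetical, and the choice of the dbSNP file plus a Mills indel file is just an example of combining several resources.

```python
import shlex


def build_bqsr_command(bam, reference, known_sites, output_table):
    """Assemble a BaseRecalibrator argument list; nothing is executed here."""
    if not known_sites:
        # The tool needs known polymorphic sites to skip, as the quoted docs explain.
        raise ValueError("at least one known-sites file is required")
    cmd = ["gatk", "BaseRecalibrator", "-I", bam, "-R", reference]
    for path in known_sites:
        # The flag is repeated once per file; any number of files is accepted.
        cmd += ["--known-sites", path]
    cmd += ["-O", output_table]
    return cmd


if __name__ == "__main__":
    # Hypothetical inputs; the two known-sites files are only an example.
    cmd = build_bqsr_command(
        "sample.bam",
        "Homo_sapiens_assembly38.fasta",
        [
            "Homo_sapiens_assembly38.dbsnp138.vcf.gz",
            "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
        ],
        "sample.recal.table",
    )
    print(shlex.join(cmd))
```

Unlike in VQSR, there are no known/training/truth tags or priors here. Every file is treated the same way, as a list of positions to skip.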
Thank you so much for the detailed response Santosh Anand. I really do appreciate the clarity :).
Happy that it was helpful :-)