I want to obtain a VCF file containing genotype calls and their scores for every rsID, whether or not a variant was called. I was planning to use the following steps:
- HaplotypeCaller
  --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY --emitRefConfidence BP_RESOLUTION
- awk, to keep only the records that have an ID:
  awk '{ if ( $3 != "." ) { print $0; } }' variants.vcf > variants.filtered.vcf
- GenotypeGVCFs
  --includeNonVariantSites
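For reference, the filtering step in the list above could also be wrapped in a small shell function. This is only a sketch: it assumes a tab-delimited, uncompressed VCF, and the name filter_with_id is made up for illustration. It keeps header lines explicitly and passes through only records whose ID column (field 3) is populated.

```shell
# Hypothetical helper (not part of GATK): keep VCF header lines plus
# records whose ID column (field 3) is neither "." nor empty.
# Assumes tab-delimited, uncompressed input on stdin or as file arguments.
filter_with_id() {
    awk -F'\t' '/^#/ || ($3 != "." && $3 != "")' "$@"
}
```

It would be used as: filter_with_id variants.vcf > variants.filtered.vcf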
Using the most recent dbSNP download here: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/All.vcf, I ran this:
GATK -T HaplotypeCaller \
--reference_sequence GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
--input_file recalibrated.bam \
--dbsnp current_dbsnp/All.vcf.gz \
--genotyping_mode DISCOVERY \
--output_mode EMIT_VARIANTS_ONLY \
--emitRefConfidence BP_RESOLUTION \
--out variants.vcf
However, the ID column only contains ".". How can I get HaplotypeCaller to populate the ID column with rsIDs? Also, is there a better way to get variant and non-variant genotype calls with HaplotypeCaller?
Thanks.
That's concerning, then. I thought VCF was always 1-based.
However, I don't think that's the issue, since with BP_RESOLUTION literally every position is called (chr1:1, chr1:2, chr1:3, ...).
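To double-check that every record really has "." in the ID column, and not just the first few thousand reference positions, the column can be tallied. A rough sketch, assuming a tab-delimited, uncompressed VCF; count_ids is a made-up helper name, not a GATK tool.

```shell
# Hypothetical helper: count VCF records with and without a populated ID column.
# Assumes tab-delimited, uncompressed input on stdin or as file arguments.
count_ids() {
    awk -F'\t' '
        /^#/      { next }              # skip header lines
        $3 == "." { missing++; next }   # record has no ID
                  { with_id++ }         # any other value, e.g. an rsID
        END       { printf "with_id=%d missing=%d\n", with_id, missing }
    ' "$@"
}
```

Running count_ids variants.vcf prints a single line such as with_id=N missing=M, so a non-zero with_id would mean at least some sites are being annotated.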