Dear all,
I am analyzing DNA-seq data to identify variants in two human genes using a single-sample GATK4 pipeline (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-).
HaplotypeCaller identified known variants with rs IDs, such as rs11490246. I then applied GenotypeGVCFs. In the GenotypeGVCFs output, none of the known variants found at the HaplotypeCaller step were present.
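For reference, a minimal sketch of how the rs IDs in each file could be counted, assuming a POSIX shell with gzip and awk available. The helper name count_rs_ids is hypothetical and not part of GATK; the file names are the ones from the commands in this post.

```shell
# Count VCF records whose ID column (3rd tab-separated field) starts with "rs".
# count_rs_ids is a hypothetical helper for illustration, not a GATK tool.
count_rs_ids() {
    # gzip -cdf decompresses gzip/bgzip input and passes plain text through unchanged
    gzip -cdf "$1" | awk -F'\t' '!/^#/ && $3 ~ /^rs/ { n++ } END { print n + 0 }'
}

# Compare the HaplotypeCaller output with the GenotypeGVCFs output (skipped if a file is absent).
for f in 21050.HaplotypeCaller.output.vcf.gz \
         21050.HaplotypeCaller.output.g.vcf.gz.GenotypeGVCFs.output.vcf.gz; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_rs_ids "$f")"
    fi
done
```

A non-zero count for the first file and zero for the second would match what is described above.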
Here is my full pipeline after generating the analysis-ready BAM files. It is run on one BAM file from a single sample. I don't have multiple BAM files from the same sample, which is why I followed the single-sample approach described on the website (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-).
Variant calling
HaplotypeCaller
gatk --java-options "-Xmx5g" HaplotypeCaller -R hg19.fa -I 21050.AnalyseReady.bam -O 21050.HaplotypeCaller.output.vcf.gz -D dbsnp_138.hg19.vcf -ERC GVCF
GenotypeGVCFs
gatk --java-options "-Xmx4g" GenotypeGVCFs -R hg19.fa -V 21050.HaplotypeCaller.output.vcf.gz -O 21050.HaplotypeCaller.output.g.vcf.gz.GenotypeGVCFs.output.vcf.gz --tmp-dir=./Temp_files
CNNScoreVariants
gatk CNNScoreVariants -V 21050.HaplotypeCaller.output.g.vcf.gz.GenotypeGVCFs.output.vcf.gz -R hg19.fa -O 21050.HaplotypeCaller.output.NEW.vcf.gz.CNNScoreVariants.OUT.annotated.vcf
FilterVariantTranches
gatk --java-options "-Xmx7g" FilterVariantTranches -V 21050.HaplotypeCaller.output.NEW.vcf.gz.CNNScoreVariants.OUT.annotated.vcf --resource Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.EDITED.gz --info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 -O 21050.FilterVariantTranches.output.cnns.cnnfilter.vcf
Am I missing something, or should I skip GenotypeGVCFs? I am not sure whether the GenotypeGVCFs step belongs in a single-sample variant-calling pipeline.