Hello all,
I'm a bit confused about which steps are necessary and which are unlikely to add much benefit. I have 2 jobs to complete for 2 different research groups we support: 1) germline short variant discovery on whole exome sequencing (WES) data collected from 1 mouse (1 sample in total), and 2) germline short variant discovery on whole genome sequencing (WGS) data collected from 2 macaques (2 samples in total).
I have written a wrapper that follows the GATK Best Practices from FASTQ preprocessing through HaplotypeCaller, with the appropriate conditional logic and the required files specific to each species and to WGS/WES.
According to the GATK workflow, my next steps after running HaplotypeCaller (with --emit-ref-confidence GVCF) in the pipeline are 1) consolidate the GVCFs, 2) joint-call the cohort, and 3) run VQSR ("probably the hardest part of the Best Practices to get right").
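To make the question concrete, here is a minimal sketch of those steps as I understand them, written as Python that only builds GATK4 command lines and runs nothing. The file names, the helper functions, and the choice of CombineGVCFs for the consolidation step are illustrative assumptions, not my actual wrapper. VQSR is left out because it is the step I am least sure applies.

```python
# Sketch only: builds GATK4 command lines as argument lists; nothing is executed.
# All file names and function names here are hypothetical placeholders.

def haplotypecaller_cmd(ref, bam, out, gvcf=True):
    """Per-sample calling; gvcf=True adds --emit-ref-confidence GVCF."""
    cmd = ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", out]
    if gvcf:
        cmd += ["--emit-ref-confidence", "GVCF"]
    return cmd


def combine_gvcfs_cmd(ref, gvcfs, out):
    """Consolidation step (assumed here to be CombineGVCFs)."""
    cmd = ["gatk", "CombineGVCFs", "-R", ref]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    return cmd + ["-O", out]


def genotype_gvcfs_cmd(ref, gvcf, out):
    """Joint-calling step on a single (possibly combined) GVCF."""
    return ["gatk", "GenotypeGVCFs", "-R", ref, "-V", gvcf, "-O", out]


def plan(ref, bams):
    """bams maps sample name -> BAM path; returns the ordered list of commands."""
    cmds, gvcfs = [], []
    for sample, bam in bams.items():
        gvcf = f"{sample}.g.vcf.gz"
        cmds.append(haplotypecaller_cmd(ref, bam, gvcf))
        gvcfs.append(gvcf)
    if len(gvcfs) > 1:
        # More than one sample: consolidate before joint genotyping.
        joint_input = "cohort.g.vcf.gz"
        cmds.append(combine_gvcfs_cmd(ref, gvcfs, joint_input))
    else:
        # Assumption: with one sample the GVCF is genotyped directly.
        joint_input = gvcfs[0]
    cmds.append(genotype_gvcfs_cmd(ref, joint_input, "cohort.vcf.gz"))
    return cmds
```

For the macaque job, plan("macaque.fasta", {"m1": "m1.bam", "m2": "m2.bam"}) would give four commands (two HaplotypeCaller runs, one CombineGVCFs, one GenotypeGVCFs); for the single mouse sample it would give two.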
Considering that I have only 1 or 2 samples, in species where truth data sets may not be available, is it pointless to do some or all of these steps? Should I just stick to the variants called in each sample by HaplotypeCaller? Should I remove "--emit-ref-confidence GVCF" and just create a regular VCF?
Thank you, Kenneth