Hi everyone. I recently started working with DNA whole genome sequencing for variant calling with GATK 4.0.
I am working on a fish species for which I don't have a database of known SNPs or indels. I have a total of 394 individuals, so 394 WGS samples, and I would like to use the GVCF workflow.
According to what I have read, I need to create these lists of known SNPs and indels from my own data. However, I have a couple of questions about the pipeline to achieve this.
1) To generate the list of SNPs and indels that will be provided as input for Base Quality Score Recalibration, should I run HaplotypeCaller in normal mode (where I get a .vcf file)? Or should I use GVCF mode in this first round of HaplotypeCaller (where I get a .g.vcf file)?
2) Since this first HaplotypeCaller round will be done per sample, I will end up with 394 output files. Should I combine them all and keep only the high-quality variants, so that I have a single file of SNPs and a single file of indels to use for all 394 samples? Or should each sample be recalibrated with its own set of SNPs and indels?
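To make the first option in question 2 concrete, here is a minimal Python sketch of what I mean by "keep only the high quality variants" and splitting them into one SNP list and one indel list. It is only an illustration: the function names and the QUAL cutoff of 30 are placeholders I made up, not recommended values, and I assume that in practice this step would be done with GATK's own tools (such as SelectVariants and VariantFiltration) rather than a script.

```python
def classify(ref, alt):
    """Naive typing: 'SNP' if REF and every ALT allele are one base long,
    otherwise 'INDEL'. Symbolic alleles, '*' and MNPs are not handled
    specially here, so this is a simplification."""
    alleles = [ref] + alt.split(",")
    return "SNP" if all(len(a) == 1 for a in alleles) else "INDEL"


def split_high_quality(vcf_lines, min_qual=30.0):
    """Return (snps, indels): the VCF records with QUAL >= min_qual.

    min_qual is a hypothetical placeholder threshold.
    """
    snps, indels = [], []
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        # Skip records with no ALT allele or no usable quality score.
        if alt == "." or qual == "." or float(qual) < min_qual:
            continue
        if classify(ref, alt) == "SNP":
            snps.append(line)
        else:
            indels.append(line)
    return snps, indels


# Made-up records, only to show the behaviour.
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50.0\t.\t.",   # SNP, passes the cutoff
    "chr1\t200\t.\tAT\tA\t45.0\t.\t.",  # deletion, passes the cutoff
    "chr1\t300\t.\tC\tT\t10.0\t.\t.",   # SNP, below the cutoff
]
snps, indels = split_high_quality(demo)
```

The idea would be to run this once on the combined call set, so that all 394 samples are recalibrated against the same two lists.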
Many thanks to all of you for your help and support.
Lidia