Dear community members,
I have a lot of variants to genotype (>6 million) and a lot of WGS samples (represented as BAM and VCF files).
My genotyping strategy so far has been to read the list of variants and then iterate through the VCF files with a custom Python script. However, I anticipate this will be very slow for such a huge number of samples.
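For reference, the lookup my script does is roughly like the simplified sketch below (assuming pysam and bgzipped, tabix-indexed VCFs; the variant list and file names are just placeholders):

# Simplified sketch of my current approach (assumes pysam and
# bgzipped + tabix-indexed VCFs; variants and paths are placeholders).
import pysam

variants = [("chr1", 123456, "A", "G"), ("chr2", 234567, "C", "T")]  # hg19 list
sample_vcfs = ["sample1.vcf.gz", "sample2.vcf.gz"]  # one VCF per WGS sample

genotypes = {}
for path in sample_vcfs:
    vcf = pysam.VariantFile(path)
    sample = list(vcf.header.samples)[0]
    for chrom, pos, ref, alt in variants:
        # fetch() takes a 0-based start and returns records overlapping the window
        for rec in vcf.fetch(chrom, pos - 1, pos):
            if rec.pos == pos and rec.ref == ref and alt in (rec.alts or ()):
                genotypes[(path, chrom, pos, ref, alt)] = rec.samples[sample]["GT"]
    vcf.close()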
Is there a way to quickly genotype a huge WGS cohort? Should I use BAM or VCF files for that?
Another issue is that the VCFs are called against GRCh38, while the variants for genotyping are in hg19 coordinates, so for variants where the reference allele changed in GRCh38 the VCFs alone may not be enough, but this is a minor problem...
Liftover your list of variants to GRCh38, split the lifted list into regions of XXX variants each, and call the BAMs in GVCF mode with GATK over those regions. Combine and genotype the GVCFs, then concatenate the per-region results. Use a workflow manager to run everything in parallel; a rough sketch is below.
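The per-region steps would look roughly like this sketch (the liftover itself can be done beforehand with e.g. Picard LiftoverVcf or CrossMap). File names, the region BED, and the chunk size are placeholders, and in practice each call would be a separate task submitted by your workflow manager (Snakemake, Nextflow, etc.) rather than one serial script:

# Sketch of the per-region GATK pipeline (placeholders throughout;
# each call would normally be its own parallel task).
import subprocess

region_bed = "region_0001.grch38.bed"   # one chunk of the lifted-over variant list
reference = "GRCh38.fa"
bams = ["sample1.bam", "sample2.bam"]
gvcfs = []

# 1. Call each BAM in GVCF mode, restricted to this region chunk
for bam in bams:
    gvcf = bam.replace(".bam", ".region_0001.g.vcf.gz")
    subprocess.run([
        "gatk", "HaplotypeCaller",
        "-R", reference, "-I", bam, "-L", region_bed,
        "-ERC", "GVCF", "-O", gvcf,
    ], check=True)
    gvcfs.append(gvcf)

# 2. Combine the per-sample GVCFs for this region
combine_cmd = ["gatk", "CombineGVCFs", "-R", reference,
               "-O", "region_0001.combined.g.vcf.gz"]
for g in gvcfs:
    combine_cmd += ["-V", g]
subprocess.run(combine_cmd, check=True)

# 3. Joint-genotype the combined GVCF for this region
subprocess.run([
    "gatk", "GenotypeGVCFs",
    "-R", reference,
    "-V", "region_0001.combined.g.vcf.gz",
    "-O", "region_0001.genotyped.vcf.gz",
], check=True)

# Once all regions have finished, concatenate the per-region VCFs,
# e.g. with "bcftools concat" on the ordered list of region VCFs.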
Thanks a lot! I am not very familiar with the GATK toolchain, but I guess it is time to learn =)