I have a relatively small sample of human exomes (n=11) that I would like to call SNPs for, using the GATK pipeline. From reading the GATK documentation, it seems that the best way to do this is to use many exomes as "background" for the genotype calling and refinement steps. My lab, however, is very new to exome-seq, and we only have the 11 we generated on hand.
Is there a database somewhere of exome data that I could use as the background? I think gVCFs would be preferable, but I could work from FASTQ or BAM files if necessary. We have access to the UK Biobank, but it seems there were some issues with their exome data that might dissuade me from using their gVCFs. If there aren't any exomes available, would there be a problem with using whole-genome data (like the gVCFs available from HGDP) for this step?
GATK should offer resources if they recommend something in their pipeline. Can you show us the page where they make this recommendation?
Yeah, I guess they don't say it explicitly, but here they definitely seem to allude to having a large cohort. Several people at my institute have also recommended running it with other background exomes. I think this makes sense, since the refinement is a machine learning-based method.
Take a look at this page: https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle