Hello,
I am currently trying to build a germline variant calling pipeline using GATK. One step is Variant Quality Score Recalibration (VQSR). For this I need high-confidence SNPs and indels to train the model.
GATK offers these SNPs and indels for the latest reference genome, but not for the one that I am using (GRCh37). I read that the VCF files from the 1000 Genomes Project contain only high-confidence germline variant calls, and I think they might be suitable for my purpose.
So my question is whether any of you know where I could download a VCF file that contains all of the SNPs and indels of the 1000 Genomes Project phase 3.
Is it possible to download the individual chromosomes from here and then combine them? I am afraid that this resource does not contain only "high confidence" variants. I think this might be the case because combining these VCF files would result in a gigantic VCF file, whereas the GRCh38 "gold standard high confidence SNP" VCF from GATK is only 7 GB uncompressed.
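To make the "download per chromosome and combine" idea concrete, here is a minimal Python sketch of what I mean by combining. It assumes that all per-chromosome files share the same header and sample columns and that they are passed in the desired chromosome order. The function names and file names are made up for illustration, and in practice a dedicated tool would probably be the better choice.

```python
import gzip
from pathlib import Path


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)


def concat_vcfs(in_paths, out_path):
    """Concatenate per-chromosome VCFs into one file.

    Assumptions (not checked here): every input has the same header and
    sample columns, and in_paths is already in chromosome order.
    The header is taken from the first file only.
    Returns the number of variant records written.
    """
    n_records = 0
    with open_text(out_path, "wt") as out:
        for i, path in enumerate(in_paths):
            with open_text(path) as handle:
                for line in handle:
                    if line.startswith("#"):
                        # Keep header lines from the first file, skip the rest.
                        if i == 0:
                            out.write(line)
                        continue
                    out.write(line)
                    n_records += 1
    return n_records
```

For example, `concat_vcfs(["chr1.vcf.gz", "chr2.vcf.gz"], "all.vcf")` would write one uncompressed VCF (hypothetical file names). Note that an output path ending in `.gz` would be written with plain gzip, not BGZF, so it could not be indexed with tabix.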
I would be very grateful for any suggestions or links where I can download the data that I am looking for.
Cheers
This is great, thank you very much.
Will it be a problem that GATK used the b37 reference genome and I am using hs37d5.fa (GRCh37)? The included contigs are quite different.
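To see exactly how the contigs differ, one could compare the reference's `.fai` index against the `##contig` lines in the VCF header. The following is only a rough sketch: it assumes an uncompressed VCF whose header actually carries `##contig=<ID=...,length=...>` entries and a samtools-style `.fai` file next to the FASTA, and the function names are hypothetical.

```python
import re


def fai_contigs(fai_path):
    """Read contig names and lengths from a samtools-style .fai index."""
    contigs = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                contigs[fields[0]] = int(fields[1])
    return contigs


def vcf_header_contigs(vcf_path):
    """Read ##contig entries from the header of an uncompressed VCF."""
    contigs = {}
    with open(vcf_path) as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # end of the meta-information lines
            if line.startswith("##contig="):
                m_id = re.search(r"ID=([^,>]+)", line)
                m_len = re.search(r"length=(\d+)", line)
                if m_id:
                    length = int(m_len.group(1)) if m_len else None
                    contigs[m_id.group(1)] = length
    return contigs


def compare_contigs(ref, vcf):
    """Report contigs unique to either side and shared contigs whose lengths differ."""
    shared = set(ref) & set(vcf)
    return {
        "only_in_reference": sorted(set(ref) - set(vcf)),
        "only_in_vcf": sorted(set(vcf) - set(ref)),
        "length_mismatch": sorted(
            c for c in shared if vcf[c] is not None and ref[c] != vcf[c]
        ),
    }
```

If the only differences are extra contigs in the reference and the shared contigs have identical names and lengths, that would at least show that the coordinates of the shared contigs refer to sequences of the same size. Whether GATK accepts the combination is a separate question that this check does not answer.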
Do you also by chance know where I can get something similar for the indels?