Hi,
I have a general question about the truth sets to use for the BQSR step in the GATK workflow. I am aware that several variant datasets (SNPs and indels) from phase 1 of the 1000 Genomes Project are currently used for this, but the consortium has since released phase 3 variants as well. Their biallelic SNVs and indels are available here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.wgs.shapeit2_integrated_snvindels_v2a.GRCh38.27022019.sites.vcf.gz
Would it be okay to use this instead of the phase 1 datasets (SNPs and indels) that can be found here? https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0;tab=objects?pli=1&prefix=&forceOnObjectsSortingFiltering=false
I would like to know what the community thinks about this.
There is also this dataset: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz, but it contains multiallelic variants, structural variation, etc., so I won't be using it.
Regards, Prasun