Query regarding callsets used as known sites in Variant Calling
13 months ago

Hi,

Where can I learn more about the standard VCF files that are used as known sites during the BQSR step in Variant Calling with GATK? The files are:

  • Homo_sapiens_assembly38.dbsnp138.vcf
  • Homo_sapiens_assembly38.known_indels.vcf.gz
  • Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

I am aware that these files are available in the GATK resource bundle, but I would like to know more about how they were prepared. It would be a great help if you could point me to any scientific literature documenting their creation.

Were these call sets derived from alignments to the GRCh38 primary assembly, or to the full assembly including ALT loci? Or are there separate VCF files that include the ALT loci?

Thanks.

GATK VCF • 719 views

You can always look at the VCF header; it should indicate which reference sequences were used for alignment. It would be very bad practice to do variant calling on a different assembly than the one used for alignment.
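To make the header check concrete, here is a minimal Python sketch that pulls the `##reference` value and the `##contig` IDs out of a VCF header. The function names `open_vcf` and `summarize_header` are made up for illustration, and the sketch assumes the file actually carries `##reference` and `##contig` meta lines, which not every VCF does.

```python
import gzip
import re


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def summarize_header(lines):
    """Collect ##reference values and ##contig IDs from VCF header lines."""
    references, contigs = [], []
    for line in lines:
        if not line.startswith("##"):
            break  # meta lines end at the #CHROM column header
        line = line.rstrip("\n")
        if line.startswith("##reference="):
            references.append(line.split("=", 1)[1])
        elif line.startswith("##contig="):
            match = re.search(r"ID=([^,>]+)", line)
            if match:
                contigs.append(match.group(1))
    return references, contigs
```

With one of the bundle files downloaded locally, something like `with open_vcf("Mills_and_1000G_gold_standard.indels.hg38.vcf.gz") as fh: refs, contigs = summarize_header(fh)` should list the recorded reference and contig names.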

dbSNP is here: https://www.ncbi.nlm.nih.gov/snp/

I imagine they filtered it to retain only sites with an allele frequency (AF) above some threshold X, where X could be estimated by looking at the lowest AF in the VCF.
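A rough way to look for the lowest AF is to scan the INFO column of every record. The sketch below assumes the frequency is stored under an INFO key named `AF`; a given dbSNP release may use a different key or none at all, so check the `##INFO` lines in the header first. The function name `min_allele_frequency` is hypothetical.

```python
def min_allele_frequency(lines, key="AF"):
    """Return the smallest numeric value stored under `key` in INFO, or None."""
    lowest = None
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        for entry in fields[7].split(";"):
            name, _, value = entry.partition("=")
            if name != key or not value:
                continue
            # Multi-allelic sites carry one comma-separated value per ALT allele.
            for token in value.split(","):
                try:
                    af = float(token)
                except ValueError:
                    continue  # "." or another non-numeric placeholder
                if lowest is None or af < lowest:
                    lowest = af
    return lowest
```

If this returns `None`, the file has no usable values under that key, and the filtering guess cannot be tested this way.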


Hey,

Thanks for the suggestion. I did look at the headers of all the VCF files. They do include the alt and decoy contigs.
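For anyone repeating this check, a small Python sketch like the following can tally which kinds of contigs a header lists. It assumes the UCSC-style names used in the hg38 analysis set (`_alt`, `_decoy`, `_random` suffixes, `chrUn_` and `HLA-` prefixes, `chrEBV`); the function name `classify_contig` and the category labels are my own, and other naming schemes would need different rules.

```python
from collections import Counter


def classify_contig(name):
    """Guess a contig's category from its hg38-style name (naming is an assumption)."""
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_decoy"):
        return "decoy"  # checked before chrUn_, since decoys are named chrUn_..._decoy
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    if name.startswith("HLA-"):
        return "hla"
    if name == "chrEBV":
        return "viral"
    return "primary"


def tally_contigs(names):
    """Count how many contig names fall into each category."""
    return Counter(classify_contig(n) for n in names)
```

A header with a nonzero "alt" or "decoy" count only shows that those contigs are declared; whether any variant records actually sit on them has to be checked separately in the record lines.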

The headers show that the files were generated by running several GATK tools/commands, and the reference sequence indicated is human_g1k_v37.fasta.

So I guess this is a 'liftover' of variant calls from GRCh37 to GRCh38, but I am not sure. Do you have any idea about this?


That does sound likely... unfortunately, liftovers are going to vary the most in the unplaced / alt contigs. I consider those contigs suspect anyway, though; some do not even appear to be human, but rather contaminant or foreign DNA derived from infections.


Yes, members of the GATK forum confirmed that the VCF files were lifted over to hg38. And yes, as you mentioned, lifted-over variants will vary depending on how well the two reference genomes align.

Either way, at the moment, this resolves the confusion of whether the resource VCF files are suitable for GRCh38. Thank you very much for your time and help.
