I have VCF files that I want to convert into .bed files with plink to use for a proxy search. One issue I am having is that each variant ID must be unique. In these VCFs, the multi-allelic variants are split into bi-allelic records, so the same ID appears on several lines. Here is an example:
tabix gnomad.genomes.v3.1.2.hgdp_tgp.chr6.vcf.bgz chr6:29440751-29440751 | cut -f 1-5
chr6 29440751 rs2074464 A C
chr6 29440751 rs2074464 A G
chr6 29440751 rs2074464 A T
I know that with bcftools you can simply keep the first occurrence of a variant with -d, but this is problematic for LD calculations. I would like the record that gets preserved to be the one with the highest allele frequency of all the records with that ID, not simply the first occurrence of the variant. This way I will have a better chance of getting high r2 values when I calculate LD between this multi-allelic variant and another variant. Is this possible?
I think it can be done with a little work, but first an easier option: can you just collapse the biallelic records into multiallelic sites, so that you have unique IDs? Does your downstream software support multiallelic sites?
You can collapse sites this way with bcftools (bcftools norm -m +any), vcflib, or other tools; it is a standard operation.
If not, somebody may come up with a more complex solution.
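One possible shape for that more complex route, as a minimal sketch rather than a tested pipeline: stream the VCF records and, for each variant ID, keep only the record with the highest allele frequency. The function names info_af and keep_highest_af are made up for illustration. The sketch assumes the allele frequency sits in the INFO column under the key AF (pass a different key if your file uses another one), that the VCF is position-sorted so all records sharing an ID sit at the same CHROM/POS, and that records with a missing ID (".") should be left alone.

```python
def info_af(info, key="AF"):
    """Return the allele frequency stored under `key` in a VCF INFO string.

    Returns 0.0 when the key is absent or its value is not numeric.
    The key name "AF" is an assumption about the input file.
    """
    for field in info.split(";"):
        name, _, value = field.partition("=")
        if name == key and value:
            try:
                # Split records carry one ALT, so take the first value.
                return float(value.split(",")[0])
            except ValueError:
                return 0.0
    return 0.0


def keep_highest_af(lines, key="AF"):
    """Yield VCF lines, keeping one record per variant ID at each position.

    Among records that share an ID, the one with the highest AF is kept
    (ties keep the first one seen). Header lines and records whose ID is
    "." pass through unchanged. Assumes a position-sorted VCF.
    """
    group_pos = None   # (CHROM, POS) of the records currently buffered
    group = []         # buffered records at that position, in input order
    best = {}          # ID -> (AF, index of the kept record in `group`)
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        pos = (cols[0], cols[1])
        if pos != group_pos:
            # New position: flush the previous group and start over.
            yield from group
            group_pos, group, best = pos, [], {}
        vid = cols[2]
        af = info_af(cols[7], key)
        if vid == ".":
            group.append(line)
        elif vid not in best:
            best[vid] = (af, len(group))
            group.append(line)
        elif af > best[vid][0]:
            # Higher AF than the record kept so far: replace it in place.
            idx = best[vid][1]
            best[vid] = (af, idx)
            group[idx] = line
    yield from group
```

To use it as a filter, feed it decompressed VCF text, for example sys.stdout.writelines(keep_highest_af(sys.stdin)) after importing sys, and pass the result on to plink. Only the records at one position are buffered at a time, so memory use stays small even for whole-chromosome files.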