Hello,
I am using the plink2 1000 Genomes files, converting them to VCF, using CrossMap to convert positions to hg19, and then converting back to plink format. I noticed that there are variants (n=9,598 out of 55,307,498) which have the same chromosome, position, and alleles, but occasionally with inconsistent genotypes.
My best guess of what is happening: these variants were mapped to different positions in hg38 and then mapped to the same position when lifting over. However, I am curious how odd an observation this is. With such a small number, perhaps my concerns are unwarranted. But given the somewhat strict variant-calling process, I would expect duplicate variants caused by remapping to have the exact same genotypes.
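For anyone wanting to reproduce the check, here is a minimal Python sketch of how the duplicates could be flagged and compared. It is not what plink2 or CrossMap do internally; the record layout (variant ID, chromosome, position, REF, ALT, genotype tuple), the function name `find_duplicate_sites`, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

def find_duplicate_sites(variants):
    """Group records by (chrom, pos, ref, alt) and report sites seen more than once.

    variants: iterable of (variant_id, chrom, pos, ref, alt, genotypes),
    where genotypes is a tuple of per-sample allele counts (hypothetical layout).
    """
    groups = defaultdict(list)
    for vid, chrom, pos, ref, alt, gts in variants:
        groups[(chrom, pos, ref, alt)].append((vid, gts))

    report = {}
    for key, members in groups.items():
        if len(members) > 1:
            first_gts = members[0][1]
            # Concordant only if every duplicate carries identical genotypes.
            concordant = all(gts == first_gts for _, gts in members[1:])
            report[key] = {
                "ids": [vid for vid, _ in members],
                "concordant": concordant,
            }
    return report

# Toy data (made up): two duplicated sites, one with discordant genotypes.
variants = [
    ("v1", "1", 1000, "A", "G", (0, 1, 2)),
    ("v2", "1", 1000, "A", "G", (0, 1, 1)),
    ("v3", "1", 2000, "C", "T", (0, 0, 1)),
    ("v4", "2", 500, "G", "A", (2, 2, 0)),
    ("v5", "2", 500, "G", "A", (2, 2, 0)),
]
dups = find_duplicate_sites(variants)
for site, info in dups.items():
    print(site, info["ids"], "concordant" if info["concordant"] else "discordant")
```

Splitting the duplicated sites into concordant and discordant groups makes it easier to judge how many of the 9,598 are actually a concern.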
Thank you!
It does happen occasionally. With the UCSC chain file hg38ToHg19.over.chain.gz it happens more often than with the hg19ToHg38.over.chain.gz file, but it nevertheless happens both ways. This might be related to regions that are under-represented in one assembly or, conversely, over-represented in the other. Note that you will also have the issue of variants mapping to unplaced contigs. My best suggestion is to use hg38. What resource in particular is pulling you to hg19?
I am conducting coloc analysis between a GWAS dataset and several genotype datasets. The GWAS does not have rsIDs and the A1/A2 alleles don't follow the assembly (often they are flipped, which means A1/A2 represents major/minor for the GWAS cohort rather than the positive/negative strand of the assembly).
Either way, I need to convert the GWAS to hg38 or convert the genotype data to hg19. Most of what I could find about the subject highlights liftover with genotype data... so that is what I did.
From what you are saying, the A1/A2 alleles might be swapped; that is, one will still match the reference but you don't know which one. The term "flipped" is better left to indicate that the alleles were reverse complemented because of strand mismatches, which is a different issue. If the alleles are just swapped rather than flipped, then at least for SNVs you should be able to handle this with BCFtools/munge (for most indels it is impossible to figure out which one was the reference allele from the two alleles alone) and then you can convert the output to hg38 with BCFtools/liftover. Disclaimer: I wrote these tools.
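To make the swapped versus flipped distinction concrete, here is a small Python sketch for SNVs. It only illustrates the terminology and is not how BCFtools/munge is implemented; the function name `classify_snv` and the category labels are assumptions. Sites whose two alleles are complements of each other (A/T or C/G) are reported as ambiguous, because a swap and a strand flip cannot be told apart from the alleles alone.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def classify_snv(ref, alt, a1, a2):
    """Classify how a GWAS allele pair (a1, a2) relates to the assembly's (ref, alt).

    Hypothetical helper for illustration; handles single-nucleotide alleles only.
    """
    # Palindromic sites (A/T, C/G): strand cannot be resolved from alleles alone.
    if COMPLEMENT[ref] == alt:
        return "ambiguous"
    if (a1, a2) == (ref, alt):
        return "match"
    if (a1, a2) == (alt, ref):
        return "swapped"
    # Complement both alleles to test for a strand mismatch.
    f1, f2 = COMPLEMENT[a1], COMPLEMENT[a2]
    if (f1, f2) == (ref, alt):
        return "flipped"
    if (f1, f2) == (alt, ref):
        return "flipped and swapped"
    return "mismatch"

# Assembly says REF=A, ALT=G at this position.
for a1, a2 in [("A", "G"), ("G", "A"), ("T", "C"), ("C", "T"), ("A", "C")]:
    print(a1, a2, classify_snv("A", "G", a1, a2))
```

For indels this kind of check does not work in general, for the reason given above: the reference allele usually cannot be determined from the two alleles alone.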
Surely there is already an hg19-mapped VCF of 1kg data. Have you looked at ALL.2of4intersection.20100804.genotypes.vcf.gz?
The ALL.2of4intersection.20100804.genotypes.vcf.gz file only includes 629 samples and it is from the original low coverage version of the 1000 Genomes project, not the better quality high coverage version.
https://hgdownload.soe.ucsc.edu/gbdb/hg19/1000Genomes/phase3/
Phase 3 is the last version of the low coverage data, released in 2013. The high coverage callset uses different sequencing data and was initially released in 2020. See Table 1 here for some additional explanation of the differences between the two callsets.