Hi,
I have used the standalone UCSC liftOver tool to convert positions across builds, but I read that this tool is not ideal for single variants. Is this true?
Since this is such a common task, I was wondering whether there are improved tools these days suited specifically for single variants.
I do not have rsIDs for all my variants (otherwise it seems the best practice would be to use the corresponding dbSNP version to do this); I only have chr:pos, allele 1 and allele 2. I do not have the frequencies of these alleles in the population, but I am assuming the population is EUR.
I found this bcftools plugin interesting (https://github.com/freeseek/score#liftover-vcfs), which claims to avoid limitations of other liftover tools.
Thank you in advance for any insight.
wow thank you Giulio, so definitely not a good idea to convert through rsIDs... So now I would not go through rsIDs anymore when trying to liftOver but always use only chr:pos...
I was using the chain file from UCSC http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz but I see from your link that there are different ones actually. Which one do you suggest? I will read more.
This chain file, which you and Giulio have mentioned, is exactly the same chain file that I mentioned indirectly in my answer, if you follow the link that I provided.
For a liftover from hg19 to hg38 I have observed that the UCSC chain file drops ten times fewer variants than the Ensembl chain file, so it is the one I would recommend using.
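To make it a bit more concrete why a chain file can "drop" a variant, here is a minimal sketch of how a single position is mapped through the aligned blocks of one chain. The chain text below is a made-up toy (the names chrA/chrB and all numbers are hypothetical), the helper names parse_chain and lift are mine, and only the forward strand is handled. It is not how liftOver or BCFtools/liftover are implemented, just the coordinate arithmetic the chain format implies.

```python
# Toy chain in UCSC chain format (hypothetical data, forward strand only).
# Header: chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
# Body lines: "size dt dq" (ungapped block size, gap in source, gap in destination);
# the last body line holds only the size of the final block.
TOY_CHAIN = """\
chain 1000 chrA 100 + 10 60 chrB 120 + 20 75 1
20 5 10
25
"""

def parse_chain(text):
    """Return the ungapped blocks as (src_start, dst_start, size), 0-based."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    header = lines[0]
    src_pos, dst_pos = int(header[5]), int(header[10])
    blocks = []
    for fields in lines[1:]:
        size = int(fields[0])
        blocks.append((src_pos, dst_pos, size))
        src_pos += size
        dst_pos += size
        if len(fields) == 3:
            src_pos += int(fields[1])  # bases present only in the source build
            dst_pos += int(fields[2])  # bases present only in the destination build
    return blocks

def lift(blocks, pos_1based):
    """Map a 1-based source position; None if it falls in a gap or outside the chain."""
    p = pos_1based - 1  # chain coordinates are 0-based
    for src_start, dst_start, size in blocks:
        if src_start <= p < src_start + size:
            return dst_start + (p - src_start) + 1
    return None

blocks = parse_chain(TOY_CHAIN)
print(lift(blocks, 11))  # inside the first block
print(lift(blocks, 31))  # inside the gap between blocks, so it cannot be mapped
```

A position that lands between two blocks has no counterpart in the destination build according to that chain, which is one way variants end up dropped; different chain files can place those gaps differently.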
Question for Giulio: do these issues also happen within the same dbSNP build? Specifically for my task, I am wondering whether, when the exact dbSNP build used to map positions to rsIDs is known, it is indeed ok to use that dbSNP build to liftOver variants.
I think there are lingering issues. Take STRs as an example. An rsID will sometimes not correspond to a pair of alleles but to a set of alleles. And I don't see how you automatically get a map of which alleles correspond to what across genome assemblies just by looking up the rsID. Take rsID rs10565182 (previously rs66483207). It has the following VCF representations in hg19 and hg38:
But can you infer this from the dbSNP record?
Thank you Giulio. There still seem to be differences between tools. For example, a SNP that is at 10:102493680 in build 37 becomes 10:100733922 with BCFtools/liftover, while with other tools it becomes 10:100733923. The same goes for 10:10152460, which becomes 10:10110496 with BCFtools/liftover and 10:10110497 with other tools, and for 10:106392798, which becomes 10:104633037 with BCFtools/liftover and 10:104633040 with other tools, plus several others in my data.
I have tried and, at least at the VCF level, I get something different from what you suggest. Assuming you are talking about SNP rs72553532:
Thanks Giulio. Ok, this difference, at least for the particular SNP we are looking at, seems to depend on the alleles: I am trying to recover the build 38 position for this SNP, which is an INDEL and has alleles A>-,AA. Specifically, I have the following alleles for that position, which are different from your test above:
. .
I also tried normalizing first using bcftools norm -m+, but the record becomes 10 102493679 chr10:100733923 CA C . . . with the same result.
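For anyone wondering why a single-A deletion ends up written as CA > C one base to the left, here is a minimal sketch of left-alignment on a toy reference. The reference string GCAAT and the function name left_align are hypothetical, the loop follows the commonly described trim-right / extend-left procedure for a biallelic indel, and it is only an illustration, not the code used by bcftools.

```python
def left_align(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant.

    ref_seq is a toy reference string, pos is 1-based.
    Returns the normalized (pos, ref, alt).
    """
    changed = True
    while changed:
        changed = False
        # Drop the last base while both alleles end with the same base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, prepend the preceding reference base.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the reference")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

TOY_REF = "GCAAT"  # hypothetical reference: G1 C2 A3 A4 T5

# The same one-base deletion written with two different anchors:
print(left_align(TOY_REF, 3, "AA", "A"))
print(left_align(TOY_REF, 4, "AT", "T"))
```

Both inputs collapse to the same record anchored on the C just before the run of As, which is the same pattern as the CA > C record above.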
You can try with the option --no-left-align:
Notice that this is the same as variant:
However, the normalized version is what you get without the --no-left-align option. See here for a thorough explanation of what normalized VCF records are.
Ok, but the result is the same as before for the lifted position (it maps to chr10:100733922 and not to chr10:100733923). Is there any way I can get it to map to position chr10:100733923?
The same goes for other SNPs like 10:106392798, which becomes 10:104633037 with BCFtools/liftover while with other tools it becomes 10:104633040:
As it is a deletion, it does not map to one base pair. Would you say that it is the A at position 100733923 or the A at position 100733924 that got deleted? You cannot say, as either deletion gives you the same variant. To get the kind of coordinate you want, you would need to post-process the output VCF to put the VCF records back in the non-normalized format you are interested in. You might be able to get away with the option --no-left-align and a simple post-processing script.
I see, so this mismatch would happen only when the position in question overlaps INDELs. Thank you.
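To illustrate the point about the deletion not mapping to one base pair, here is a small sketch on a toy reference (the string GCAAT and both function names are hypothetical). It checks that deleting either A gives the same sequence and lists every start position that describes the same deletion.

```python
def apply_deletion(ref_seq, start, length=1):
    """Return ref_seq with `length` bases removed, starting at 1-based `start`."""
    return ref_seq[:start - 1] + ref_seq[start - 1 + length:]

def equivalent_deletion_starts(ref_seq, start, length=1):
    """All 1-based start positions whose deletion yields the same sequence."""
    target = apply_deletion(ref_seq, start, length)
    last_start = len(ref_seq) - length + 1
    return [s for s in range(1, last_start + 1)
            if apply_deletion(ref_seq, s, length) == target]

TOY_REF = "GCAAT"  # hypothetical reference: G1 C2 A3 A4 T5

# Deleting the A at position 3 or the A at position 4 gives the same haplotype.
print(apply_deletion(TOY_REF, 3), apply_deletion(TOY_REF, 4))

# Every placement of the deleted base that describes the same variant.
starts = equivalent_deletion_starts(TOY_REF, 4)
print(starts)

# A left-aligned VCF record is anchored on the base just before the leftmost
# placement, so its POS is one less than the first position in `starts`.
vcf_pos = min(starts) - 1
print(vcf_pos)
```

In this toy case the left-aligned VCF POS is 2 while the deleted base can be reported as 3 or 4, which mirrors the 100733922 versus 100733923 difference discussed above. A post-processing step that reports VCF POS + 1 for deletions would be one hypothetical way to recover the coordinate the other tools give, but whether that matches a given tool's convention is something to check on your own data.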