Current best practive to liftOver many SNVs from hg19 to GRCh38
2
0
Entering edit mode
3 months ago
User 7754 ▴ 270

Hi,

I had used UCSC liftOver standalone tool to convert positions across builds, but I read that this tool is not ideal for single variants. Is this true?

Since this is such a common task, I was wondering whether there are improved tools these days suited specifically for single variants.

I do not have rsIDs for all my variants (otherwise it seems as the best practice would be to use the corresponding dbSNP version to do this), but I only have chr:pos and allele 1 and allele 2. I do not have the frequency of these alleles in the population but assuming the population is EUR.

I found this bcftools plugin interesting (https://github.com/freeseek/score#liftover-vcfs), which claims to avoid limitations of other liftover tools.

Thank you in advance for any insight.

liftOver • 1.4k views
ADD COMMENT
3
Entering edit mode
3 months ago

As the author of BCFtools/liftover, I am going to be biased but I advise to avoid relying on dbSNP as rsIDs are not stable. Also notice that the choice of the tool is not the only option that matters. The choice of which liftover chain to use also matters

ADD COMMENT
0
Entering edit mode

wow thank you Giulio, so definitely not a good idea to convert through rsIDs... So now I would not go through rsIDs anymore when trying to liftOver but always use only chr:pos...

I was using the chain file from UCSC http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz but I see from your link that there are different ones actually. Which one do you suggest? I will read more.

ADD REPLY
0
Entering edit mode

This chain file, which you and Giulio have mentioned, is exactly that same chain file that I mentioned indirectly in my answer, if you may follow the link that I provided.

ADD REPLY
0
Entering edit mode

For a liftover from hg19 to hg38 I have observed that the UCSC chain file drops ten times less variants than the Ensembl chain file, so it is the one I would recommend using.

ADD REPLY
0
Entering edit mode

question for Giulio, do these issues also happen within the same dbSNP build - specifically for my task, I am wondering if, when the exact dbSNP build used to map positions to rsID is known, it is indeed ok to use that dbSNP built to liftOver variants.

ADD REPLY
1
Entering edit mode

I think there are lingering issues. Take STRs as an example. An rsID will sometimes not correspond to a pair of alleles but to a set of alleles. And I don't see how you automatically get a map of which alleles correspond to what across genome assemblies just by looking up the rsID. Take rsID rs10565182 (previously rs66483207). It has the following VCF representations in hg19 and hg38:

chr10 47138263 GATA G
chr10 46411479 A AATT

But can you evince this from the dbSNP record?

ADD REPLY
0
Entering edit mode

Thank you Giulio. There seems to continue being differences between tools, for example using BCFtools/liftover a SNP that is in 37 build at 10:102493680 with BCFtools/liftover becomes 10:100733922 while with other tools it is 10:100733923. Same for 10:10152460, with BCFtools/liftover is 10:10110496 while with other tools it is 10:10110497, or 10:106392798 with BCFtools/liftover is 10:104633037 while with other tools it is 10:104633040, and several others in my data.

ADD REPLY
0
Entering edit mode

I have tried and, at least at the VCF level, I get something different from what you suggest. Assuming you are talking about SNP rs72553532:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr10>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr10\t102493680\trs72553532\tA\tAC,AT\t.\t.\t.") | \
bcftools +liftover -- -s hg19.fa -f hg38.fa -c hg19ToHg38.over.chain.gz | \
bcftools view -H
chr10   100733923   rs72553532  A   AC,AT   .   .   .
ADD REPLY
0
Entering edit mode

Thanks Giulio, Ok this difference at least for the particular SNP we are looking at seems to be dependent on the alleles: I am trying to recover the 38 position for this SNP which is an INDEL and has alleles A>-,AA. Specifically, I have the following alleles for that position, which are different from your test above:

(echo "##fileformat=VCFv4.2" 
echo "##contig=<ID=10>" 
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" 
echo -e "10\t102493680\trs72553532\tAA\tA\t.\t.\t.") | bcftools +liftover -- -s hg19.fa -f hg38.fa -c hg19ToHg38.over.chain.gz | bcftools view -H
Lines   total/swapped/reference added/rejected: 1/0/0/0
chr10   100733922   rs72553532  CA  C   .   

. .

I also tried normalizing first using bcftools norm -m+ but the alignment becomes 10 102493679 chr10:100733923 CA C . . . with the same result.

ADD REPLY
0
Entering edit mode

You can try with the option --no-left-align:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr10>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr10\t102493680\trs72553532\tAA\tA\t.\t.\t.") | \
bcftools +liftover -- -s hg19.fa -f hg38.fa -c hg19ToHg38.over.chain.gz --no-left-align | \
bcftools view -H
chr10   100733922   rs72553532  CAAT    CAT .   .   .

Notice that this is the same as variant:

chr10   100733923   rs72553532  AA  A   .   .   .

However, the normalized version is what you get without the --no-left-align option. See here for a thorough explanation for what normalized VCF records are

ADD REPLY
0
Entering edit mode

Ok but the result is the same as before for the lifted position (maps to chr10:100733922 and not to chr10:100733923), is there anyway I can get it to map to the position chr10:100733923?


Same for other SNPs like 10:106392798 with BCFtools/liftover is 10:104633037 while with other tools it is 10:104633040: 


(echo "##fileformat=VCFv4.2" echo "##contig=<ID=chr10>" echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" echo -e "chr10\t106392798\trs78405707\tTT\tT\t.\t.\t.") | \ bcftools +liftover -- -s hg19.fa -f hg38.fa -f hg38.fa -c hg19ToHg38.over.chain.gz | \ bcftools view -H

Lines   total/swapped/reference added/rejected: 1/0/0/0
chr10   104633037   rs78405707  CT  C   .   .   .
ADD REPLY
0
Entering edit mode

As it is a deletion, it does not map to one base pair. Would you say that it is the A at position 100733923 or the A at position 100733924 that got deleted? You cannot say as either deletion gives you the same variant. To get the kind of coordinate you want you would need a post-processing of the output VCF that puts the VCF records back in the non-normalized format you are interested in. You might be able to get away with option --no-left-align and a simple post-processing script

ADD REPLY
0
Entering edit mode

I see, so this mismatch would happen only when the position in question overlaps INDELs. Thank you.

ADD REPLY
0
Entering edit mode
3 months ago

I had posted a Tutorial here: Automated dbSNP lookup by rsID position, plus genome build liftover

Please let me know if it helps [or not]

Kevin

ADD COMMENT
0
Entering edit mode

Thank you Kevin but it sounds like going through rsIDs is anyway a bad option because of the issues posted by Giulio above...

With that in mind, it seems better to start from chr:pos in hg19 and convert the chr:positions to 38. Now I am digging into what are the differences in results between using rtracklayer from your script, or UCSC liftOver from command line, or Giulio's BCFtools/liftover... (it seems as there are differences in speed and BCFtools/liftover should be faster, but from what I understood it requires both the fasta file to be downloaded apart from the chain file so that increases time)

ADD REPLY
0
Entering edit mode

Please note that, at the bottom of my linked answer, there's a method provided that is independent of rs ID. That is that to which I was referring you.

ADD REPLY
1
Entering edit mode

Yes I have tried your script (that is what I was referring to in the comment above as rtracklayer from your script) which works well, but using only the positions unfortunately does not find some SNPs (e.g. rs28525262, position in 37 is 1:1584102, alleles G/A, MAF 0.3), that have an rsID, that is why I was wondering if there was a way to map more of them. I am also mapping with Giulio's BCFtools/liftover but I do not expect much improvement compared to using chr:pos with rtracklayer from your script because it is using the same chain file. Maybe knowing exactly the dbSNP built for the SNPs that did not get lifted could help here...

UPDATE: Giulio's BCFtools/liftover was the approach that mapped more SNPs compared to other approaches.

ADD REPLY

Login before adding your answer.

Traffic: 2455 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6