More than 90% of variants were failed to match, using LiftoverVcf of Picard.
2
0
Entering edit mode
4 months ago

Hi, I'm having a problem using Picard's LiftoverVcf.

To lift over a vcf.gz file from hg38 to hg19, I ran the code below. The file passes were correct in the real script.

Picard LiftoverVcf -I Iutput.vcf.gz \
-O Output.vcf.gz \
-CHAIN hg38ToHg19.over.chain.gz \
-REJECT Rejected.vcf.gz \
-R hg19.fasta \
--LOG_FAILED_INTERVALS \
--WARN_ON_MISSING_CONTIG \
-USE_JDK_DEFLATER true \
-USE_JDK_INFLATER true

It did run, but the result was strange; more than 90 % of variants failed to match.

Iutput.vcf.gz was created from a bam file using HaplotypeCaller in GATK. There was no problem with the contigs. I tried using LIFTOVER_MIN_MATCH option with 0.90, but the result didn't change.

Should I use other options? Or is another process required for Iutput.vcf.gz before lifting over?

I'm not planning to use other tools for now, so please give me hints only about Picard's LiftoverVcf.

Best, Koki

Liftover LiftoverVcf picard • 727 views
ADD COMMENT
0
Entering edit mode

Can you show a few examples from the Rejected.vcf.gz file for variants that failed to match?

ADD REPLY
0
Entering edit mode

Thank you for replying. This is a part of Rejected.vcf.gz.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  KS007
chr1    1       .       N       <NON_REF>       .       NoTarget        END=10118       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
chr1    10119   .       C       <NON_REF>       .       IndelStraddlesMultipleIntervals END=10150       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,45
chr1    10152   .       A       <NON_REF>       .       IndelStraddlesMultipleIntervals END=10157       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,45
chr1    10158   .       A       <NON_REF>       .       IndelStraddlesMultipleIntervals END=10160       GT:DP:GQ:MIN_DP:PL      0/0:1:0:1:0,0,0
chr1    10161   .       C       <NON_REF>       .       IndelStraddlesMultipleIntervals END=10162       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,45
chr1    10164   .       A       <NON_REF>       .       IndelStraddlesMultipleIntervals END=10169       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,45
chr1    10170   .       A       <NON_REF>       .       IndelStraddlesMultipleIntervals END=10262       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
chr1    10263   .       A       <NON_REF>       .       IndelStraddlesMultipleIntervals END=10303       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,31
chr1    10305   .       A       <NON_REF>       .       IndelStraddlesMultipleIntervals END=10309       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,31
chr1    10311   .       A       <NON_REF>       .       IndelStraddlesMultipleIntervals END=10320       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,31

I showed first 10 lines. Thank you.

ADD REPLY
3
Entering edit mode
4 months ago

you're trying to liftover a GVCF file, not a VCF ., https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

there is no reason why you should liftover a GVCF file.

Run gatk GenotypeGVCF before.

ADD COMMENT
0
Entering edit mode

Runnifg gatk GenotypeGVCF, the problem was solved completely!!!

Only about 1.5% was mismatched with defaut LIFTOVER_MIN_MATCH (=1).

Thank you very much!

ADD REPLY
1
Entering edit mode
4 months ago

If you look at the source code of LiftoverVcf you find the following code:

for (final VariantContext ctx : in) {
    ++total;
    final Interval source = new Interval(ctx.getContig(), ctx.getStart(), ctx.getEnd(), false, ctx.getContig() + ":" + ctx.getStart() + "-" + ctx.getEnd());
    final Interval target = liftOver.liftOver(source, LIFTOVER_MIN_MATCH);

    // target is null when there is no good liftover for the context. This happens either when it fall in a gap
    // where there isn't a chain, or if a large enough proportion of it is diminished by the "deletion" at the
    // end of each interval in a chain.
    if (target == null) {
        rejectVariant(ctx, FILTER_NO_TARGET);
        continue;
    }

    // the target is the lifted-over interval comprised of the start/stop of the variant context,
    // if the sizes of target and ctx do not match, it means that the interval grew or shrank during
    // liftover which must be due to straddling multiple intervals in the liftover chain.
    // This would invalidate the indel as it isn't clear what the resulting alleles should be.
    if (ctx.getReference().length() != target.length()) {
        rejectVariant(ctx, FILTER_INDEL_STRADDLES_TWO_INTERVALS);
        continue;
    }

Take variant chr1 10119 . C <NON_REF> . . END=10150 as an example. This variant will return:

ctx.getStart()=10119
ctx.getEnd()=10150
ctx.getReference().length()=1
target.length()=32

So that the condition ctx.getReference().length() == target.length() is not satisfied. The problem is that Picard LiftoverVcf expects the INFO END field to be equal to POS + length of REF allele - 1, which has to be true for precise variants. However, since what you have is a GVCF and not a VCF and the variant includes a symbolic allele, this is not the case and it does not match what Picard LiftoverVcf expects. You would have better luck using a more sophisticated tool such as BCFtools/liftover but in general it is not recommend to perform liftovers of GVCF files

ADD COMMENT
0
Entering edit mode

Thank you for telling me the source script in detail! As you pointed, I didn't convert vcf.gz file to GVCF.

Using GenotypeGVCF, it went succesfully. Only about 1.5% was mismatched with defaut LIFTOVER_MIN_MATCH (=1).

Thank you very much!

ADD REPLY

Login before adding your answer.

Traffic: 1694 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6