comparing VCF files
0
0
Entering edit mode
7 months ago
eebloom ▴ 90

I am having issue when trying to compare two VCF files. I have variant calls using nanomonsv on ONT long-read data of the COLO829/COLO829BL tumour/normal pair and a somatic truthset in VCF format. note: I am using the truthset lifted over to GRCh38 and I have modified the chromosome names in the file from 1,2,3.. to chr1,chr2... etc. to allow for comparison with VCFs produced from variant callers.

I am trying to use bedtools intersect to see which variants are reported in both VCF files. I have played around with -f and -r parameters to see extent of overlap.

My issue is, whilst I get some overlap i.e. variants reported in both files, no deletions are reported as overlapping even though they should from my perspective.

For example, the truthset contains:

chr1    117973563   truthset_2_1    T   T[1:117973599[  .   PASS    SVLEN=36;SVTYPE=DEL;SUPP_SEQ=ILL,ONT,PB;SUPP_VAL=PCR,CAPTURE;GENE=SPAG17;CLUSTER=.  GT  0/1

This same variant is reported by nanomonsv:

chr1    117973563   d_2 T   <DEL>   .   PASS    END=117973598;SVTYPE=DEL;SVLEN=-35  TR:VR   32:14   32:0

I thought maybe this was because the SVLEN was reported as an absolute value in the truthset but was a negative integer in the other. But when I changed this so both were negative (as is more commonly reported in VCFs produced by SV callers), the outcome did not change.

What might be the reason these variants are not reported as overlapping?

long-reads VCF variants ONT nanopore • 550 views
ADD COMMENT
0
Entering edit mode

vcf breakends likely are not handled well by simple tools like a bedtools intersect. breakends (the T[1:117973599[ syntax, see vcf spec chapter on it) are in a very unique universe. my personal belief is that much more/better tooling is needed to properly interpret breakends. i think maybe, in the case of this "simple" deletion, the breakend syntax is likely unnecessary and should have been reported as a simple "<DEL>" and it would be easier to compare across experiments, but there are many more complex breakend cases, and those require essentially a graph traversal since breakends induce a graph on the genome

ADD REPLY
0
Entering edit mode

Thanks for this reply. I see your point about the breakend end notation. Although seems strange as both are identified as a deletion and not SVTYPE=BND. I agree that a better tool for comparisons are needed. I like using intersect as it has some flexibility on the extent of overlap compared to others e.g. bcftools. Can you see any workaround to this? Perhaps converting VCF to bedpe format then using pairtopair? Cheers

ADD REPLY

Login before adding your answer.

Traffic: 1561 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6