I am having issue when trying to compare two VCF files. I have variant calls using nanomonsv on ONT long-read data of the COLO829/COLO829BL tumour/normal pair and a somatic truthset in VCF format. note: I am using the truthset lifted over to GRCh38 and I have modified the chromosome names in the file from 1,2,3.. to chr1,chr2... etc. to allow for comparison with VCFs produced from variant callers.
I am trying to use bedtools intersect
to see which variants are reported in both VCF files. I have played around with -f
and -r
parameters to see extent of overlap.
My issue is, whilst I get some overlap i.e. variants reported in both files, no deletions are reported as overlapping even though they should from my perspective.
For example, the truthset contains:
chr1 117973563 truthset_2_1 T T[1:117973599[ . PASS SVLEN=36;SVTYPE=DEL;SUPP_SEQ=ILL,ONT,PB;SUPP_VAL=PCR,CAPTURE;GENE=SPAG17;CLUSTER=. GT 0/1
This same variant is reported by nanomonsv:
chr1 117973563 d_2 T <DEL> . PASS END=117973598;SVTYPE=DEL;SVLEN=-35 TR:VR 32:14 32:0
I thought maybe this was because the SVLEN was reported as an absolute value in the truthset but was a negative integer in the other. But when I changed this so both were negative (as is more commonly reported in VCFs produced by SV callers), the outcome did not change.
What might be the reason these variants are not reported as overlapping?
vcf breakends likely are not handled well by simple tools like a bedtools intersect. breakends (the T[1:117973599[ syntax, see vcf spec chapter on it) are in a very unique universe. my personal belief is that much more/better tooling is needed to properly interpret breakends. i think maybe, in the case of this "simple" deletion, the breakend syntax is likely unnecessary and should have been reported as a simple
"<DEL>"
and it would be easier to compare across experiments, but there are many more complex breakend cases, and those require essentially a graph traversal since breakends induce a graph on the genomeThanks for this reply. I see your point about the breakend end notation. Although seems strange as both are identified as a deletion and not SVTYPE=BND. I agree that a better tool for comparisons are needed. I like using intersect as it has some flexibility on the extent of overlap compared to others e.g. bcftools. Can you see any workaround to this? Perhaps converting VCF to bedpe format then using pairtopair? Cheers