Question

Existing methods to evaluate the performance of SV/CNV calling tools

8.7 years ago

Leandro Lima ▴ 970

Hi everyone

Does anyone know of existing pipelines to evaluate SV/CNV calling tools?

Suppose I have a BED file with the ground truth for simulated SVs, and BED files with results for different tools.

Which measures do people use to compare the performance of such tools?

I know that some people use "touch" (counting a hit when there is at least 1 base of overlap) or "percentage" (requiring the overlap to reach a threshold percentage before counting a hit) to calculate concordance, precision, etc.
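
To make the two criteria concrete, here is a minimal Python sketch of how they could be turned into precision and recall. It assumes each BED record has already been parsed into a `(chrom, start, end)` tuple with half-open coordinates and `start < end`. The function names, the `min_frac` and `reciprocal` parameters, and the toy coordinates are all hypothetical, and SV type is ignored.

```python
from collections import defaultdict

def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open BED intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def is_hit(call, truth, min_frac=0.0, reciprocal=False):
    """min_frac == 0 is the 'touch' rule (at least 1 shared base).
    Otherwise the overlap must cover at least min_frac of the truth
    interval, and of the call as well when reciprocal=True."""
    ov = overlap_bp(call[1], call[2], truth[1], truth[2])
    if ov == 0:
        return False
    if min_frac <= 0:
        return True
    ok = ov / (truth[2] - truth[1]) >= min_frac
    if reciprocal:
        ok = ok and ov / (call[2] - call[1]) >= min_frac
    return ok

def evaluate(calls, truth, min_frac=0.0, reciprocal=False):
    """Precision = calls hitting any truth event / all calls.
    Recall = truth events hit by any call / all truth events."""
    truth_by_chrom = defaultdict(list)
    for t in truth:
        truth_by_chrom[t[0]].append(t)
    matched_truth = set()
    true_calls = 0
    for c in calls:
        hits = [t for t in truth_by_chrom[c[0]]
                if is_hit(c, t, min_frac, reciprocal)]
        if hits:
            true_calls += 1
            matched_truth.update(hits)
    precision = true_calls / len(calls) if calls else 0.0
    recall = len(matched_truth) / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy data (made up for illustration only).
truth = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 300)]
calls = [("chr1", 150, 250), ("chr1", 590, 700), ("chr1", 1000, 1100)]

touch = evaluate(calls, truth)                                 # 1 bp rule
strict = evaluate(calls, truth, min_frac=0.5, reciprocal=True)  # 50% reciprocal
print(touch, strict)
```

On this toy data the second call shares only 10 bases with its truth event, so it counts under "touch" but not under 50% reciprocal overlap, which is why the two criteria can rank tools differently.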

Are there other measures or tools you guys are aware of?

Thanks in advance

Leandro

SV CNV • 2.6k views

updated 2.4 years ago by Ram 44k • written 8.7 years ago by Leandro Lima ▴ 970

Ram · Answer 1 · 2016-05-10

8.5 years ago

QVINTVS_FABIVS_MAXIMVS ★ 2.6k

I recently polished off version 1.0 of gtCNV, my machine learning script that genotypes CNVs.

I'm working on a paper detailing the methods. Our group likes to avoid heuristic filters and instead apply a more systematic approach to prioritizing CNVs. We implement in silico genotyping using two methods: SVTyper and my method.

Both methods have advantages and limitations, which is why we use both. My script takes your BED file and generates genotype likelihoods for the variants. It also annotates each variant against 1000 Genomes CNV positions, repeats, genes, and MEIs.

updated 6.2 years ago by Ram 44k • written 8.5 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.6k

Hi @QVINTVS,

I actually have simulated data with the ground truth in a BED file. I also have the calls for different tools (CNVnator, SVDetect, Lumpy). What I want to do is to evaluate these calls and I was wondering if someone had used different methods before.

Thanks, Leandro

written 8.5 years ago by Leandro Lima ▴ 970

Here's what we did. Because we were systematic with our CNV prioritization (i.e. going back to the BAM files and genotyping each putative CNV), our sensitivity was much greater than if we had applied the suggested filters for some CNV callers. Lumpy is actually pretty decent sometimes, and you can filter its calls on a high QUAL score. We determined a suitable QUAL cutoff for gtCNV and the other callers by calculating FDR with arrays (IRS test from SVToolkit).
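
As a rough illustration of picking a QUAL cutoff from FDR, here is a minimal Python sketch. It is not the IRS test: it assumes each call has already been labelled true or false some other way (for simulated data, by comparing against the ground-truth BED), and that each true call corresponds to a distinct true event. The function name `sweep_qual`, the `target_fdr` parameter, and the toy numbers are hypothetical.

```python
def sweep_qual(calls, n_truth, target_fdr=0.05):
    """calls: list of (qual, is_true) pairs; n_truth: number of true events.
    Returns (rows, best), where rows holds (cutoff, fdr, sensitivity) for
    every observed QUAL value and best is the row with the lowest cutoff
    whose FDR is <= target_fdr (None if no cutoff qualifies)."""
    rows = []
    for cutoff in sorted({q for q, _ in calls}):
        kept = [is_true for q, is_true in calls if q >= cutoff]
        tp = sum(kept)                 # true calls surviving the cutoff
        fp = len(kept) - tp            # false calls surviving the cutoff
        fdr = fp / len(kept)           # kept is never empty here
        sens = tp / n_truth if n_truth else 0.0
        rows.append((cutoff, fdr, sens))
    passing = [r for r in rows if r[1] <= target_fdr]
    best = min(passing, key=lambda r: r[0]) if passing else None
    return rows, best

# Toy labelled calls (made up for illustration only).
labelled = [(10, False), (20, True), (30, False),
            (40, True), (50, True), (60, True)]
rows, best = sweep_qual(labelled, n_truth=5, target_fdr=0.05)
print(best)
```

Taking the lowest passing cutoff keeps as many true calls as possible for the chosen FDR, which matters if sensitivity is the priority.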

If you only have positions, with no BAM files and no array data, then I can't think of anything else but to filter the positions on the reported QUAL score in the raw data and/or on overlap with LCRs, abParts, centromeres, telomeres, and any gaps in the genome (+ heterochromatin). Doing this, you will get a good (low) FDR but much lower sensitivity. Sensitivity is important for us because we are interested in de novo events.
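
A minimal Python sketch of that kind of position-only filter follows. It assumes calls are `(chrom, start, end, qual)` tuples with half-open coordinates and that the regions to avoid (LCRs, gaps, centromeres, and so on) have been merged into one list of `(chrom, start, end)` tuples. The function names, the `min_qual` parameter, and the coordinates are hypothetical.

```python
def overlaps_any(call, regions):
    """True if the call shares at least one base with any excluded region."""
    chrom, start, end = call[:3]
    return any(r_chrom == chrom and start < r_end and r_start < end
               for r_chrom, r_start, r_end in regions)

def filter_calls(calls, excluded, min_qual):
    """Keep calls with QUAL >= min_qual that avoid every excluded region."""
    return [c for c in calls
            if c[3] >= min_qual and not overlaps_any(c, excluded)]

# Toy data (made up for illustration only).
raw_calls = [("chr1", 100, 200, 50.0),    # kept
             ("chr1", 1000, 2000, 5.0),   # dropped: low QUAL
             ("chr1", 5000, 6000, 80.0),  # dropped: overlaps excluded region
             ("chr2", 100, 200, 90.0)]    # kept
excluded = [("chr1", 5500, 5600)]
kept = filter_calls(raw_calls, excluded, min_qual=20.0)
print(kept)
```

The linear scan over excluded regions is fine for a few thousand intervals; for genome-wide masks an interval tree or a sorted sweep would scale better.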

written 8.5 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.6k