GIAB Benchmarking
10 months ago
Mat • 0

Hi, I am trying to benchmark a few variant calling tools to compare their precision and recall.

I have decided to use exome data, as the file sizes are smaller than for WGS.

However, GIAB's high-confidence truth set VCF is based on WGS data, and when I ran a variant calling assessment tool against it I got single-digit recall percentages.

During the assessment I also included the Nextera Rapid Capture Exome Targeted Regions Manifest BED file, but that only resulted in a minute increase in recall.
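
If I understand correctly, the comparison should only be scored inside the overlap of GIAB's high-confidence regions and the exome capture targets; something along these lines would produce that intersection (filenames are placeholders, not my actual paths):

    # placeholder filenames; intersect GIAB high-confidence regions with the capture targets
    bedtools intersect \
        -a HG001_GRCh38_highconf.bed \
        -b nextera_rapid_capture_targets.bed \
        > highconf_in_exome_targets.bed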

Could someone help with this?

GIAB HG001 NGS benchmarking WES • 1.1k views

I published a paper about various alignment/mapping and variant calling algorithms a year ago, and the main message was that the choice of aligner/mapper has a much larger effect on precision/recall than the choice of variant caller. Maybe you also want to take a look at that.


Yes, I have observed that too when trying to reproduce a paper.

To my surprise, I found that, straight out of the box, bcftools produced output that was basically identical to GATK's (in fact, a tiny bit better), even though the latter was an order of magnitude more complex to run, with all the recalibration steps etc.
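
For what it's worth, "straight out of the box" here means roughly this kind of minimal pipeline (reference and BAM paths are placeholders):

    # minimal bcftools calling, no recalibration; paths are placeholders
    bcftools mpileup -f GRCh38.fa sample.bam \
        | bcftools call -mv -Oz -o bcftools_calls.vcf.gz
    bcftools index -t bcftools_calls.vcf.gz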

I also looked up your paper and will link it below:

https://www.nature.com/articles/s41598-022-26181-3


We're probably going to have to see your hap.py commands to get to the bottom of this.
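
For reference, a typical invocation looks roughly like the sketch below; the distinction between the confident-regions BED (-f) and the target-regions BED (-T) matters a lot for recall (all paths are placeholders):

    # placeholder paths; truth VCF and confident-regions BED come from GIAB,
    # -T restricts scoring to the exome capture targets
    hap.py HG001_GRCh38_benchmark.vcf.gz query.vcf.gz \
        -f HG001_GRCh38_highconf.bed \
        -T nextera_rapid_capture_targets.bed \
        -r GRCh38.fa \
        -o benchmark_out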


My apologies, I am not a programmer by training; I'm still trying to pick things up and learn a little. However, I was using Illumina's BaseSpace Variant Calling Assessment tool, which uses hap.py if I remember correctly. Its inputs are just the query VCFs and the target region BED files. I am not sure whether Garvan's exome panel was sequenced comprehensively, which might explain the low recall?
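
To check whether the run actually covers the capture targets well, I suppose something along these lines would give the mean depth over the target regions (paths are placeholders):

    # placeholder paths; mean depth across the capture target regions
    samtools depth -a -b nextera_rapid_capture_targets.bed sample.bam \
        | awk '{sum += $3} END {if (NR > 0) print "mean depth:", sum/NR}'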

I have also begun to look at more recent sequencing runs using HG001/NA12878. Here is the link. But looking at the reads, they seem very different from regular FASTQ data.


In these situations, look at the resulting files that tell you which variants fell into the false positive/false negative categories. That will immediately tell you whether you are using the tools incorrectly, whether the regions don't match, or something similar.
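
If the assessment tool exposes hap.py's annotated output VCF, the false negatives can be pulled out with something like this (assuming the output prefix was benchmark_out; BD is, if I remember right, the per-sample benchmarking decision, e.g. TP/FP/FN):

    # placeholder name; extract records labelled as false negatives
    bcftools view -i 'FORMAT/BD="FN"' benchmark_out.vcf.gz | less -S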

The low recall makes me suspect that the BED file might not match the regions that were called, perhaps because of a different genome build, etc.
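
A quick way to spot a build or contig-naming mismatch (e.g. chr1 vs 1) is to compare the contig names in the query VCF and in the BED file (paths are placeholders):

    # list the contig names used by the calls and by the target BED
    zcat query.vcf.gz | grep -v '^#' | cut -f1 | sort -u
    cut -f1 nextera_rapid_capture_targets.bed | sort -u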

In general, a simple visualization of both files in IGV will also be hugely helpful; you can eyeball the recall and precision from that as well.
