Hi, I am trying to benchmark a few variant calling software packages to compare their precision and recall.
I have decided to use exome data since the file sizes are smaller than for WGS.
One issue I have run into is that GIAB's high-confidence truth set VCF is derived from WGS data, which gives rise to single-digit recall percentages when I run a variant calling assessment tool.
During the assessment I also included the Nextera Rapid Capture Exome Targeted Regions Manifest BED file, but that only resulted in a minute increase in recall.
In these situations, look at the resulting files that tell you which variants went into the false positive/false negative categories. That will immediately tell you whether you are using the tools incorrectly, whether the regions don't match, or something similar.
The low recall makes me suspect that the BED file might not match the regions, perhaps because of a different genome build, etc.
In general, a simple visualization of both files in IGV will also be hugely helpful; you can eyeball the recall and precision from that as well.
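If you can download the comparison output, a couple of quick checks along these lines can narrow it down. This is only a sketch with placeholder file names, assuming bcftools and standard Unix tools are available:

    # Do the target BED and the query VCF use the same contig naming and build?
    # A "chr1" vs "1" mismatch alone is enough to wipe out recall.
    cut -f1 targets.bed | sort -u
    bcftools view -H query.vcf.gz | cut -f1 | sort -u

    # How many truth-set variants fall inside the target regions at all?
    # (truth.vcf.gz needs a tabix index for -R to work)
    bcftools view -H -R targets.bed truth.vcf.gz | wc -l

If the contig names differ or the last count is tiny, the region or build mismatch is the culprit rather than the callers themselves.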
I published a paper on various alignment/mapping and variant calling algorithms a year ago, and the main message was that the choice of aligner/mapper has a much bigger effect on precision/recall than the choice of variant caller. Maybe you also want to take a look at that.
Yes, I have observed that too when trying to reproduce a paper. To my surprise, I found that straight out of the box, bcftools produced output that was basically identical to GATK (as a matter of fact, a tiny bit better), even though the latter was an order of magnitude more complex to run, with all the recalibration etc.
I also looked up your paper and will link it below:
https://www.nature.com/articles/s41598-022-26181-3
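To make the comparison concrete, a minimal version of the two out-of-the-box pipelines might look roughly like this (file and sample names are placeholders, and neither side includes recalibration or filtering):

    # bcftools: pileup-based calling in a single pipe
    bcftools mpileup -f reference.fa sample.bam \
        | bcftools call -mv -Oz -o bcftools_calls.vcf.gz
    bcftools index bcftools_calls.vcf.gz

    # GATK: HaplotypeCaller with default settings
    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample.bam \
        -O gatk_calls.vcf.gz

Either VCF can then go through the same assessment against the same truth set, so the comparison stays apples to apples.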
We're probably going to have to see your hap.py commands to get to the bottom of this.
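For reference, a bare-bones local run against the GIAB truth set, restricted to the exome targets, would look something like this. File names are placeholders, and the flags are worth double-checking against the hap.py documentation for your version:

    hap.py \
        HG001_GIAB_highconf_truth.vcf.gz \
        my_exome_calls.vcf.gz \
        -r reference.fa \
        -f HG001_GIAB_highconf_regions.bed \
        -T nexterarapidcapture_exome_targetedregions.bed \
        -o hg001_exome_benchmark

    # hg001_exome_benchmark.summary.csv holds the precision/recall numbers,
    # and hg001_exome_benchmark.vcf.gz annotates each record as TP/FP/FN.

Here -f is the GIAB confident-call region BED and -T restricts scoring to the capture targets; if BaseSpace exposes the underlying command or its output files, those are the pieces to compare.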
My apologies, I am not a programmer by training; I'm still trying to pick things up and learn a little. However, I was using Illumina's BaseSpace variant calling assessment tool, which uses hap.py if I remember correctly. Its only inputs are the query VCFs and the target region BED files. I am not sure whether Garvan's exome panel was sequenced comprehensively, which might explain the low recall?
I have also begun to look at more recent sequencing runs using HG001/NA12878. Here is the link. But looking at the reads, they seem very different from regular FASTQ data.