Hi!
I tried to do what other posts reported and I have a problem that I do not fully understand why ...
1) I downloaded the fastq files from Garvan (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/) with the bed file. I had to convert the bed file to hg38 (my_regions) ... as I understand it is in hg19.
2) I get the vcf (truth.vcf) and high confidence (confidence.bed) files from here https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/
3) I ran the fastq files with the GATK codes, up to HaplotypeCaller. Garvan use 2 libraries for the same sample. I used them separately, I did not merg them at the end, before the variant calling.
4) To compere my query data (output of HaplotypeCaller) with the GIAB truth test I ran these two codes, to have also a .html file:
hap.py ${truth.vcf} ${query.vcf} -f ${confident} -o /hap.py_results/${outname} -r ${Hg38}
python rep.py -o /hap.py_results/"${outname}.html" "${outname}"_hap.py:/hap.py_results/"${outname}.roc.all.csv.gz"
5) I re-ran my results also with a similar code ... I found and tested it in a GA4GH tutorial:
hap.py ${truth.vcf} ${query.vcf} -f ${confident.bed} -o /hap.py_results/"${outname}" --engine=vcfeval --engine-vcfeval-path=rtg --no-decompose --no-leftshift
6) I ran also
rtg vcfeval -b ${truth.vcf} -c ${query.vcf} -o /hap.py_results/rtgHC/ -t ${ref_sdf} -e ${confidence.bed} --region=${my_regions}
Error: After applying regions there were no sequence names in common between the reference and the supplied variant sets. Check the regions supplied by --region or --bed-regions are correct.
my_regions:
chr1 826206 827522
chr1 827683 827775
...
confidence regions:
chr1 821881 822046
chr1 823126 823188
chr1 823426 823479
chr1 823580 826321
chr1 826481 827827
chr1 827928 839291
...
RESULTS:
With the codes 5 and 6 the problem is that I have a recall of 1.65% ... the precision is 97.07% and the F-score is 0.033
What is the problem about "recall" and at the end with F-score? The intervals? How can I fix the problem with the bed files? How they done the analysis considering that the intervals in the to bed files do not overlap?
Many thanks for your time!
can we see a snippet of the query.vcf?
just to be clear your liftover regions are only used with GATK, not hap.py or rtg, correct?
In GATK (BQSR and HaplotypeCaller) I used the lifted over of the original "nexterarapidcapture_expandedexome_targetedregions.bed.gz"
The same file (lifted over) I used also in the point 6, --region=${my_regions}
In the points 4 and 5 (-f option) I used "HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed"
you need to run vcfeval with
--bed-regions
not--regions
, the latter is for inline<sequence_name>:<start>-<end>
-type argumentsYou are great man!!! I really thank you !!! Now at least the code in the point 6 works!
Now I have to fix points 4 and 5!
In case, are general guidelines to improve the pipelines? It is not so good for the moment looking the data ...
In general is better to use WGS than WES data for this kind of analysis, always about the NA12878?
Many thanks
a lot of benchmarking focuses on HG002 NA24385. I think exomes or even smaller subsets of genes are still more popular than WGS for benchmarks. The company I am working for, Truwl, is developing some dashboards to make it a lot easier to manage these type of comparisons. I'll follow up here when a demo is ready.
Many thanks, I will also try with HG002 !
Ok to follow up with this thread we have released our variant benchmarking workflow https://truwl.com/workflows/library/Variant%20Benchmarking/0.2 that does all this stuff for you