Question

troubleshooting benchmarking small variants: hap.py and rtg

1

Entering edit mode

4.0 years ago

emma.a ▴ 130

Hi!

I tried to do what other posts reported and I have a problem that I do not fully understand why ...

1) I downloaded the fastq files from Garvan (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/) with the bed file. I had to convert the bed file to hg38 (my_regions) ... as I understand it is in hg19.

2) I get the vcf (truth.vcf) and high confidence (confidence.bed) files from here https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/

3) I ran the fastq files with the GATK codes, up to HaplotypeCaller. Garvan use 2 libraries for the same sample. I used them separately, I did not merg them at the end, before the variant calling.

4) To compere my query data (output of HaplotypeCaller) with the GIAB truth test I ran these two codes, to have also a .html file:

hap.py ${truth.vcf} ${query.vcf} -f ${confident} -o /hap.py_results/${outname} -r ${Hg38}

python rep.py -o /hap.py_results/"${outname}.html" "${outname}"_hap.py:/hap.py_results/"${outname}.roc.all.csv.gz"

5) I re-ran my results also with a similar code ... I found and tested it in a GA4GH tutorial:

hap.py ${truth.vcf} ${query.vcf} -f ${confident.bed} -o /hap.py_results/"${outname}" --engine=vcfeval --engine-vcfeval-path=rtg --no-decompose --no-leftshift

6) I ran also

rtg vcfeval -b ${truth.vcf} -c ${query.vcf} -o /hap.py_results/rtgHC/ -t ${ref_sdf} -e ${confidence.bed} --region=${my_regions}

Error: After applying regions there were no sequence names in common between the reference and the supplied variant sets. Check the regions supplied by --region or --bed-regions are correct.

my_regions:

chr1    826206  827522
chr1    827683  827775

...

confidence regions:

chr1    821881  822046
chr1    823126  823188
chr1    823426  823479
chr1    823580  826321
chr1    826481  827827
chr1    827928  839291

...

RESULTS:

With the codes 5 and 6 the problem is that I have a recall of 1.65% ... the precision is 97.07% and the F-score is 0.033

What is the problem about "recall" and at the end with F-score? The intervals? How can I fix the problem with the bed files? How they done the analysis considering that the intervals in the to bed files do not overlap?

Many thanks for your time!

benchmarking • 3.1k views

ADD COMMENT • link updated 3.6 years ago by Jeremy Leipzig 23k • written 4.0 years ago by emma.a ▴ 130

0

Entering edit mode

can we see a snippet of the query.vcf?

ADD REPLY • link 4.0 years ago by Jeremy Leipzig 23k

1

Entering edit mode

chr1 14677 . G A 471.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.113;DP=53;ExcessHet=3.0103;FS=3.209;MBQ=30,30;MFRL=164,149;MLEAC=1;MLEAF=0.500;MMQ=24,24;MPOS=26;MQ=28.34;MQRankSum=0.903;QD=9.83;ReadPosRankSum=-0.156;SOR=0.246 GT:AD:DP:GQ:PL 0/1:26,22:48:99:479,0,581

ADD REPLY • link 4.0 years ago by emma.a ▴ 130

0

Entering edit mode

just to be clear your liftover regions are only used with GATK, not hap.py or rtg, correct?

ADD REPLY • link 4.0 years ago by Jeremy Leipzig 23k

0

Entering edit mode

In GATK (BQSR and HaplotypeCaller) I used the lifted over of the original "nexterarapidcapture_expandedexome_targetedregions.bed.gz"

The same file (lifted over) I used also in the point 6, --region=${my_regions}

In the points 4 and 5 (-f option) I used "HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed"

ADD REPLY • link 4.0 years ago by emma.a ▴ 130

2

Entering edit mode

you need to run vcfeval with --bed-regions not --regions, the latter is for inline <sequence_name>:<start>-<end>-type arguments

ADD REPLY • link 4.0 years ago by Jeremy Leipzig 23k

1

Entering edit mode

You are great man!!! I really thank you !!! Now at least the code in the point 6 works!

Threshold    Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
 3.000       0.9437      0.8938      0.9181
 None        0.9432      0.8940      0.9180

Now I have to fix points 4 and 5!

In case, are general guidelines to improve the pipelines? It is not so good for the moment looking the data ...

In general is better to use WGS than WES data for this kind of analysis, always about the NA12878?

Many thanks

ADD REPLY • link 4.0 years ago by emma.a ▴ 130

0

Entering edit mode

a lot of benchmarking focuses on HG002 NA24385. I think exomes or even smaller subsets of genes are still more popular than WGS for benchmarks. The company I am working for, Truwl, is developing some dashboards to make it a lot easier to manage these type of comparisons. I'll follow up here when a demo is ready.

ADD REPLY • link 4.0 years ago by Jeremy Leipzig 23k

0

Entering edit mode

Many thanks, I will also try with HG002 !

ADD REPLY • link 4.0 years ago by emma.a ▴ 130

1

Entering edit mode

Ok to follow up with this thread we have released our variant benchmarking workflow https://truwl.com/workflows/library/Variant%20Benchmarking/0.2 that does all this stuff for you

ADD REPLY • link 3.6 years ago by Jeremy Leipzig 23k