How to validate the SNP calling pipeline.
I am looking for SNP calling validation methodologies.
The recommended benchmark dataset is from the NIST (National Institute of Standards and Technology) Genome in a Bottle (GIAB) consortium. It is updated frequently with new data and sequencing technologies.
Paper: https://www.nature.com/articles/nbt.2835
Data download: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/
A recommended tool for comparing your results with the benchmark is VarMatch:
Paper: https://academic.oup.com/bioinformatics/article/33/9/1301/2736365
Tool download: https://github.com/medvedevgroup/varmatch
The recommended command to run VarMatch:
./varmatch -b benchmark.vcf -q query.vcf -g ref.fa -o output -f
Here:
benchmark.vcf is the benchmark data. If you use the GIAB benchmark above, it is HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf
query.vcf is your SNP calling result.
ref.fa is the reference genome sequence.
VarMatch provides 2x2x3 = 12 different strategy combinations for validation, and each of them is simple to understand. Under certain conditions, different strategies can lead to different conclusions about whether your pipeline is better than another. I would recommend the default strategy, using the -f parameter. If you are interested in the details of the different strategies, you can read the paper.
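To make concrete what "comparing your result with the benchmark" produces, here is a minimal Python sketch that counts true positives, false positives, and false negatives for SNPs and derives precision, recall, and F1. It is not VarMatch's algorithm: it assumes both VCFs describe each SNP identically (same chromosome naming, position, REF, and ALT), which is exactly the assumption VarMatch is designed to relax. The function names load_snps and compare are hypothetical, chosen for illustration.

```python
def load_snps(vcf_lines):
    """Collect SNPs as (chrom, pos, ref, alt) tuples from VCF text lines.

    Assumption: exact-representation matching only; indels and complex
    variants are skipped, and genotypes are ignored.
    """
    snps = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3].upper()
        for alt in fields[4].upper().split(","):
            # keep single-base substitutions only
            if len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT":
                snps.add((chrom, pos, ref, alt))
    return snps


def compare(benchmark, query):
    """Return TP/FP/FN counts plus precision, recall, and F1 for two SNP sets."""
    tp = len(benchmark & query)   # called and present in the benchmark
    fp = len(query - benchmark)   # called but absent from the benchmark
    fn = len(benchmark - query)   # in the benchmark but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy in-memory records; for real files, pass an open file handle instead,
# e.g. load_snps(open("query.vcf")).
demo_benchmark = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\n",
    "chr1\t300\t.\tG\tA\t.\tPASS\t.\n",
]
demo_query = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\n",
    "chr1\t400\t.\tT\tC\t.\tPASS\t.\n",
]
demo_result = compare(load_snps(demo_benchmark), load_snps(demo_query))
print(demo_result)
```

For a fair comparison against GIAB, the counts should also be restricted to the high-confidence regions distributed with the benchmark; this sketch leaves that step out.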
Here are 300x-coverage human NGS reads in FASTQ format: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/
The issue that I have with this 'reference' dataset (GIAB) is that none of the variants in it are confirmed by the gold standard. Metrics are instead derived by comparing variant calls between multiple variant callers processing the same data. Over the genome scale, I believe the possibility is very high that all callers will be missing some variants that are in the GIAB samples. The NIST guidelines, therefore, are biased, and consist of comparisons between one erroneous analytical method and another.
We were given a 'bad deal' in the beginning with the short-read technologies by Solexa, who were later acquired by Illumina, who then pushed it to market. Technicians developing the technology at the time were aware of the error rates but these were not revealed.
I know of this new paper, though I have not yet had time to investigate it: New synthetic-diploid benchmark for accurate variant calling evaluation. Run your pipeline on the benchmark.
I was Lead Bioinformatician during the enrollment of a clinical genetics NGS service in a lab of the National Health Service in the UK. Our benchmark was Sanger sequencing performed on the same samples that we used for validating the NGS pipeline that I introduced. We had 100% sensitivity to Sanger over our regions of interest (read these words carefully). I am not mentioning specificity here.
Take a look at a modified version of the pipeline here (parts 6 and 7 are the ones that stand out from other pipelines and greatly help to boost sensitivity to the gold standard): https://github.com/kevinblighe/ClinicalGradeDNAseq
In the USA, the ACMG (American College of Medical Genetics) is the body that usually sets standards in this area; whilst, in the UK and Europe, generally, you should look up NEQAS and the UKGTN.
Finally, I have compiled lecture notes on this topic, if you would be interested in having them.
Kevin
Sanger sequencing.
Thanks, Pierre. I want to validate the called SNPs bioinformatically, not via Sanger sequencing.
You'll need to explain a bit more if you expect a useful answer.
The main objective is to use the most widely supported variant calling validation methodologies and build on them for the purpose of advanced annotation and phenotype interpretation. I am trying to compile a list of metrics to validate against.
For example, I would like to look at SNP genotyping errors such as ignoring the reference allele, adding the reference allele, and other SNP calling errors.
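One way to tally these genotyping errors is sketched below in Python. It rests on an interpretation that is an assumption here: "ignoring the reference allele" is read as a true heterozygote (0/1) called homozygous alternate (1/1), and "adding the reference allele" as a true homozygous alternate (1/1) called heterozygous (0/1). It also assumes that the truth and called genotypes come from the same site with the same ALT allele ordering. The names parse_gt and classify, and the category labels, are hypothetical.

```python
import re
from collections import Counter


def parse_gt(gt):
    """Turn a VCF GT string ('0/1', '1|1') into a sorted tuple of allele
    indices, or None if any allele is missing ('.')."""
    alleles = re.split(r"[/|]", gt)
    if any(a == "." for a in alleles):
        return None
    return tuple(sorted(int(a) for a in alleles))  # phase is ignored


def classify(truth_gt, call_gt):
    """Label one truth/call genotype pair with an error category."""
    truth, call = parse_gt(truth_gt), parse_gt(call_gt)
    if truth is None or call is None:
        return "missing"
    if truth == call:
        return "match"
    truth_alts = set(truth) - {0}
    call_alts = set(call) - {0}
    if truth_alts and truth_alts == call_alts:
        # same ALT alleles; only the presence of the reference allele differs
        if 0 in truth and 0 not in call:
            return "ref_dropped"   # e.g. truth 0/1 called as 1/1
        if 0 in call and 0 not in truth:
            return "ref_added"     # e.g. truth 1/1 called as 0/1
    if not call_alts:
        return "false_negative"    # variant in truth, called hom-ref
    if not truth_alts:
        return "false_positive"    # hom-ref in truth, variant called
    return "other_mismatch"        # e.g. wrong ALT allele


# Toy (truth, call) pairs for illustration only
pairs = [("0/1", "0/1"), ("0/1", "1/1"), ("1/1", "0/1"),
         ("0/1", "0/0"), ("0/0", "0/1"), ("0/1", "0/2"), ("./.", "0/1")]
tally = Counter(classify(t, c) for t, c in pairs)
print(tally)
```

Each category can then be reported as a rate over all compared sites, alongside overall genotype concordance.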