How to validate the SNP calling pipeline?
6.9 years ago

How to validate the SNP calling pipeline.

I am looking for SNP calling validation methodologies.

DNASeq variant calling SNP • 4.8k views

Thanks, Pierre. I want to validate the called SNPs bioinformatically, not via Sanger sequencing.


You'll need to explain a bit more if you expect a useful answer.


The main objective is to use the most widely supported variant calling validation methodologies and build on them for advanced annotation and phenotype interpretation. I am trying to compile a list of metrics to validate.

For example, I would like to look at SNP genotyping errors such as ignoring the reference allele, adding the reference allele, and other SNP calling errors.
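
As a rough sketch of how such genotyping errors could be tallied, the snippet below compares a truth genotype with a called genotype at the same site. The category names and the interpretation are assumptions made for illustration: "ignoring the reference allele" is read here as a true heterozygote (0/1) called homozygous alternate (1/1), and "adding the reference allele" as the reverse. The function names are hypothetical and not taken from any tool.

```python
import re


def parse_gt(gt):
    """Return a sorted tuple of allele indices, or None if any allele is missing."""
    alleles = re.split(r"[/|]", gt)
    if any(a == "." for a in alleles):
        return None
    return tuple(sorted(int(a) for a in alleles))


def classify_genotype_error(truth_gt, call_gt):
    """Classify a called genotype against a truth genotype (VCF GT strings)."""
    truth, call = parse_gt(truth_gt), parse_gt(call_gt)
    if truth is None or call is None:
        return "missing"
    if truth == call:
        return "concordant"

    truth_alts = set(truth) - {0}
    call_alts = set(call) - {0}

    # Same alternate allele(s), but the reference allele differs.
    if truth_alts and call_alts == truth_alts:
        if 0 in truth and 0 not in call:
            return "ref_allele_dropped"  # e.g. truth 0/1, call 1/1
        if 0 not in truth and 0 in call:
            return "ref_allele_added"    # e.g. truth 1/1, call 0/1

    if not call_alts:
        return "false_negative"          # variant in truth, called as reference
    if not truth_alts:
        return "false_positive"          # reference in truth, called as variant
    return "other_mismatch"


for truth_gt, call_gt in [("0/1", "1/1"), ("1/1", "0/1"), ("0/1", "0/0")]:
    print(truth_gt, call_gt, classify_genotype_error(truth_gt, call_gt))
```

Phasing is ignored on purpose (1|0 and 0/1 are treated as the same genotype), since the question is about genotype correctness rather than phase.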

6.9 years ago
Chen Sun ★ 1.1k

The recommended benchmark dataset is from the NIST (National Institute of Standards and Technology) GIAB (Genome in a Bottle) project. They update it frequently with new data and sequencing technologies.

Paper: https://www.nature.com/articles/nbt.2835

Data download: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/

A recommended validation tool for comparing your results with the benchmark is VarMatch:

Paper: https://academic.oup.com/bioinformatics/article/33/9/1301/2736365

Tool download: https://github.com/medvedevgroup/varmatch

The recommended command to run VarMatch is:

./varmatch -b benchmark.vcf -q query.vcf -g ref.fa -o output -f

Here, benchmark.vcf is the benchmark data. If you use the GIAB benchmark above, it is HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf

query.vcf is your SNP calling result.

ref.fa is the reference genome sequence.
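
If the comparison is run from a script, the same command can be assembled programmatically. This is only a sketch: it uses exactly the flags shown in the command above (-b, -q, -g, -o, -f), the file names are placeholders, and the helper names build_varmatch_command and run_varmatch are hypothetical, not part of VarMatch.

```python
import shlex
import subprocess


def build_varmatch_command(benchmark_vcf, query_vcf, reference_fasta,
                           output_prefix, executable="./varmatch"):
    """Assemble the VarMatch command line as an argument list."""
    return [
        executable,
        "-b", benchmark_vcf,    # benchmark (truth) VCF
        "-q", query_vcf,        # your pipeline's calls
        "-g", reference_fasta,  # reference genome FASTA
        "-o", output_prefix,    # output prefix
        "-f",                   # flag recommended in the answer above
    ]


def run_varmatch(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


cmd = build_varmatch_command("benchmark.vcf", "query.vcf", "ref.fa", "output")
print(shlex.join(cmd))  # inspect the command before calling run_varmatch(cmd)
```

Passing the arguments as a list avoids shell quoting problems with long file names such as the GIAB benchmark VCF.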


Thanks for the article and tools, Chen. I will read the paper and go over the tool for benchmarking. What are the standard strategies used to validate a variant calling pipeline?


VarMatch provides 2x2x3 = 12 different strategy combinations for validation, all of which are simple and straightforward to understand. Under certain conditions, different strategies can lead to different conclusions about whether your pipeline is better than others. I would recommend the default strategy using the -f parameter. If you are interested in the details of the different strategies, you can read the paper.
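
For orientation, here is a minimal baseline that is not VarMatch's algorithm: it matches variants only when chromosome, position, REF and ALT are identical, then reports precision, recall and F1. It ignores genotypes and will miss variants that are equivalent but represented differently, which is the kind of case a dedicated matching tool is meant to handle. The function names and the toy VCF lines are made up for illustration.

```python
def load_variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines, one per ALT allele."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def benchmark_metrics(truth, query):
    """Exact-match precision, recall and F1 between two sets of variant keys."""
    tp = len(truth & query)
    fp = len(query - truth)
    fn = len(truth - query)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data (hypothetical): three truth variants, three calls, two shared.
truth_lines = ["#CHROM\tPOS\tID\tREF\tALT",
               "1\t100\t.\tA\tG", "1\t200\t.\tC\tT", "2\t50\t.\tG\tA"]
query_lines = ["#CHROM\tPOS\tID\tREF\tALT",
               "1\t100\t.\tA\tG", "1\t200\t.\tC\tT", "2\t75\t.\tT\tC"]

metrics = benchmark_metrics(load_variant_keys(truth_lines),
                            load_variant_keys(query_lines))
print(metrics)
```

For real files, the same functions can be fed from open("truth.vcf") and open("query.vcf"), restricted to the benchmark's high-confidence regions.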


Thank you for the comment. Where can I find FASTQ files for my pipeline?


Here are 300x-coverage NGS reads for human in FASTQ format: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/


The issue that I have with this 'reference' dataset (GIAB) is that none of the variants in it are confirmed by the gold standard. Metrics are instead derived by comparing variant calls between multiple variant callers processing the same data. Over the genome scale, I believe the possibility is very high that all callers will be missing some variants that are in the GIAB samples. The NIST guidelines, therefore, are biased, and consist of comparisons between one erroneous analytical method and another.

We were given a 'bad deal' in the beginning with the short-read technologies by Solexa, who were later acquired by Illumina, who then pushed it to market. Technicians developing the technology at the time were aware of the error rates but these were not revealed.

6.9 years ago

I know of this new paper, though I have not yet had time to investigate. Run your pipeline on the benchmark.

New synthetic-diploid benchmark for accurate variant calling evaluation

https://www.biorxiv.org/content/early/2017/11/22/223297


Thanks for the article, Istvan. I will read this paper.

6.9 years ago

I was Lead Bioinformatician during the enrollment of a clinical genetics NGS service in a lab of the National Health Service in the UK. Our benchmark was Sanger sequencing performed on the same samples that we used for validating the NGS pipeline that I introduced. We had 100% sensitivity to Sanger over our regions of interest (read these words carefully). I am not mentioning specificity here.
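
To make "sensitivity over our regions of interest" concrete, here is a small sketch of how it could be computed. It is an illustration under assumptions, not the code used in that validation: variants are (chrom, pos, ref, alt) tuples, regions are 1-based inclusive (chrom, start, end) tuples, and the function names are hypothetical. Specificity is deliberately not computed, since it would also require the confirmed-negative sites.

```python
def in_regions(chrom, pos, regions):
    """True if the position falls inside any 1-based inclusive region."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def sensitivity_in_regions(confirmed, calls, regions):
    """Fraction of confirmed variants inside the regions that the pipeline called.

    Returns None when no confirmed variant lies inside the regions.
    """
    confirmed_in = {v for v in confirmed if in_regions(v[0], v[1], regions)}
    if not confirmed_in:
        return None
    detected = confirmed_in & set(calls)
    return len(detected) / len(confirmed_in)


# Toy data (hypothetical coordinates and alleles).
regions = [("17", 1000, 2000)]
sanger_confirmed = [("17", 1500, "A", "G"), ("17", 1800, "C", "T"),
                    ("17", 5000, "G", "A")]   # last one is outside the regions
ngs_calls = [("17", 1500, "A", "G"), ("17", 1800, "C", "T"),
             ("17", 1900, "T", "C")]          # extra call does not affect sensitivity

print(sensitivity_in_regions(sanger_confirmed, ngs_calls, regions))
```

The extra NGS call at position 1900 leaves the result unchanged, which is why a sensitivity figure on its own says nothing about false positives.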

Take a look at a modified version of the pipeline here (parts 6 and 7 are those that stand out from other pipelines and greatly help to boost sensitivity to the gold standard): https://github.com/kevinblighe/ClinicalGradeDNAseq

In the USA, the ACMG (American College of Medical Genetics) is the body that usually sets standards in this area; whilst, in the UK and Europe, generally, you should look up NEQAS and the UKGTN.

Finally, I have compiled lecture notes on this topic, if you would be interested in having them.

Kevin


Thanks, Kevin. I would really appreciate it if you could share the lecture notes.
