How to validate the SNP calling pipeline?

0

7.8 years ago

bioinforesearchquestions ▴ 370

How can I validate a SNP calling pipeline?

I am looking for SNP calling validation methodologies.

DNASeq variant calling SNP • 5.8k views

ADD COMMENT • link updated 7.8 years ago by Kevin Blighe 89k • written 7.8 years ago by bioinforesearchquestions ▴ 370

1

Sanger sequencing.

ADD REPLY • link 7.8 years ago by Pierre Lindenbaum 166k

0

Thanks, Pierre. I want to validate the called SNPs bioinformatically, not via Sanger sequencing.

ADD REPLY • link 7.8 years ago by bioinforesearchquestions ▴ 370

0

You'll need to explain a bit more if you expect a useful answer.

ADD REPLY • link 7.8 years ago by WouterDeCoster 48k

0

The main objective is to use the most widely supported variant calling validation methodologies and build on them for advanced annotation and phenotype interpretation. I am trying to compile a list of metrics to validate.

For example, I would like to look at SNP genotyping errors such as ignoring the reference allele, adding the reference allele, and other SNP calling errors.
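One way to make these error classes concrete is to tally them from paired truth and called genotypes. Below is a minimal Python sketch, assuming biallelic diploid sites, and assuming that "ignoring the reference allele" means a true heterozygote called as homozygous-alternate and "adding the reference allele" means the reverse. The function names, category labels, and toy genotype pairs are hypothetical.

```python
from collections import Counter


def alt_count(gt):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing call
    return sum(allele != "0" for allele in alleles)


def classify(truth_gt, called_gt):
    """Label one (truth, called) genotype pair; assumes biallelic diploid sites."""
    truth, called = alt_count(truth_gt), alt_count(called_gt)
    if truth is None or called is None:
        return "no call"
    if truth == called:
        return "concordant"
    if truth == 1 and called == 2:
        return "reference allele ignored"  # assumed meaning: het called hom-alt
    if truth == 2 and called == 1:
        return "reference allele added"  # assumed meaning: hom-alt called het
    if called == 0:
        return "variant missed"
    if truth == 0:
        return "false variant"
    return "other"


# Toy (truth, called) pairs, invented for illustration only.
pairs = [
    ("0/1", "0/1"),
    ("1|0", "0/1"),
    ("0/1", "1/1"),
    ("1/1", "0/1"),
    ("0/1", "0/0"),
    ("0/0", "0/1"),
    ("1/1", "./."),
]
tally = Counter(classify(truth, called) for truth, called in pairs)
for label, count in sorted(tally.items()):
    print(label, count)
```

Comparing by alternate-allele count ignores phasing, so `1|0` and `0/1` are treated as the same genotype here.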

ADD REPLY • link 7.8 years ago by bioinforesearchquestions ▴ 370

4

7.8 years ago

Chen Sun ★ 1.1k

The recommended benchmark dataset is from the NIST (National Institute of Standards and Technology) Genome in a Bottle (GIAB) consortium. They update it frequently with new data and sequencing technologies.

Paper: https://www.nature.com/articles/nbt.2835

Data download: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/

A recommended validation tool for comparing your results with the benchmark is VarMatch:

Paper: https://academic.oup.com/bioinformatics/article/33/9/1301/2736365

Tool download: https://github.com/medvedevgroup/varmatch

The recommended command to run VarMatch is:

./varmatch -b benchmark.vcf -q query.vcf -g ref.fa -o output -f

Here, benchmark.vcf is the benchmark data. If you use the GIAB benchmark above, it is HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf

query.vcf is your SNP calling result.

ref.fa is the reference genome sequence.
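For intuition about what a benchmark comparison reports, here is a minimal Python sketch that counts true positives, false positives, and false negatives by exact matching of (chrom, pos, ref, alt). This is not VarMatch's algorithm: as I understand it, VarMatch also matches variants that are written differently but describe the same change, which exact matching cannot do. The function names and the toy VCF records are made up for illustration.

```python
import io


def read_variants(handle):
    """Collect (chrom, pos, ref, alt) keys from a VCF text stream."""
    variants = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # one key per alternate allele
            variants.add((chrom, int(pos), ref, allele))
    return variants


def compare(benchmark, query):
    """Exact-match counts plus recall (sensitivity) and precision."""
    tp = len(benchmark & query)
    fp = len(query - benchmark)
    fn = len(benchmark - query)
    recall = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "recall": recall, "precision": precision}


HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
# Toy records, invented for illustration only.
BENCHMARK_VCF = HEADER + (
    "1\t100\t.\tA\tG\t.\tPASS\t.\n"
    "1\t200\t.\tC\tT\t.\tPASS\t.\n"
    "1\t300\t.\tG\tA\t.\tPASS\t.\n"
)
QUERY_VCF = HEADER + (
    "1\t100\t.\tA\tG\t.\tPASS\t.\n"
    "1\t200\t.\tC\tT\t.\tPASS\t.\n"
    "1\t400\t.\tT\tC\t.\tPASS\t.\n"
)

benchmark = read_variants(io.StringIO(BENCHMARK_VCF))
query = read_variants(io.StringIO(QUERY_VCF))
result = compare(benchmark, query)
print(result)
```

With these toy records there are 2 true positives, 1 false positive, and 1 false negative, so recall and precision are both 2/3. For real files, pass an open file handle to `read_variants` instead of `io.StringIO`.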

ADD COMMENT • link 7.8 years ago by Chen Sun ★ 1.1k

0

Thanks for the article and tools, Chen. I will read this paper and go over the tool for benchmarking. What are the standard strategies used to validate a variant calling pipeline?

ADD REPLY • link 7.8 years ago by bioinforesearchquestions ▴ 370

0

VarMatch provides 2x2x3 = 12 different strategy combinations for validation, and they are all simple and straightforward to understand. Under certain conditions, different strategies can lead to different conclusions about whether your pipeline is better than others. I would recommend the default strategy, using the -f parameter. But if you are really interested in the details of the different strategies, you can read the paper.

ADD REPLY • link 7.8 years ago by Chen Sun ★ 1.1k

0

Thank you for the comment. Where can I find the FASTQ files for my pipeline?

ADD REPLY • link 7.1 years ago by juhuvn • 0

1

Here are 300x-coverage NGS reads for human in FASTQ format: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/

ADD REPLY • link 7.0 years ago by Chen Sun ★ 1.1k

0

The issue that I have with this 'reference' dataset (GIAB) is that none of the variants in it are confirmed by the gold standard. Metrics are instead derived by comparing variant calls between multiple variant callers processing the same data. Over the genome scale, I believe the possibility is very high that all callers will be missing some variants that are in the GIAB samples. The NIST guidelines, therefore, are biased, and consist of comparisons between one erroneous analytical method and another.
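The concern can be illustrated with a deliberately simplified toy in Python. This is not GIAB's actual integration method; it only assumes a truth set built by majority agreement among callers, and the function name, variant keys, and call sets are all invented. A variant that every caller misses can never enter such a consensus, so no pipeline is ever penalised for missing it.

```python
from collections import Counter


def consensus(callsets, min_callers):
    """Variants reported by at least `min_callers` of the given call sets."""
    counts = Counter(variant for calls in callsets for variant in set(calls))
    return {variant for variant, n in counts.items() if n >= min_callers}


# Hypothetical variants truly present in the sample.
a = ("1", 100, "A", "G")
b = ("1", 200, "C", "T")
c = ("1", 300, "G", "A")
d = ("1", 400, "T", "C")  # sits in a region that is hard for every caller
truly_present = {a, b, c, d}

# Hypothetical output of three callers run on the same reads.
caller_1 = {a, b}
caller_2 = {a, b, c}
caller_3 = {a, c}

truth_set = consensus([caller_1, caller_2, caller_3], min_callers=2)
missed_by_all = truly_present - (caller_1 | caller_2 | caller_3)

print("consensus truth set:", sorted(truth_set))
print("absent from any consensus:", sorted(missed_by_all))
```

Here the consensus contains the first three variants, while the fourth is invisible to the benchmark.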

We were given a 'bad deal' in the beginning with the short-read technologies by Solexa, who were later acquired by Illumina, who then pushed it to market. Technicians developing the technology at the time were aware of the error rates but these were not revealed.

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

2

7.8 years ago

Istvan Albert 103k

I know of this new paper, though I have not yet had time to investigate. Run your pipeline on the benchmark.

New synthetic-diploid benchmark for accurate variant calling evaluation

https://www.biorxiv.org/content/early/2017/11/22/223297

ADD COMMENT • link 7.8 years ago by Istvan Albert 103k

0

Thanks for the article, Istvan. I will read this paper.

ADD REPLY • link 7.8 years ago by bioinforesearchquestions ▴ 370

2

7.8 years ago

Kevin Blighe 89k

I was Lead Bioinformatician during the enrollment of a clinical genetics NGS service in a lab of the National Health Service in the UK. Our benchmark was Sanger sequencing performed on the same samples that we used for validating the NGS pipeline that I introduced. We had 100% sensitivity to Sanger over our regions of interest (read these words carefully). I am not mentioning specificity here.
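To make the "over our regions of interest" qualifier concrete, here is a minimal Python sketch of sensitivity to Sanger-confirmed variants restricted to a set of regions. It is an illustration under stated assumptions, not the validation code that was used: the function names are hypothetical, the coordinates are toy values, and regions are assumed to be 1-based and inclusive.

```python
def in_regions(chrom, pos, regions):
    """True if (chrom, pos) falls inside any (chrom, start, end) region."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def sensitivity_in_roi(sanger, ngs, regions):
    """Fraction of Sanger-confirmed variants inside the regions that NGS also called."""
    confirmed = {v for v in sanger if in_regions(v[0], v[1], regions)}
    if not confirmed:
        raise ValueError("no Sanger-confirmed variants inside the regions of interest")
    return len(confirmed & ngs) / len(confirmed)


# Toy data, invented for illustration only.
regions = [("17", 1000, 2000)]
sanger = {
    ("17", 1500, "A", "G"),
    ("17", 1800, "C", "T"),
    ("17", 5000, "G", "A"),  # outside the regions of interest
}
ngs = {
    ("17", 1500, "A", "G"),
    ("17", 1800, "C", "T"),
    ("17", 1900, "T", "C"),  # extra NGS call, not confirmed by Sanger
}

sensitivity = sensitivity_in_roi(sanger, ngs, regions)
print(f"sensitivity over regions of interest: {sensitivity:.0%}")
```

In this toy case sensitivity is 100% even though NGS misses a Sanger variant outside the regions and reports an extra call inside them. The extra call is a question of specificity or precision, which this number says nothing about.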

Take a look at a modified version of the pipeline here (parts 6 and 7 are the ones that stand out from other pipelines and greatly help to boost sensitivity to the gold standard): https://github.com/kevinblighe/ClinicalGradeDNAseq

In the USA, the ACMG (American College of Medical Genetics) is the body that usually sets standards in this area; whilst, in the UK and Europe, generally, you should look up NEQAS and the UKGTN.

Finally, I have compiled lecture notes on this topic, if you would be interested in having them.

Kevin

ADD COMMENT • link 7.6 years ago by Kevin Blighe 89k

0

Thanks, Kevin. I would really appreciate it if you could share the lecture notes.

ADD REPLY • link 7.8 years ago by bioinforesearchquestions ▴ 370
