With NGS, the problems are multiple and are compounded at every step, from the wet lab to the dry lab, i.e., errors accumulate along the entire workflow of an NGS run.
The sequencing-by-synthesis method that Illumina acquired from Solexa is error prone and, on its own, should not stand up in any clinical application. In fact, even if a short-read technology manages to faithfully sequence the DNA fed into the instrument, it will later struggle in the dry lab because no aligner can faithfully align short reads to the genome, owing to sequence similarity, pseudogenes, repetitive sequences, et cetera. Long-read sequencing technology faces other issues in the wet-lab part of the workflow, perhaps worse than those of short reads.
Please take a look at my answer here: Sanger sequencing is no longer the gold standard?
If you have some Illumina data and have followed any standard workflow, then just before the variant-calling step, when you have your BAMs, sub-sample the reads in these BAMs using Picard DownsampleSam and call variants separately on each sub-sampled BAM. Once you have each subset of variants, merge them all, i.e., derive a consensus. If I were implementing a clinical workflow, I'd downsample at 10%, 5%, or even 1% intervals, depending on available compute resources.
The above is the only way that I can consistently achieve 100% sensitivity between Illumina-based NGS data and Sanger sequencing, even using BCFtools mpileup.
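To make that concrete, here is a rough Python sketch of the downsample-then-call-then-merge idea. The file names, reference path, sub-sampling fractions, the samtools index step, and the treatment of the "consensus" as a simple union of calls are all illustrative placeholders, not a validated clinical pipeline:

```python
#!/usr/bin/env python3
"""Rough sketch of the downsample -> call -> merge idea described above.

File names, the reference FASTA path, the sub-sampling fractions, and the
choice of a simple union of calls as the "consensus" are illustrative
placeholders, not a validated clinical pipeline.
"""
import subprocess

BAM = "sample.bam"       # analysis-ready BAM, i.e. just before variant calling
REF = "reference.fa"     # indexed reference FASTA used for the alignment
FRACTIONS = [round(p / 10, 1) for p in range(9, 0, -1)]  # 0.9 .. 0.1 in 10% steps

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

vcfs = []
for p in FRACTIONS:
    tag = f"p{int(round(p * 100)):03d}"
    sub_bam = f"sub_{tag}.bam"
    vcf = f"calls_{tag}.vcf.gz"

    # 1. Sub-sample the reads with Picard DownsampleSam
    run(["picard", "DownsampleSam", f"I={BAM}", f"O={sub_bam}", f"P={p}"])
    run(["samtools", "index", sub_bam])

    # 2. Call variants separately on each sub-sampled BAM
    mpileup = subprocess.Popen(
        ["bcftools", "mpileup", "-f", REF, sub_bam], stdout=subprocess.PIPE)
    subprocess.run(["bcftools", "call", "-mv", "-Oz", "-o", vcf],
                   stdin=mpileup.stdout, check=True)
    mpileup.stdout.close()
    mpileup.wait()
    run(["bcftools", "index", vcf])
    vcfs.append(vcf)

# 3. Merge the per-subset call sets; here the "consensus" is taken as the
#    union of all sites seen in any subset (other merge rules are possible).
run(["bcftools", "concat", "-a", "-Oz", "-o", "union.vcf.gz"] + vcfs)
run(["bcftools", "sort", "-Oz", "-o", "union.sorted.vcf.gz", "union.vcf.gz"])
run(["bcftools", "norm", "-d", "all", "-Oz", "-o", "consensus.vcf.gz",
     "union.sorted.vcf.gz"])
```

How you merge the per-subset calls is a design choice: a union maximises sensitivity, whereas requiring a site to appear in several subsets trades sensitivity for specificity.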
You may think that I'm crazy, but lax regulation of applications is what brought down two Boeing 737 MAX aircraft. We don't need the same kind of disaster occurring in healthcare.
Kind regards,
Kevin
Do you mean studies like this one https://www.nature.com/articles/s41598-022-14395-4 ?
This is a phenomenal paper, potentially worthy of a general discussion on this forum, particularly because unsupervised dimensionality reduction (PCA, NMF, ICA) is a major component of nearly every bioinformatic analysis, and all of these methods will be sensitive to mismatches between population (or cluster) sizes and their contributions to variance. The paper has me somewhat worried about the very many single-cell datasets built on a "filter-PCA-cluster-revise filters-PCA-cluster" workflow.
Interesting paper, but it's focused on genotyping arrays, not WGS.
The paper is almost entirely off-topic, but worthy of wider discussion in a different context (hence my comment).