With NGS, the problems are multiple and are compounded at every step, from the wet lab to the dry lab, i.e., errors accumulate along the entire workflow of an NGS run.
The sequencing-by-synthesis method that Illumina acquired from Solexa is error prone and, on its own, should not stand up in any clinical application. In fact, even if a short-read technology manages to faithfully sequence the DNA fed into the instrument, it will later struggle in the dry lab because no aligner can faithfully align short reads to the genome, owing to sequence similarity, pseudogenes, repetitive sequences, et cetera. Long-read sequencing technology faces other issues in the wet-lab part of the workflow, perhaps worse than those of short reads.
Please take a look at my answer here: Sanger sequencing is no longer the gold standard?
If you have some Illumina data and have followed any standard workflow, then just before the variant-calling step, when you have your BAMs, sub-sample the reads in these BAMs using Picard DownsampleSam and call variants separately on each sub-sampled BAM. Once you have each subset of variants, merge them all, i.e., derive a consensus. If I were implementing a clinical workflow, I'd downsample at 10%, 5%, or even 1% intervals, depending on available compute resources.
The above is the only way that I can consistently achieve 100% sensitivity between Illumina-based NGS data and Sanger sequencing, even using BCFtools mpileup.
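To make that concrete, here is a rough Python sketch of the downsample-then-call-then-merge idea. The file names, reference path, sub-sampling fractions, the samtools index step, and the treatment of the "consensus" as a simple union of calls are all illustrative placeholders, not a validated clinical pipeline:

```python
#!/usr/bin/env python3
"""Rough sketch of the downsample -> call -> merge idea described above.

File names, the reference FASTA path, the sub-sampling fractions, and the
choice of a simple union of calls as the "consensus" are illustrative
placeholders, not a validated clinical pipeline.
"""
import subprocess

BAM = "sample.bam"       # analysis-ready BAM, i.e. just before variant calling
REF = "reference.fa"     # indexed reference FASTA used for the alignment
FRACTIONS = [round(p / 10, 1) for p in range(9, 0, -1)]  # 0.9 .. 0.1 in 10% steps

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

vcfs = []
for p in FRACTIONS:
    tag = f"p{int(round(p * 100)):03d}"
    sub_bam = f"sub_{tag}.bam"
    vcf = f"calls_{tag}.vcf.gz"

    # 1. Sub-sample the reads with Picard DownsampleSam
    run(["picard", "DownsampleSam", f"I={BAM}", f"O={sub_bam}", f"P={p}"])
    run(["samtools", "index", sub_bam])

    # 2. Call variants separately on each sub-sampled BAM
    mpileup = subprocess.Popen(
        ["bcftools", "mpileup", "-f", REF, sub_bam], stdout=subprocess.PIPE)
    subprocess.run(["bcftools", "call", "-mv", "-Oz", "-o", vcf],
                   stdin=mpileup.stdout, check=True)
    mpileup.stdout.close()
    mpileup.wait()
    run(["bcftools", "index", vcf])
    vcfs.append(vcf)

# 3. Merge the per-subset call sets; here the "consensus" is taken as the
#    union of all sites seen in any subset (other merge rules are possible).
run(["bcftools", "concat", "-a", "-Oz", "-o", "union.vcf.gz"] + vcfs)
run(["bcftools", "sort", "-Oz", "-o", "union.sorted.vcf.gz", "union.vcf.gz"])
run(["bcftools", "norm", "-d", "all", "-Oz", "-o", "consensus.vcf.gz",
     "union.sorted.vcf.gz"])
```

How you merge the per-subset calls is a design choice: a union maximises sensitivity, whereas requiring a site to appear in several subsets trades sensitivity for specificity.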
You may think that I'm crazy, but lax regulation of applications is what brought down two Boeing 737 MAX aircraft. We don't need the same kind of disaster occurring in healthcare.
Kind regards,
Kevin
Do you mean studies like this one https://www.nature.com/articles/s41598-022-14395-4 ?
This is a phenomenal paper, potentially worthy of a general discussion on this forum, particularly because unsupervised dimensionality reduction (PCA, NMF, ICA) is a major component of nearly every bioinformatic analysis, and all of these methods will be sensitive to mismatches between population (or cluster) sizes and their contributions to variance. The paper has me somewhat worried about the very many single-cell datasets built on a "filter-PCA-cluster-revise filters-PCA-cluster" workflow.
Interesting paper, but it's focused on genotyping arrays, not WGS.
The paper is almost entirely off-topic, but worthy of wider discussion in a different context (hence my comment).