How do people know to use at least 30X coverage in WGS?
8.2 years ago
DVA ▴ 630

Hello,

I have heard many times, from different sources, that if I'm doing WGS for SNV detection I should have at least 30X coverage after removing duplicates. Out of curiosity, how did scientists arrive at this number?

Did some studies compare the results from one sample at 100X coverage (or some other depth high enough to serve as a standard) with the same individual at 30X, and conclude that 30X does just as good a job? Thanks a lot.

sequencing wgs coverage • 10k views

Thank you very much.


In my opinion it really depends on what your research question is. If it is disease/clinical related, you want to be sure that a variant is really there, and you don't want the hassle of validating variants with Sanger sequencing, so 30X is relatively good coverage. Usually a heterozygosity threshold of <75% is used (i.e. the variant allele has to show up in roughly 25% of reads or more), so that would mean at least 7 reads are needed to call a variant in a piece of genome covered at 30X. For a longer discussion, see also this post: What Is Considered A Good Coverage Depth In Exon Capture Seq
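
The arithmetic above can be sketched with a simple binomial model. This is only an illustration: it assumes each read at a heterozygous site carries the variant allele with probability 0.5, and it ignores sequencing error and mapping bias. The function name and the 25% minimum allele fraction are assumptions for this sketch, not the rule of any particular variant caller.

```python
from math import comb

def prob_at_least(k, depth, p=0.5):
    """P(at least k of `depth` reads carry the variant allele), binomial model."""
    return sum(comb(depth, i) * p**i * (1 - p)**(depth - i)
               for i in range(k, depth + 1))

depth = 30
min_fraction = 0.25                    # assumed minimum variant allele fraction
min_reads = int(depth * min_fraction)  # 7 reads at 30X

# Chance that a true heterozygous site reaches the 7-read threshold
# at several local depths (same threshold kept for comparison).
for d in (10, 15, 20, 30):
    print(d, prob_at_least(min_reads, d))
```

Under these assumptions the threshold is almost always reached at 30X, while at much lower local depth a true heterozygous variant often falls short of it.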


Thanks a lot for the reply:)


Here's a more recent analysis of sensitivity vs. read depth for WGS and WXS: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-247


Thank you for the information


Another reference, about advised coverage in exome sequencing: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-195

We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5-15% of heterozygous and 1-4% of homozygous SNVs in the targeted regions will be missed.
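
To get a feel for how a mean depth relates to the local depths quoted above, here is a rough sketch that treats per-site depth as Poisson-distributed around the mean. That is an assumption, not the model used in the paper: real coverage (especially exome capture) is usually more uneven than Poisson, so this idealized model will likely understate the fraction of under-covered sites. The function names are made up for this example.

```python
from math import exp, factorial

def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(exp(-mean) * mean**i / factorial(i) for i in range(k + 1))

def fraction_below(local_depth, mean_depth):
    """Fraction of sites with fewer than `local_depth` reads (Poisson model)."""
    return poisson_cdf(local_depth - 1, mean_depth)

# Local depth thresholds taken from the quote: 13X (het) and 3X (hom).
for mean_depth in (20, 30, 40):
    print(mean_depth,
          fraction_below(13, mean_depth),
          fraction_below(3, mean_depth))
```

The point of the sketch is only the trend: raising the mean depth shrinks the fraction of sites that fall below the local depth needed for a heterozygous call.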

But as others have said, it really depends on what you are doing: de novo sequencing or resequencing, short or long reads, CNV detection or SNP detection, research or diagnostics, and so on.


I do SNP detection. Thanks a lot for the information.

8.1 years ago

The necessary coverage depends on the platform and run mode, too. Illumina's newer NextSeq platform, for example, has much lower quality and much less accurate quality scores than their top-quality MiSeq platform, as well as shorter reads. All three of those factors influence how much coverage is needed to accurately call variants. WGS needs lower coverage than exon-capture, though, because it has less bias. Using a NextSeq instead of a HiSeq/MiSeq might double your coverage target, and exon-capture might triple it.

Additionally, Illumina's newer software versions with quantized quality scores are simply not very good for calling variants, which would again increase the necessary coverage for a given confidence level. It's possible to recalibrate the quality scores which will restore the full quality-score range and thus make it possible to more-accurately distinguish SNVs from sequencing error, reducing the necessary coverage, but it's better to just select a platform that does not quantize quality scores in the first place. The newer 2-dye chemistries also seem to decrease quality, and patterned flow-cells decrease average insert size (longer inserts help resolve repeats), so the newer platforms with 2-dye chemistry or patterned flow-cells need more coverage for accurate variant calling.

I'm currently evaluating some NextSeq data from a fungus with 120x coverage. Some of the SNPs are present in 97% of reads; it's pretty obvious they are real. Some are present in 1 read only; they appear to be sequencing error. Some are present in around 25% of reads, with a kind of low average quality score. I'm really not sure about those. Are they real? Sequencing error? A collapsed 4-copy repeat in the assembly? If this were MiSeq or HiSeq 2500 data, it would be obvious. But with current NextSeq data, the lowest possible quality score is 14, which indicates over 95% confidence that the call is correct. I have no idea what they are. Other variants are scattered across the whole coverage scale, between 2x and 120x. With inaccurate calls and quality scores, it's impossible to accurately call any variants or their ploidy unless you do massive oversequencing, and 30x would absolutely not be sufficient for a haploid, let alone a diploid.
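
For reference, the "over 95% confidence" figure for Q14 follows from the standard Phred definition, Q = -10 * log10(P_error). A minimal sketch of the conversion (the function name is just for illustration):

```python
def phred_to_error(q):
    """Convert a Phred quality score to the implied error probability."""
    return 10 ** (-q / 10)

for q in (14, 20, 30):
    p_err = phred_to_error(q)
    print(f"Q{q}: error probability {p_err:.4f}, confidence {1 - p_err:.2%}")
```

Q14 corresponds to an error probability of about 0.04, i.e. roughly 96% nominal confidence, assuming the reported score is well calibrated.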


Illumina has been talking about quality binning for over six years. I have seen multiple lines of evidence from Illumina, Broad, Sanger and my own experiments that quality binning has little effect on the quality of variant calling for human data. I rarely work with NextSeq data. I do hear complaints about its data quality from time to time, but I also know people who can make variant calls of acceptable quality from it.

"Some are present in around 25% of reads, with a kind of low average quality score."

They may be caused by systematic sequencing errors. HiSeq X10 is getting worse in this respect, and NextSeq may be even worse. A good heuristic is to ignore low-quality bases (e.g. below Q20), because correlated errors tend to have lower base quality. GATK folks told me this ~8 years ago and I think they are right.
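
A minimal sketch of that heuristic, applied to a single pileup column. The bases and quality values below are made up for illustration, and the function name is hypothetical; real tools usually expose the same idea as a minimum base quality option.

```python
from collections import Counter

def allele_counts(bases, quals, min_q=20):
    """Count alleles at one site, ignoring bases with Phred quality below min_q."""
    return Counter(b for b, q in zip(bases, quals) if q >= min_q)

# Hypothetical pileup column: every G call happens to have low quality.
bases = "AAAAAAGGGA"
quals = [36, 36, 32, 36, 27, 36, 14, 14, 14, 36]

print(allele_counts(bases, quals, min_q=0))  # no filtering: A and G both counted
print(allele_counts(bases, quals))           # Q20 filter: the low-quality G calls drop out
```

In this toy column the apparent 30% alternate allele disappears once bases below Q20 are ignored, which is the kind of pattern a systematic error would produce.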
