Question

DP (read depth) in VCF files?

7.1 years ago

barah_chbib ▴ 10

In Galaxy, the default parameter for VCF filter is -f "DP>10". Why does a higher DP matter when calling genotypes in diploids?

VCF filter • 14k views

updated 4.2 years ago by Kevin Blighe 88k • written 7.1 years ago by barah_chbib ▴ 10

score 25 · Answer 1 · 2017-10-15

For heterozygous and homozygous germline variants, you don't need very high read depth. From a clinical perspective, the best range in which to call these variants is from a read depth of 18 up to about 70 or 80, with 30 as the sweet spot. You should not expect read depth above 30 to improve accuracy or precision; it may just introduce more error. 18 is the minimum recommended read depth at which to call variants in the UK clinical genetics scene, and we confirmed this through our validation work comparing NGS against Sanger sequencing in the National Health Service England.
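To make the depth argument concrete, here is a minimal sketch of the sampling arithmetic for a heterozygous site. It assumes each read shows the alternate allele with probability 0.5 and ignores sequencing error and mapping bias; the requirement of at least 3 supporting reads is an illustrative assumption, not a rule from any caller.

```python
from math import comb

def prob_at_least(k, n, p):
    """Binomial probability of seeing at least k alternate reads out of n,
    when each read carries the alternate allele with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that a true heterozygote (p = 0.5) shows at least 3 alternate reads
# at a few depths. The threshold of 3 is a hypothetical choice.
for depth in (10, 18, 30):
    print(depth, round(prob_at_least(3, depth, 0.5), 6))
```

Under these assumptions the chance is about 0.945 at depth 10 and above 0.999 at depth 18, and there is very little left to gain beyond 30.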

Research settings typically use 10 (or lower) as the cut-off because, being research, they can tolerate more error when results have no (or minimal) clinical implications. It is of course possible to call true variants from just 1 (homozygous) or 2 (heterozygous) reads, although you will not find many who recommend this.
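As a rough illustration of what a DP cut-off does, the sketch below keeps only VCF records whose INFO field carries a DP above a chosen threshold. It is not the Galaxy tool's implementation; the function names are hypothetical, and it reads the site-level INFO/DP only (per-sample depth in the FORMAT column is not considered).

```python
def info_depth(vcf_line):
    """Return the INFO/DP value of a tab-separated VCF data line, or None if absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    for entry in fields[7].split(";"):  # column 8 is INFO
        if entry.startswith("DP="):
            return int(entry[3:])
    return None

def filter_by_depth(lines, min_depth):
    """Keep header lines and records with DP strictly greater than min_depth."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # headers pass through untouched
            continue
        depth = info_depth(line)
        if depth is not None and depth > min_depth:
            kept.append(line)
    return kept
```

Calling filter_by_depth(lines, 10) mirrors the research-style cut-off, and filter_by_depth(lines, 17) keeps records with DP of 18 or more.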

In scenarios like cancer and circulating free DNA (cfDNA) analysis, where you'd expect to find variants / mutations at frequencies other than 50% or 100%, high read depth gives you a better chance of finding low-frequency mutations in the 'tumour bulk' biopsy that was sequenced. Thus, 30 is not ideal here. High depth is very useful because tumours consist of multiple clones, each with its own mutation profile. When you analyse a bulk tumour biopsy, you are in fact analysing many clones at the same time. If you have a primary tumour and a matched metastatic sample, you can then see how certain mutations become more (or less) frequent in the metastatic sample compared to the primary, and thus infer their role in metastasis.
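The need for more depth at low frequencies can be sketched numerically. The snippet below searches for the smallest depth at which a variant at a given allele frequency would be seen in at least a chosen number of reads with a chosen probability. It is a simple binomial model that ignores sequencing error; the function names and the example values (5% frequency, 95% target) are assumptions for illustration.

```python
from math import comb

def detection_probability(depth, vaf, min_alt_reads):
    """Probability of at least min_alt_reads alternate reads at this depth."""
    miss = sum(comb(depth, i) * vaf**i * (1 - vaf)**(depth - i)
               for i in range(min_alt_reads))
    return 1.0 - miss

def min_depth_for_detection(vaf, min_alt_reads, target_prob, max_depth=100000):
    """Smallest depth reaching target_prob, or None if not reached by max_depth."""
    for depth in range(min_alt_reads, max_depth + 1):
        if detection_probability(depth, vaf, min_alt_reads) >= target_prob:
            return depth
    return None

# Depth needed to see a 5% variant in at least 1 read with 95% probability.
print(min_depth_for_detection(0.05, 1, 0.95))
```

Even for a single supporting read, a 5% variant already needs a depth well above 30 in this model, and requiring more supporting reads pushes the number higher.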

Many variant callers will actually downsample your reads when they call variants, i.e., they only look at 500 or 1000 reads and then don't process any further. This is in part to save processing time and memory.
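Conceptually, downsampling at a position looks something like the sketch below: if more reads cover the site than a cap allows, a random subset is kept. This is only an illustration; real callers each have their own strategy, and the function name, the cap, and the fixed seed here are assumptions.

```python
import random

def downsample(reads, max_depth, seed=0):
    """Return at most max_depth reads, chosen at random without replacement."""
    reads = list(reads)
    if len(reads) <= max_depth:
        return reads  # already within the cap, nothing to drop
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(reads, max_depth)

# 2000 hypothetical read identifiers covering one position, capped at 500.
subset = downsample(range(2000), 500)
print(len(subset))
```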

As to why we need multiple reads, well, NGS is 'messy'. It has many inherent errors at each step in the process (assuming sequencing by synthesis, as is used by Illumina). Error can occur:

When reads are being sequenced in the sequencer: polymerase is known to add incorrect bases, and in the sequencer there are no base-excision DNA repair mechanisms to correct these (however, there is a PhiX control; see: Can phasing or pre-phase during basecall cause indel?)
When the base-calling software attempts to read the fluorescent signals to infer which base was added
When the aligner mis-aligns a read (due to the high level of sequence similarity that exists all across the human genome)
When the variant caller incorrectly calls or fails to call a variant
When quality scores are incorrect, or when programs interpret quality scores differently
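The last point in the list above can be shown in a few lines. A Phred score Q corresponds to an error probability of 10^(-Q/10), and FASTQ files store Q as an ASCII character with an offset of 33 (current standard) or 64 (some older Illumina pipelines). The sketch below, with hypothetical function names, shows how the same character gives very different scores if a program assumes the wrong offset.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base is wrong."""
    return 10 ** (-q / 10)

def decode_quality(char, offset=33):
    """Decode one FASTQ quality character using the given ASCII offset."""
    return ord(char) - offset

# The same character read under two encodings.
q33 = decode_quality("I", offset=33)  # Phred+33 interpretation
q64 = decode_quality("I", offset=64)  # Phred+64 interpretation
print(q33, phred_to_error_prob(q33))
print(q64, phred_to_error_prob(q64))
```

Read as Phred+33 the character "I" is Q40 (1 error in 10,000); read as Phred+64 it is Q9, which most filters would discard.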

Although I mentioned that you can call true variants from just 1 or 2 reads, this is simply not advised because the aligner will mis-align many reads and assign them a high mapping quality. In a targeted experiment, you'll see dozens, hundreds, or thousands of positions across the genome that are covered by just 1 read. Calling variants on these is a nightmare because you'll get tens, hundreds, or thousands of variant calls, as many of these alignments are errors and their bases won't exactly match the reference. This is why we also use BED files to home in on the regions that we initially targeted (and to filter out information from all others).
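Restricting calls to the targeted regions can be sketched as follows. This is a minimal illustration with hypothetical function names, not how any particular tool does it; the one detail worth care is that BED intervals are 0-based and half-open while VCF positions are 1-based.

```python
def parse_bed(lines):
    """Read BED lines into a dict of chromosome -> list of (start, end)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and browser/track headers
        chrom, start, end = line.split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions

def in_targets(chrom, pos, regions):
    """True if a 1-based VCF position falls inside any BED interval.

    BED is 0-based half-open, so the interval (start, end) covers the
    1-based positions start + 1 through end inclusive.
    """
    return any(start < pos <= end for start, end in regions.get(chrom, []))

targets = parse_bed(["chr1\t100\t200"])
print(in_targets("chr1", 150, targets), in_targets("chr1", 250, targets))
```

A linear scan like this is fine for a small panel; for large target sets an interval tree or a sorted search would be the more sensible choice.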

One must be aware, though, that even the gold standard, Sanger sequencing, has inherent errors. Error exists everywhere in technology, and we have to understand our tolerance of it and our thresholds for it.

For further reading, see: Sanger sequencing is no longer the gold standard?