Hi,
I analyzed a few human Exome-Seq data sets and noticed that their FASTQ files were around 2.5 GB each. I am now analyzing another set where the FASTQ files are around 900 MB. After aligning both to hg38 and following the same pipeline, I noticed at the variant-calling step (GATK HaplotypeCaller) that the older data set (2.5 GB FASTQ) yielded approximately 300,000 variants, while the current data set (900 MB FASTQ) yields only around 100,000.
I understand that higher coverage gives the caller more confidence to identify variants, but should I expect the number of variants to drop roughly linearly, i.e. about 3-fold fewer variants for about 3-fold fewer reads, or is there something I'm missing?
Thank you.
I would argue that it strongly depends on the variant caller. Tools like VarScan2, which use a statistical framework to calculate the probability that a certain genotype is present, lose power as read numbers decrease, so subsampling would reduce both the number and the confidence of the variant calls. I'm not sure how GATK calls variants, though.
That too - yes (i.e. the variant caller)
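To make the power argument concrete, here is a minimal sketch (my own illustration with an arbitrary "at least 3 ALT reads" rule, not GATK's or VarScan2's actual model) of how the chance of detecting a heterozygous site depends on depth:

```python
# Probability of seeing at least `min_alt_reads` ALT-supporting reads at a
# heterozygous site, modelling the ALT read count as Binomial(depth, 0.5).
# The threshold of 3 is an illustrative assumption, not any caller's real cutoff.
from math import comb

def het_detection_prob(depth, min_alt_reads=3, alt_fraction=0.5):
    return sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction)**(depth - k)
        for k in range(min_alt_reads, depth + 1)
    )

for depth in (5, 10, 20, 30, 60):
    print(f"depth {depth:>2}: P(detect het) ~ {het_detection_prob(depth):.3f}")
```

Detection probability climbs steeply at low depth and saturates near 1 well before typical exome coverage, so cutting reads by 3x costs few variants in well-covered regions but many in poorly covered ones; the overall change in variant count is therefore not expected to be strictly linear.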
Thank you both for the insight. It is still not clear to me, though, whether a decrease in FASTQ file size will reduce the number of variants linearly.
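One way to check this empirically with your own data is to downsample the deeper BAM to roughly a third of its reads, re-run the identical HaplotypeCaller step, and compare variant counts at matched depth. A minimal sketch, assuming samtools and GATK4 are on the PATH and using placeholder file names:

```python
# Downsample a BAM to ~33% of reads (samtools view -s SEED.FRACTION), re-call
# variants with the same HaplotypeCaller command, and count the resulting records.
# File names and the reference path are placeholders, not from the original post.
import gzip
import subprocess

bam_in = "deep_sample.bam"
bam_down = "deep_sample.down33.bam"
reference = "hg38.fa"
vcf_out = "deep_sample.down33.vcf.gz"

subprocess.run(["samtools", "view", "-s", "0.33", "-b", "-o", bam_down, bam_in], check=True)
subprocess.run(["samtools", "index", bam_down], check=True)
subprocess.run(["gatk", "HaplotypeCaller", "-R", reference, "-I", bam_down, "-O", vcf_out], check=True)

# Count non-header VCF records for a rough comparison with the full-depth run.
with gzip.open(vcf_out, "rt") as fh:
    print(sum(1 for line in fh if not line.startswith("#")))
```

If the downsampled run loses far fewer than two thirds of the calls, that would argue against a simple linear relationship for your samples.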