Hi,
I analyzed a few human Exome-Seq data sets and noticed that their FASTQ files were around 2.5 GB each. I am now analyzing another set where the FASTQ files are around 900 MB. After aligning both to hg38 and following the same pipeline, I noticed at the variant-calling step (GATK HaplotypeCaller) that the older data set (2.5 GB FASTQ) yielded approximately 300,000 variants, while the current data set (900 MB FASTQ) yields only around 100,000.
I understand that higher coverage gives the caller more confidence to identify variants, but should I expect the number of variants to drop roughly linearly, i.e. about 3-fold fewer variants for about 3-fold fewer reads, or is there something I'm missing?
Thank you.
I would argue that it strongly depends on the variant caller. Tools like VarScan2, which use a statistical framework to calculate the probability that a certain genotype is present, lose power as read numbers decrease, so subsampling would reduce both the number and the confidence of the variant calls. I'm not sure how GATK calls variants, though.
That too - yes (i.e. the variant caller)
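To make the power argument concrete, here is a minimal sketch (my own illustration with an arbitrary "at least 3 ALT reads" rule, not GATK's or VarScan2's actual model) of how the chance of detecting a heterozygous site depends on depth:

```python
# Probability of seeing at least `min_alt_reads` ALT-supporting reads at a
# heterozygous site, modelling the ALT read count as Binomial(depth, 0.5).
# The threshold of 3 is an illustrative assumption, not any caller's real cutoff.
from math import comb

def het_detection_prob(depth, min_alt_reads=3, alt_fraction=0.5):
    return sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction)**(depth - k)
        for k in range(min_alt_reads, depth + 1)
    )

for depth in (5, 10, 20, 30, 60):
    print(f"depth {depth:>2}: P(detect het) ~ {het_detection_prob(depth):.3f}")
```

Detection probability climbs steeply at low depth and saturates near 1 well before typical exome coverage, so cutting reads by 3x costs few variants in well-covered regions but many in poorly covered ones; the overall change in variant count is therefore not expected to be strictly linear.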
Thank you both for the insight. It is still not clear to me, though, whether a decrease in FASTQ file size will reduce the number of variants linearly.
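One way to check this empirically with your own data is to downsample the deeper BAM to roughly a third of its reads, re-run the identical HaplotypeCaller step, and compare variant counts at matched depth. A minimal sketch, assuming samtools and GATK4 are on the PATH and using placeholder file names:

```python
# Downsample a BAM to ~33% of reads (samtools view -s SEED.FRACTION), re-call
# variants with the same HaplotypeCaller command, and count the resulting records.
# File names and the reference path are placeholders, not from the original post.
import gzip
import subprocess

bam_in = "deep_sample.bam"
bam_down = "deep_sample.down33.bam"
reference = "hg38.fa"
vcf_out = "deep_sample.down33.vcf.gz"

subprocess.run(["samtools", "view", "-s", "0.33", "-b", "-o", bam_down, bam_in], check=True)
subprocess.run(["samtools", "index", bam_down], check=True)
subprocess.run(["gatk", "HaplotypeCaller", "-R", reference, "-I", bam_down, "-O", vcf_out], check=True)

# Count non-header VCF records for a rough comparison with the full-depth run.
with gzip.open(vcf_out, "rt") as fh:
    print(sum(1 for line in fh if not line.startswith("#")))
```

If the downsampled run loses far fewer than two thirds of the calls, that would argue against a simple linear relationship for your samples.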