Say we take a 40x human whole-genome BAM file of HiSeq reads (~100 GB), call variants without annotating further, and create a VCF with every position called (even where the position matches the reference genome), then compress it. How big will the VCF and BCF files be?
(This question predates gVCFs but is left here for archival purposes.)
Do I understand this correctly? Every base in the reference genome would need to be a line in the VCF file?
Yep.
A VCF file can contain more than one patient. Are you talking about a single patient? If there are more, that will affect the size.
Yes, that is true. To give some background, this question was asked because we are facing many 'ref or no coverage' mysteries in our trios when samples are called individually. I was wondering whether a viable solution is simply to call all positions.
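
For a rough sense of scale, the size of an all-positions, single-sample VCF can be estimated with simple arithmetic: one line per reference base, times bytes per line, divided by a compression ratio. The sketch below is only a back-of-envelope calculation. The function name, the 60 bytes per line, and the 10x compression ratio are assumed placeholders, not measurements; only the genome length of roughly 3.1 billion bases is a known figure. BCF size is not modelled here.

```python
def estimate_all_sites_vcf_gb(
    n_positions: int = 3_100_000_000,   # approximate human reference length in bases
    bytes_per_line: int = 60,           # ASSUMPTION: average bytes per VCF record
    compression_ratio: float = 10.0,    # ASSUMPTION: raw size / compressed size
) -> tuple[float, float]:
    """Return (uncompressed_gb, compressed_gb) for one record per position."""
    raw_bytes = n_positions * bytes_per_line
    compressed_bytes = raw_bytes / compression_ratio
    return raw_bytes / 1e9, compressed_bytes / 1e9


if __name__ == "__main__":
    raw_gb, packed_gb = estimate_all_sites_vcf_gb()
    print(f"uncompressed: ~{raw_gb:.0f} GB, compressed: ~{packed_gb:.1f} GB")
```

Plugging in the average line length and compression ratio measured on a small region (a single chromosome arm, say) would give a much more trustworthy estimate than these defaults.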