What Is The Expected Size Of A Whole Genome VCF And BCF?
12.4 years ago

Say we take a 40x whole human genome BAM file of HiSeq reads (~100GB), call variants but do not annotate further, and create a VCF with every position called (even if that position matches the reference genome), then compress. How big will the VCF and BCF files be?

(this question pre-dated gVCFs but is left here for archival purposes)

Do I understand this correctly? Does every base in the reference genome need to be a line in the VCF file?

yep............

A VCF file can contain more than one patient. Are you talking about a single patient? If there are more, that will affect the size.

Yes, that is true. To give some background, this question was asked because we are facing many 'ref or no coverage' mysteries in our trios when samples are called individually. I was wondering whether a viable solution is simply to call all positions.

12.4 years ago
Rok ▴ 190

Under the assumption that each line will be similar to this one:

chr1    249250621    .    A    A    22    PASS        0/0

This means each line uses at most 45 bytes. Multiplied by the length of the human genome, this gives a VCF file with a maximum size of around 125 GB. The header is left out of the calculation since its size is insignificant compared to the rest of the file.
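
The arithmetic above can be sketched in a few lines of Python. This is only a back-of-envelope check, not a measurement: the genome length of about 3.1 billion positions is my assumption (the answer does not state which figure was used), and the example record is assumed to be tab-separated with an empty INFO column. The names `GENOME_LENGTH`, `MAX_BYTES_PER_LINE`, and `estimate_vcf_bytes` are made up for illustration.

```python
# Back-of-envelope size estimate for an uncompressed all-sites VCF.

# Assumption: roughly 3.1 billion positions in the human reference (gaps included).
GENOME_LENGTH = 3_100_000_000
# Upper bound per record, as stated in the answer.
MAX_BYTES_PER_LINE = 45

# The example record from the answer, assumed tab-separated with an empty INFO field.
fields = ["chr1", "249250621", ".", "A", "A", "22", "PASS", "", "0/0"]
example_line = "\t".join(fields) + "\n"
example_bytes = len(example_line.encode("ascii"))


def estimate_vcf_bytes(n_sites: int, bytes_per_line: int) -> int:
    """Return the body size in bytes, ignoring the header."""
    return n_sites * bytes_per_line


upper_bound = estimate_vcf_bytes(GENOME_LENGTH, MAX_BYTES_PER_LINE)

print(f"example record: {example_bytes} bytes")
print(f"upper bound: {upper_bound / 1e9:.1f} GB ({upper_bound / 2**30:.1f} GiB)")
```

With these assumptions the result lands in the same range as the figure quoted above. The exact number shifts with the genome length you plug in and with whether "GB" means 10^9 or 2^30 bytes.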

I don't know much about the BCF format or the effects of compression. A wild guess would be that compression will reduce the file size to under 30 GB.
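
One way to replace the wild guess with a rough number is to compress a sample of records and look at the ratio. The sketch below is only an illustration: it generates synthetic lines that mirror the layout of the example record (random base and quality, everything else constant) and uses Python's standard `gzip` module as a stand-in for the block-gzip compression normally applied to `.vcf.gz` files. Real calls will compress differently, and BCF is a binary format that this does not model, so running the same measurement on a slice of an actual VCF would be more informative. The function name `synthetic_all_sites_lines` is hypothetical.

```python
import gzip
import random


def synthetic_all_sites_lines(n_sites: int, seed: int = 0) -> bytes:
    """Build n_sites fake reference-call records shaped like the example line."""
    rng = random.Random(seed)
    lines = []
    for pos in range(1, n_sites + 1):
        base = rng.choice("ACGT")   # random reference base
        qual = rng.randint(10, 60)  # random quality value
        lines.append(f"chr1\t{pos}\t.\t{base}\t{base}\t{qual}\tPASS\t\t0/0\n")
    return "".join(lines).encode("ascii")


raw = synthetic_all_sites_lines(100_000)
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)

print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes, ratio: {ratio:.1f}x")
```

Multiplying the uncompressed estimate by the inverse of the measured ratio gives a compressed-size guess, with the caveat that synthetic data is probably not representative of real variant calls.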
