Different Size VCF
8 weeks ago
Sd • 0

Hello. I have gVCFs for 708 genomic intervals in total. I ran GATK GenotypeGVCFs and then SelectVariants on each genomic interval to create VCFs. For one of the intervals, I noticed a significant decrease in file size between the GenotypeGVCFs and SelectVariants outputs. I have tried several things to find the reason for the change in file size.

chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz >>>>> 2.1GB

$ wc chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
  20064   23261679 2204917451 chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz

chr1-4410000_chr1-8821000.SelectVariants.vcf.gz >>>>> 698MB

$ wc chr1-4410000_chr1-8821000.SelectVariants.vcf.gz
  1920129  14569274 731773426 chr1-4410000_chr1-8821000.SelectVariants.vcf.gz 

Where does this difference in file size come from? What is happening in the VCF produced by SelectVariants?

One more thing: it is odd that the 708 unmerged VCFs total 450GB for GenotypeGVCFs and 420GB for SelectVariants. However, after merging with MergeVcfs at compression level 6, the merged VCFs from GenotypeGVCFs and SelectVariants are both 261GB. Any thoughts on this?

GATK VCF compression • 291 views

Well, SelectVariants is used to filter out variants, so unless you are not filtering anything with SelectVariants, a smaller output is expected and I don't see where the problem is.

wc is not the right tool for inspecting a compressed (binary) file: its line and word counts are meaningless on gzip data. To get the file size, just use, for example, ls -l.
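To make the comparison concrete, here is a minimal sketch of how the two files could be compared. The helper names `vcf_size_bytes` and `vcf_record_count` are made up for illustration, and the sketch assumes the `.vcf.gz` files can be read by `gzip -cd`, which should hold for gzip and bgzip output.

```shell
# Size of the compressed file on disk, in bytes (5th column of `ls -l`).
vcf_size_bytes() {
    ls -l "$1" | awk '{print $5}'
}

# Number of variant records: decompress to stdout, then count the lines
# that are not header lines (header lines start with '#').
vcf_record_count() {
    gzip -cd "$1" | awk '!/^#/ {n++} END {print n+0}'
}
```

Running `vcf_record_count` on chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz and on chr1-4410000_chr1-8821000.SelectVariants.vcf.gz would show how many records SelectVariants removed. That is likely more informative than comparing compressed sizes alone.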

