Different Size VCF
4 months ago
Sd • 0

Hello. I have a total of 708 genomic-interval gVCFs. I ran GATK GenotypeGVCFs and then SelectVariants on each genomic interval to create VCFs. For one of the intervals, I noticed a significant decrease in file size between the GenotypeGVCFs output and the SelectVariants output. I have tried several things to find the reason for the change in file size.

chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz >>>>> 2.1GB

$ wc chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
  20064   23261679 2204917451 chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz

chr1-4410000_chr1-8821000.SelectVariants.vcf.gz >>>>> 698MB

$ wc chr1-4410000_chr1-8821000.SelectVariants.vcf.gz
  1920129  14569274 731773426 chr1-4410000_chr1-8821000.SelectVariants.vcf.gz 

Where is this file size difference coming from? What is happening to the VCF in SelectVariants?

One more thing: strangely, the total size of the 708 unmerged VCFs is 450GB for GenotypeGVCFs and 420GB for SelectVariants. However, after merging with MergeVcfs at compression level 6, both merged VCFs (GenotypeGVCFs and SelectVariants) are 261GB. Any thoughts on this?

GATK VCF compression • 334 views

Well, SelectVariants is used to filter out variants, so unless you are not filtering anything with it, I don't see where the problem is.

wc is not the right tool for getting the size of a binary file; use ls -l instead, for example.
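
A minimal sketch of how the two files could be compared more meaningfully, assuming standard gzip, grep, and wc are available. The line and word counts that wc prints for a compressed file are counts over the compressed bytes and say nothing about the number of variants; only its byte count matches the size on disk. The helper names file_bytes and count_records are made up for this illustration and are not part of GATK.

```shell
# Byte size of a file on disk (the same number as the size column of `ls -l`).
file_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Number of variant records in a gzip/bgzip-compressed VCF:
# decompress to stdout, then count the lines that are not header lines.
count_records() {
    gzip -dc "$1" | grep -vc '^#'
}

# Possible usage with the two files from the question:
#   file_bytes    chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
#   count_records chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
#   count_records chr1-4410000_chr1-8821000.SelectVariants.vcf.gz
```

If the record count drops between the two files, SelectVariants removed records; if it stays the same, the size difference would have to come from what is stored per record or from how the file was compressed.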
