Hello.
I have 708 genomic-interval gVCFs. I ran GATK GenotypeGVCFs and then SelectVariants on each genomic interval to create VCFs. For one of the intervals, I noticed a significant decrease in file size between the GenotypeGVCFs and SelectVariants outputs. I have tried several things to find the reason for the change in file size.
chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz >>>>> 2.1GB
$ wc chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
20064 23261679 2204917451 chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
chr1-4410000_chr1-8821000.SelectVariants.vcf.gz >>>>> 698MB
$ wc chr1-4410000_chr1-8821000.SelectVariants.vcf.gz
1920129 14569274 731773426 chr1-4410000_chr1-8821000.SelectVariants.vcf.gz
Where is this file size difference coming from? What is happening in the VCF from SelectVariants?
One more thing: it is strange that the total sizes of the 708 unmerged VCFs for GenotypeGVCFs and SelectVariants are 450GB and 420GB, respectively. However, after merging with MergeVcfs at compression level 6, the merged VCFs for GenotypeGVCFs and SelectVariants are both 261GB. Any thoughts on this?
Well, SelectVariants is used to filter out variants, so unless you are not filtering anything with SelectVariants, I don't understand where the problem is.
wc is not the right tool to get the size of a binary file; just use, for example, ls -l.
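To make the comparison concrete, here is a minimal sketch, assuming standard gzip, grep and awk are available. The function names file_bytes and vcf_records are made up for illustration. Note that on a compressed file the first two columns of wc (lines and words) are meaningless, because it is counting newlines and whitespace in the compressed bytes; only the third column (bytes) reflects the file size. To compare how many variants each file holds, decompress first and count the non-header lines.

```shell
# Size in bytes as reported in the size column of `ls -l`.
file_bytes() {
    ls -l "$1" | awk '{print $5}'
}

# Number of variant records in a gzip/bgzip-compressed VCF:
# decompress to stdout, then count lines that do not start with '#'.
vcf_records() {
    gzip -dc "$1" | grep -vc '^#' || true   # grep exits 1 when the count is 0
}

# Example usage (file names taken from the question):
#   file_bytes  chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
#   vcf_records chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
#   vcf_records chr1-4410000_chr1-8821000.SelectVariants.vcf.gz
```

If the record counts differ between the two files, SelectVariants removed records; if they are the same, the size difference has to come from what is stored per record.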