I have tried downloading the 1000 Genomes phase 3 VCF files from http://www.internationalgenome.org and have used gzip to decompress them. I have done this on two different computers, and on one I get a substantially larger uncompressed file than on the other. I have repeated the download to see if it was a one-off error, but the same issue occurs.
Here are the file sizes for a few of the chromosomes:
Chromosome    File size 1    File size 2
chr1          188.3 MB       65.78 GB
chr3          203.4 MB       59.31 GB
chr5          5.3 GB         53.55 GB
All three of the .vcf.gz files were around 1 GB each before unzipping. I have the same version of gzip on both systems (though surely that could not cause such a divergent range of results anyway?). As you can see, those are wildly different file sizes, so I'd like to know which are correct and why this issue may be occurring. Many thanks.
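One more thought: would comparing checksums of the compressed files on the two machines, and testing their integrity, be a sensible first check? Something along these lines (the file name is just an example, not the exact one I downloaded):

    md5sum ALL.chr1.phase3.genotypes.vcf.gz   # should match on both machines if the downloads are identical
    gzip -t ALL.chr1.phase3.genotypes.vcf.gz  # reports an error if the archive is truncated or corrupted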
To turn a question back on you to get more information: why are you unzipping them in the first place? Many tasks can be done on the compressed files themselves, or the decompressed content can be piped to another program, and neither requires unzipping the file onto the hard disk.
Hi, thanks for your response. I want to annotate them (using snpEff) and then parse them using PyVCF. snpEff takes a VCF file as input, so I assumed I would have to unzip the files first. Are you suggesting running gzip and piping the output straight to snpEff, rather than doing the steps separately?
As far as I know SnpEff can also take gzipped files.
I have just downloaded all the VCF files onto a server using the following wget command:
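For reference, a single-chromosome download from the phase 3 release directory looks roughly like this; the FTP path and exact file name below are illustrative of the release layout and may differ from what I actually used:

    # example only: phase 3 VCFs are released per chromosome under the 20130502 release directory
    wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz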
I have tried running snpEff on the .gz file of one chromosome, and using zcat to unzip a different one. The file sizes are again far too large. The snpEff output file is currently 37 GB in size (code still running), whilst the original .gz file was only 1.1 GB.
The unzipped file is currently 21 GB in size (still unzipping), whilst the original .gz file was 1.3 GB. I do not understand how such small files can be unzipped to become so large?
Does anyone have any thoughts on this? Or can anyone advise an alternative method? I have tried to get variation data from biomaRt, but it keeps freezing when trying to obtain such large datasets, and the UCSC Table Browser cannot provide me with all the information that is within the 1000 Genomes VCF files.
You should not have to unzip the VCF.gz files; snpEff will read the compressed data directly, or you can pipe the data in (gunzip -c yourfile.vcf.gz | snpeff ...). See http://snpeff.sourceforge.net/SnpEff_manual.html
You can also gzip your output...
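Putting the two together, a minimal sketch, assuming the usual java -jar invocation and a GRCh37 database (the database name, memory setting, and file names below are placeholders):

    # annotate a gzipped VCF directly and compress the annotated output
    java -Xmx8g -jar snpEff.jar GRCh37.75 ALL.chr1.phase3.genotypes.vcf.gz | gzip > chr1.ann.vcf.gz

The annotated VCF goes to standard output, so nothing uncompressed ever needs to be written to disk.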
You should state exactly which files you are downloading.
I would say that you are not downloading the same files or that the download is incomplete/corrupted, hence the divergence.
A 1 GB file would not uncompress into either 188 MB or 65 GB. Hence neither size looks correct.
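If you want to see the true uncompressed size without writing anything to disk, you can stream the data through wc (the file name below is just a placeholder):

    # count the uncompressed bytes without creating the uncompressed file
    zcat ALL.chr1.phase3.genotypes.vcf.gz | wc -c

Running this on the same .gz file on both machines should give identical numbers if the downloads really are the same.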
Hi Istvan, thank you for your response. I am using the following command to download the files:
These are the phase 3 .vcf files from the available data section of this page: http://www.internationalgenome.org/data. As you say, neither size looks correct, but it seems odd that when I repeat the download on either machine the download size comes out the same each time.
The path in your command doesn't lead to a file but to a directory.
Apologies, I have edited it as appropriate.