How large are 1000genomes vcf files?
7.4 years ago
spiral01 ▴ 110

I have tried downloading the 1000 Genomes phase 3 VCF files from http://www.internationalgenome.org and have used gzip to decompress them. I have done this on two different computers, and on one I get a substantially larger uncompressed file than on the other. I have repeated the downloads to see if it was a one-off error, but the same issue occurs.

Here are the file sizes for a couple of the chromosomes:

Chromosome    File size 1    File size 2
chr1          188.3 MB       65.78 GB
chr3          203.4 MB       59.31 GB
chr5          5.3 GB         53.55 GB

All three of the .vcf.gz files were around 1 GB each before unzipping. I have the same version of gzip on both systems (though surely that could not cause such a wide range of results anyway?). As you can see, those are wildly different file sizes, so I'd like to know which are correct, and why this issue may be occurring. Many thanks.


Maybe to return a question to you to get more info: why are you unzipping them in the first place? Many tasks can be done on the compressed files themselves, or the decompressed content can be piped to another program, so there is no need to unzip the file onto the hard disk.
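For example, here is a minimal sketch, assuming the release files are bgzip-compressed and come with .tbi indexes (the file name is a placeholder):

# query a region directly from the compressed, indexed file without writing a decompressed copy to disk
tabix -h ALL.chrN.phase3.vcf.gz 1:1000000-2000000 | less

# or stream the decompressed content straight into another tool
zcat ALL.chrN.phase3.vcf.gz | head -n 50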


Hi, thanks for your response. I want to annotate them (using SnpEff) and then parse them using PyVCF. SnpEff takes a VCF file as input, so I assumed I would have to unzip the file first. Are you suggesting running gzip and then piping the output straight to SnpEff, rather than doing the steps separately?


As far as I know, SnpEff can also take gzipped files.


I have just downloaded all the vcf files onto a server using the wget command:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502//*.gz

I have tried running SnpEff on the .gz file of one chromosome, and using zcat to unzip a different one. The file sizes are again far too large: the SnpEff output is currently 37 GB in size (the code is still running), whilst the original .gz file was only 1.1 GB.

The unzipped file is currently 21 GB in size (still unzipping), whilst the original .gz file was 1.3 GB. I do not understand how such small files can be unzipped to become so large.

Does anyone have any thoughts on this, or can anyone advise an alternative method? I have tried to get variation data from biomaRt, but it keeps freezing when trying to obtain such large datasets, and the UCSC Table Browser cannot provide me with all the information that is within the 1000 Genomes VCF files.


You should not have to unzip the VCF.gz files; SnpEff will read the compressed data directly, or you can pipe the data in (gunzip -c yourfile.vcf.gz | snpeff ...). See http://snpeff.sourceforge.net/SnpEff_manual.html
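A hedged sketch of both approaches (the jar path, memory setting, genome database name, and file names are examples, not taken from this thread):

# annotate the compressed VCF directly; SnpEff accepts .vcf.gz input
java -Xmx8g -jar snpEff.jar GRCh37.75 ALL.chr1.phase3.vcf.gz > chr1.ann.vcf

# or stream the decompressed data in on stdin ('-' as the input name, if your SnpEff version supports it)
gunzip -c ALL.chr1.phase3.vcf.gz | java -Xmx8g -jar snpEff.jar GRCh37.75 - > chr1.ann.vcf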


You can also gzip your output...
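For instance, a sketch along those lines (file names are placeholders), writing the annotated output as compressed VCF instead of a huge plain-text file:

# compress the annotated output on the fly
java -Xmx8g -jar snpEff.jar GRCh37.75 ALL.chr1.phase3.vcf.gz | bgzip > chr1.ann.vcf.gz

# index it if downstream tools need random access
tabix -p vcf chr1.ann.vcf.gz

As far as I know, PyVCF can also open a .vcf.gz directly, so the parsing step should not need an uncompressed file either.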


You should state exactly which files you are downloading.

I would say that you are not downloading the same files or that the download is incomplete/corrupted, hence the divergence.

A 1 GB file would not uncompress into either 188 MB or 65 GB, hence neither size looks correct.
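One way to rule out a truncated or corrupted download (a sketch; the file name is a placeholder, and the exact name of the checksum file published on the FTP site may differ):

# test that the gzip stream is intact
gzip -t ALL.chr1.phase3.vcf.gz && echo "gzip stream OK"

# compare against the checksums published alongside the release, if available
md5sum ALL.chr1.phase3.vcf.gz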


Hi Istvan, thank you for your response. I am using the following command to download the files:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502//*.gz

These are the phase 3 .vcf files from the available data section of this page: http://www.internationalgenome.org/data. As you say, neither size looks correct, but it seems odd that when I repeat the download on either machine, the size comes out the same each time.


The path in your command doesn't lead to a file but to a directory.


Apologies, I have edited it as appropriate.

7.4 years ago

I have downloaded and unpacked the file.

It is indeed 1.1 GB zipped and 61 GB unpacked.

The reason for this massive difference (and this rate of compression surprised me as well) is that most of the data consists of the allele representations, 0|0 or 1|0, with no other information in each column. Hence most of the file consists of long rows of 0|0s.

A somewhat simplified explanation is that compression works by finding repetitive elements and replacing them with a single instance plus a count of the repetitions. Since the file is extremely repetitive and carries little information in most blocks, it compresses really well.
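To see how extreme this can get, here is a tiny illustration on artificial data (not 1000 Genomes data, just a repetitive stream resembling the genotype columns):

# one million lines of "0|0" is about 4 MB uncompressed...
yes "0|0" | head -n 1000000 | wc -c

# ...but only a few kilobytes after gzip, i.e. hundreds of times smaller
yes "0|0" | head -n 1000000 | gzip -c | wc -c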

Alas, it is not clear how best to deal with this; perhaps you could filter this data down to a subset with bcftools.
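A hedged sketch of that kind of filtering (file and sample names are placeholders; check bcftools view --help for the options in your version):

# keep only site-level records, dropping the per-sample genotype columns
bcftools view -G ALL.chr1.phase3.vcf.gz -Oz -o chr1.sites.vcf.gz

# or restrict to a few samples and a region of interest (the -r option needs the .tbi index)
bcftools view -s HG00096,HG00097 -r 1:1-5000000 ALL.chr1.phase3.vcf.gz -Oz -o chr1.subset.vcf.gz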


Hi Istvan, many thanks. I used bcftools as per your recommendation and was able to easily parse out the allele representations, bringing the uncompressed file sizes way down to around 800 MB each.
