I am currently working with the latest 1000 Genomes release, which is a large (>60 GB) .vcf.gz file. I am having difficulty processing it the way I used to process .vcf.gz files, so I would like to split it into smaller files.
My first idea is to split it by chromosome, but I have thoroughly checked the vcftools site and haven't found any way of doing such a split. I know I can extract the lines for a given chromosome with vcftools, but if I query this large file once per chromosome, wouldn't it be accessed (and hence read in full) 22 times for the 22 chromosomes I want?
I have a home-made Perl script that can do it by parsing the entire file and checking each line's contents, but I'm pretty sure it will be slow. I just wanted to know whether anyone could suggest something more elegant before I start processing the file and waiting for the results.
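To make the single-pass idea concrete, here is a minimal sketch in Python (for illustration only, not the Perl script mentioned above). It reads the compressed file once and sends each record to a per-chromosome output, so the input is never scanned 22 times. The function name `split_vcf_by_chrom`, the `wanted` filter, and the `<chrom>.vcf.gz` output naming are all assumptions made for this example.

```python
import gzip
import os


def split_vcf_by_chrom(vcf_gz_path, out_dir, wanted=None):
    """Split a gzipped VCF into one gzipped file per chromosome in one pass.

    wanted: optional set of chromosome names to keep (None keeps all).
    Returns a dict mapping chromosome name -> number of records written.
    """
    os.makedirs(out_dir, exist_ok=True)
    header = []    # '#' lines, copied to the top of every output file
    handles = {}   # chromosome name -> open output file
    counts = {}
    try:
        with gzip.open(vcf_gz_path, "rt") as vcf:
            for line in vcf:
                if line.startswith("#"):
                    header.append(line)
                    continue
                # CHROM is the first tab-separated column of a record
                chrom = line.split("\t", 1)[0]
                if wanted is not None and chrom not in wanted:
                    continue
                out = handles.get(chrom)
                if out is None:
                    # Hypothetical naming scheme: <chrom>.vcf.gz
                    path = os.path.join(out_dir, f"{chrom}.vcf.gz")
                    out = gzip.open(path, "wt")
                    out.writelines(header)
                    handles[chrom] = out
                    counts[chrom] = 0
                out.write(line)
                counts[chrom] += 1
    finally:
        for out in handles.values():
            out.close()
    return counts
```

Usage would look something like `split_vcf_by_chrom("input.vcf.gz", "by_chrom", wanted={str(i) for i in range(1, 23)})`, where `input.vcf.gz` is a placeholder path and the chromosomes are assumed to be named `1` to `22` rather than `chr1` to `chr22`. Note that the outputs are plain gzip, not bgzip, so they would need to be recompressed with bgzip before tabix could index them.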
http://samtools.sourceforge.net/mpileup.shtml