Hi, everybody. I have a huge VCF file (http://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_with_tair10_only_ACGTN.vcf.gz). It is 132 GB compressed (1135 accessions x 117 million SNPs + short indels). I want to convert it to SQL or CSV format for data modeling in NoSQL databases such as Apache Cassandra or MongoDB. But before that, I want to get a better understanding of its structure. Are there any tools that can show this data as a table?
Thank you,
Note that an uncompressed VCF is already a tab-separated table (TSV), preceded by some meta-information lines starting with ##, so you may not need to convert it to CSV at all. Since the file is so large, you might take a subset of it to view. For example, on Linux you can grab the first lines of the file with something like
gunzip -c largefile.vcf.gz | head -n 1000 > largefile_subset.vcf
and then open largefile_subset.vcf in a spreadsheet (Excel can probably even open a file this small :)).
Try vcf2tsv from vcflib. There are also several small scripts that can convert VCF information to tables.
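As a rough illustration of what such a conversion script does, here is a minimal Python sketch that streams a gzipped VCF and writes the first records to CSV. It is not vcf2tsv or any existing tool; the function names vcf_to_rows and vcf_gz_to_csv, the file names, and the 1000-record default are all made up for the example. It only assumes the standard VCF layout: meta-information lines starting with ##, one #CHROM header line, then tab-separated records.

```python
import csv
import gzip


def vcf_to_rows(lines, max_records=None):
    """Yield the column header row, then one list of fields per VCF record.

    `lines` is any iterable of text lines (an open file works).
    Hypothetical helper written for this example only.
    """
    records = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and ## meta-information lines
        if line.startswith("#CHROM"):
            # Column header: drop the leading '#' and split on tabs.
            yield line.lstrip("#").split("\t")
            continue
        if max_records is not None and records >= max_records:
            break  # stop early so the rest of the file is never read
        records += 1
        yield line.split("\t")


def vcf_gz_to_csv(vcf_path, csv_path, max_records=1000):
    """Write the header and the first `max_records` records to a CSV file."""
    # gzip.open in text mode decompresses on the fly, line by line,
    # so the whole 132 GB file is never held in memory.
    with gzip.open(vcf_path, "rt") as src, \
            open(csv_path, "w", newline="") as dst:
        csv.writer(dst).writerows(vcf_to_rows(src, max_records))
```

Calling vcf_gz_to_csv("largefile.vcf.gz", "largefile_subset.csv") would give a small CSV to inspect. Keep in mind that with 1135 accessions every row has well over a thousand columns, and this sketch leaves the INFO column and the per-sample cells exactly as they appear in the file rather than unpacking them.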
If you haven't seen Hail yet, it might be helpful for your end goal. It can load VCF files for Apache Spark analysis.