Question

show vcf data in a table

0

7.6 years ago

sarah.k • 0

Hi, everybody. I have a huge VCF file (http://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_with_tair10_only_ACGTN.vcf.gz). It is a 132 GB compressed VCF (1135 accessions x 117 million SNPs + short indels). I want to convert it to SQL or CSV format for data modeling in NoSQL databases such as Apache Cassandra or MongoDB. But before that, I want to get a better understanding of its structure. Are there any tools to show this data in a table?

Thank you,

vcf nosql sql csv • 7.5k views

updated 7.6 years ago by Petr Ponomarenko ★ 2.8k • written 7.6 years ago by sarah.k • 0

1

Note that VCF is already a tabular format (tab-separated values) once it is uncompressed, so you may not need to convert it to CSV at all. Since the file is so large, you might take a subset of it to view. For example, on Linux you can get a small sample of the file using something like gunzip -c largefile.vcf.gz | head -n 1000 > largefile_subset.vcf and then open largefile_subset.vcf in a spreadsheet (even Excel can probably open a file this small :))
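
If a spreadsheet is not handy, the same subsetting idea can be sketched in Python. This is only a rough sketch under some assumptions: `vcf_head` and `print_table` are hypothetical helper names, and the file is assumed to be gzip-readable text with `##` meta lines followed by a `#CHROM` header line, as the VCF spec describes.

```python
import gzip


def vcf_head(path, n_records=5):
    """Return (column_names, rows) for the first n_records variants of a gzipped VCF."""
    columns, rows = [], []
    # gzip.open streams the file, so only the beginning is ever decompressed.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines, not part of the table
            if line.startswith("#CHROM"):
                # The single header line names the fixed columns and the samples.
                columns = line.lstrip("#").rstrip("\n").split("\t")
                continue
            rows.append(line.rstrip("\n").split("\t"))
            if len(rows) >= n_records:
                break
    return columns, rows


def print_table(columns, rows, max_cols=10):
    """Print the first max_cols columns, padded so they line up."""
    table = [columns[:max_cols]] + [row[:max_cols] for row in rows]
    widths = [
        max(len(row[i]) for row in table if i < len(row))
        for i in range(len(table[0]))
    ]
    for row in table:
        print("  ".join(cell.ljust(w) for cell, w in zip(row, widths)))


# Usage (the path is a placeholder):
# columns, rows = vcf_head("largefile.vcf.gz", n_records=20)
# print_table(columns, rows)
```

With 1135 accessions each row has over a thousand columns, which is why `print_table` only shows the first few.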

7.6 years ago by cmdcolin ★ 4.0k

1

Try vcf2tsv from vcflib. There are also several small scripts that can convert VCF information to tables.
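
To give an idea of what such a small script might look like, here is a minimal Python sketch. It is not how vcf2tsv works internally; `parse_info` and `vcf_to_csv` are hypothetical names, and the INFO keys to export are something you would choose yourself after looking at the file's header.

```python
import csv

# The first seven fixed VCF columns; INFO (column 8) is expanded separately.
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]


def parse_info(info):
    """Turn 'DP=10;AC=2;DB' into {'DP': '10', 'AC': '2', 'DB': True}."""
    parsed = {}
    if info in (".", ""):
        return parsed
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True  # keys without '=' are flags
    return parsed


def vcf_to_csv(lines, out, info_keys):
    """Write the fixed columns plus the chosen INFO keys as CSV to `out`."""
    writer = csv.writer(out)
    writer.writerow(FIXED + list(info_keys))
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta lines and the header line
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])
        # Keys missing from a given row become empty cells.
        writer.writerow(fields[:7] + [info.get(key, "") for key in info_keys])
```

Because `lines` can be any iterable of text lines, the function can be fed from an open file handle and never needs the whole file in memory.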

7.6 years ago by cpad0112 21k

1

If you haven't seen Hail yet, it might be helpful for your end goal. It can load VCF files for Apache Spark analysis.

7.6 years ago by Robert Sicko ▴ 630

score 3 · Answer 1 · 2017-05-10

The VCF format is described at https://samtools.github.io/hts-specs/VCFv4.3.pdf and, as you can see, it has genotype columns whose internal structure is specified in the FORMAT column. Moreover, the INFO column can hold different data from row to row. So to parse it in depth you may need a good VCF parser such as VariantsToTable, which is part of GATK.
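
To make the FORMAT/genotype relationship concrete, here is a minimal Python sketch that unpacks one record into one dictionary per sample. The function name `iter_genotypes` and the output field names are my own choices for illustration, not part of any tool; a real parser such as the one mentioned above handles many more edge cases.

```python
def iter_genotypes(header_line, record_line):
    """Yield one dict per (variant, sample) pair from a VCF record."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    samples = columns[9:]  # sample names start after the FORMAT column
    fields = record_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")  # FORMAT, e.g. "GT:GQ:DP"
    for sample, raw in zip(samples, fields[9:]):
        # Each sample column uses the same ':'-separated layout as FORMAT.
        # Trailing values may be omitted, so zip() stops at the shorter list.
        call = dict(zip(keys, raw.split(":")))
        yield {
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4],
            "sample": sample,
            **call,
        }
```

Each yielded dictionary is a "long format" row, which is one possible shape for documents in a store like MongoDB. With 1135 accessions and 117 million sites this multiplies the row count enormously, so it is worth trying on a small subset first.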

Before loading this dataset into a database, you may want to filter it; vcftools can do this. Splitting the data by chromosome will help speed up queries. vcftools can also calculate many summary statistics that help you understand your data better.
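
If you want to see what splitting by chromosome amounts to, here is a minimal Python sketch. It is not a replacement for vcftools; `split_by_chromosome` and the `chr_<name>.vcf` file naming are hypothetical, and it assumes chromosome names are safe to use in file names (true for names like 1 to 5).

```python
import os


def split_by_chromosome(lines, out_dir):
    """Write one VCF per chromosome into out_dir, repeating the header in each."""
    header = []
    handles = {}
    try:
        for line in lines:
            if line.startswith("#"):
                header.append(line)  # meta lines and the #CHROM line
                continue
            chrom = line.split("\t", 1)[0]
            if chrom not in handles:
                # First record for this chromosome: open its file, copy the header.
                path = os.path.join(out_dir, f"chr_{chrom}.vcf")
                handles[chrom] = open(path, "w")
                handles[chrom].writelines(header)
            handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

The function streams line by line, so it can read from `gzip.open(path, "rt")` directly. The outputs here are uncompressed, so for the full dataset you would want compressed outputs or a filtered input.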