Show VCF data in a table
7.6 years ago
sarah.k • 0

Hi everybody. I have a huge VCF file (http://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_with_tair10_only_ACGTN.vcf.gz). It is 132 GB compressed (1,135 accessions x 117 million SNPs and short indels). I want to convert it to SQL or CSV format for data modeling in NoSQL databases such as Apache Cassandra or MongoDB. Before that, I want to get a better understanding of its structure. Are there any tools that show this data as a table?

Thank you,

vcf nosql sql csv • 7.4k views

Note that VCF is already a tabular, tab-separated (TSV) format once it is uncompressed, so you may not need to convert it to CSV at all. Since the file is so large, you could take a subset of it to view. For example, on Linux you can get a small sample of the file with something like gunzip -c largefile.vcf.gz | head -n 1000 > largefile_subset.vcf and then open largefile_subset.vcf in a spreadsheet (Excel can probably open something this small :))
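If you would rather do the subsetting in Python, here is a minimal sketch of the same idea. It differs slightly from the head command: it keeps every header line (those starting with #) and then the first n data records, so the subset always contains the column header row. The function name and file names are just placeholders I made up.

```python
import gzip


def write_vcf_subset(src_path, dst_path, n_records=1000):
    """Copy all header lines plus the first n_records data lines.

    src_path is a gzip-compressed VCF; dst_path is written as plain text.
    Returns the number of data records written.
    """
    kept = 0
    with gzip.open(src_path, "rt") as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith("#"):
                # Meta-information (##...) and the #CHROM column header row.
                dst.write(line)
                continue
            if kept >= n_records:
                # Stop early so the rest of the huge file is never read.
                break
            dst.write(line)
            kept += 1
    return kept
```

Usage would be something like write_vcf_subset("largefile.vcf.gz", "largefile_subset.vcf", 1000). It streams line by line, so memory use stays small no matter how big the input is.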


Try vcf2tsv from vcflib. There are several small scripts that can convert VCF information to tables.


If you haven't seen Hail yet, it might be helpful for your end goal. It can load VCF files for Apache Spark analysis.

7.6 years ago

The VCF format is described in the specification (https://samtools.github.io/hts-specs/VCFv4.3.pdf). As you can see, there are genotype columns whose data structure is specified in the FORMAT column, and the INFO column can hold different fields from row to row. So to parse it in depth you may need a good VCF parser, such as VariantsToTable, which is part of GATK.
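To make the structure concrete, here is a rough sketch in plain Python of how one data line breaks down. It is not a replacement for a real parser: it keeps every value as a string, ignores the header, and assumes the sample names are passed in by the caller. The function names and the example record in the usage note are made up for illustration.

```python
# The seven fixed columns that come before INFO on every VCF data line.
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]


def parse_info(info):
    """Turn 'DP=10;INDEL' into {'DP': '10', 'INDEL': True}."""
    if info == ".":
        return {}
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Keys without '=' are flags; store them as True.
        out[key] = value if sep else True
    return out


def parse_record(line, sample_names):
    """Split one tab-separated VCF data line into a nested dict."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED, cols[:7]))
    record["INFO"] = parse_info(cols[7])
    # FORMAT (column 9) names the colon-separated fields of each sample column.
    keys = cols[8].split(":") if len(cols) > 8 else []
    record["samples"] = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, cols[9:])
    }
    return record
```

For example, parse_record("1\t55\t.\tC\tT\t40\tPASS\tDP=6720\tGT:GQ:DP\t0|0:29:4", ["sampleA"]) gives a dict where record["samples"]["sampleA"]["GT"] is "0|0". The nested shape is also why a flat CSV is awkward here: the INFO keys vary per row and each sample carries several sub-fields.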

Before loading this dataset into a database you may want to filter it; you can use vcftools for this. Splitting it by chromosome will help speed up queries. You can also use vcftools to calculate many different summary statistics to understand your data better.
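If you do not have vcftools at hand, the per-chromosome split can also be sketched in plain Python. This is only an illustration of the idea, not what vcftools does internally: it writes uncompressed output, repeats the full header in every file, and the chr_<name>.vcf naming is my own assumption.

```python
import gzip
import os


def split_by_chromosome(src_path, out_dir):
    """Write one plain-text VCF per chromosome found in a gzipped VCF.

    Returns a dict mapping chromosome name to the number of records written.
    """
    os.makedirs(out_dir, exist_ok=True)
    header = []   # header lines, repeated at the top of every output file
    handles = {}  # chromosome name -> open output file
    counts = {}
    try:
        with gzip.open(src_path, "rt") as src:
            for line in src:
                if line.startswith("#"):
                    header.append(line)
                    continue
                # CHROM is the first tab-separated column.
                chrom = line.split("\t", 1)[0]
                if chrom not in handles:
                    path = os.path.join(out_dir, "chr_%s.vcf" % chrom)
                    handles[chrom] = open(path, "w")
                    handles[chrom].writelines(header)
                    counts[chrom] = 0
                handles[chrom].write(line)
                counts[chrom] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

One output file stays open per chromosome, which is fine for a genome with a handful of chromosomes. Keep in mind that the uncompressed pieces of a 132 GB compressed file will need a lot of disk space, so you would probably filter first.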
