Question

what sql schema to store a vcf file ?

1

Entering edit mode

8.5 years ago

sacha ★ 2.4k

I would like to know what kind of sql schema do you propose to store a vcf file ? Is there a common way to proceed ? I imagine a table "variant" which contains : chr start ref alt and a second table "sample" with a n-n relation . But How to store genotype ? How to store info field ?

sql vcf variant annotation • 7.5k views

ADD COMMENT • link 8.5 years ago by sacha ★ 2.4k

2

Entering edit mode

I imagine a table "variant" which contains : chr start ref alt and a second table

not so easy, there is more than one ALT per variant... ALT can be a symbolic allele, a very large string, etc...

I wrote a vcf2sql : https://github.com/lindenb/jvarkit/wiki/VCF2SQL but in the end, it was useless. I found it easier to only store the path to the tabix-indexed VCF files

see also : Vcf And Mongodb

ADD REPLY • link 8.5 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Is MongoDB a better alternative than postgreSQL ? What's your opinion ? my goal is to perform some set operation like : listVariantA - ( ListVariantB | ListVariantC) . I m not sure Nosql database is able to do this job faster.

ADD REPLY • link 8.5 years ago by sacha ★ 2.4k

1

Entering edit mode

listVariantA - ( ListVariantB | ListVariantC)

GATK SelectVariants with "concordance" or "discordance" https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

ADD REPLY • link 8.5 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I know.. ! But Thanks!

ADD REPLY • link 8.5 years ago by sacha ★ 2.4k

0

Entering edit mode

What about duplicating lines ? A > T,C become : A > T A > C

ADD REPLY • link 8.5 years ago by sacha ★ 2.4k

0

Entering edit mode

what about het variants ? storing genotypes like "1/2" == "T/C"

ADD REPLY • link 8.5 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Dear Pierre,

Kindly help me with this question, Filtering multisample VCF based on genotype using SnpSift filter

ADD REPLY • link 8.5 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

I've seen this asked many times and have never seen a good, compelling answer. Which makes me suspect that many people think SQL is for some reason, not a good solution and they prefer to use tools which work with the flat file.

ADD REPLY • link 8.5 years ago by Neilfws 49k

score 3 · Answer 1 · 2016-07-11

3

Entering edit mode

8.5 years ago

andrew.j.skelton73 6.6k

I'd suggest you look at the Gemini database schema (readthedocs seems to be down at the moment, but I'm pretty sure that's the right page)

ADD COMMENT • link 8.5 years ago by andrew.j.skelton73 6.6k

0

Entering edit mode

Thanks ! I already look on this ! I get some ideas !

ADD REPLY • link 8.5 years ago by sacha ★ 2.4k

0

Entering edit mode

Thanks, Andrew and Pierre for your links.

Hi Sacha, MongoDB vs MySQL which is good enough to build VCF database for multiple samples?

ADD REPLY • link 8.5 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

that's a good question. MongoDB is really powerful to store Data as Document. It should be faster than sql to display data if you query has low complexity.
But I guess Sql is a better option if you need to perform complex query like Intersection . The main problem with the VCF file is optional data as key-value. It's easy to implement with mongodb because it a key:value database. But sql database needs fixed tables. For my project I use PostgreSQL witch support natively JSON type and Array Type .I store my key-value data into an Array type field. { { key, value}, {key, value}}

ADD REPLY • link 8.5 years ago by sacha ★ 2.4k

score 0 · Answer 2 · 2016-07-12

0

Entering edit mode

8.5 years ago

sacha ★ 2.4k

Attach there is my sql schema to store vcf. It's looks pretty similar than variant_tools database

Briefly : VCF is file . This file has samples. Each sample is attached to a variant . The variant database contains chr,pos,ref,alt as a unique key. info field in "sample_has_variant" contains optional values( Postgre Array type).
genotype field contains : 0 = homozygotes ref , 1 = heterozygote 2 = homozygote alt -1 = composite

Sql database schema

ADD COMMENT • link 8.5 years ago by sacha ★ 2.4k

0

Entering edit mode

Thanks, Sacha.

In the case of multiple sample VCF, do you vcf-subset each sample into a single sample VCF file?

Can we upload multisample VCF into a MySQL table by changing the number of columns dynamically?

ADD REPLY • link 8.5 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

I suggest you to check variant_tools software. They load multisample vcf file in an sqlite file. It works pretty .

ADD REPLY • link 8.5 years ago by sacha ★ 2.4k