I would like to know what kind of SQL schema you would propose to store a VCF file. Is there a common way to proceed? I imagine a table "variant" containing chr, start, ref and alt, and a second table "sample" with an n-n relation. But how should genotypes be stored? How should the INFO field be stored?
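To make the idea concrete, here is a minimal sketch of such a schema using Python's built-in sqlite3 module. It is only an illustration under several assumptions: one row per ALT allele, INFO kept as raw text, and the genotype stored in the link table between variant and sample. The table names, column names and helper functions (`build_demo_db`, `genotypes_at`) are hypothetical.

```python
import sqlite3

# Hypothetical schema: the n-n relation between variant and sample
# is the "genotype" table, which also carries the GT string.
SCHEMA = """
CREATE TABLE variant (
    id    INTEGER PRIMARY KEY,
    chrom TEXT    NOT NULL,
    pos   INTEGER NOT NULL,
    ref   TEXT    NOT NULL,
    alt   TEXT    NOT NULL,   -- assumption: a single ALT allele per row
    info  TEXT,               -- assumption: raw INFO string, not parsed
    UNIQUE (chrom, pos, ref, alt)
);
CREATE TABLE sample (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE genotype (
    variant_id INTEGER NOT NULL REFERENCES variant(id),
    sample_id  INTEGER NOT NULL REFERENCES sample(id),
    gt         TEXT    NOT NULL,   -- e.g. "0/1"
    PRIMARY KEY (variant_id, sample_id)
);
"""


def build_demo_db():
    """Create an in-memory database with one variant and two samples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO variant (id, chrom, pos, ref, alt, info) "
        "VALUES (1, 'chr1', 12345, 'A', 'T', 'DP=20')"
    )
    conn.executemany(
        "INSERT INTO sample (id, name) VALUES (?, ?)",
        [(1, "S1"), (2, "S2")],
    )
    conn.executemany(
        "INSERT INTO genotype (variant_id, sample_id, gt) VALUES (?, ?, ?)",
        [(1, 1, "0/1"), (1, 2, "1/1")],
    )
    conn.commit()
    return conn


def genotypes_at(conn, chrom, pos):
    """Return (sample name, genotype) pairs for the variants at a position."""
    rows = conn.execute(
        "SELECT s.name, g.gt "
        "FROM genotype g "
        "JOIN variant v ON v.id = g.variant_id "
        "JOIN sample s ON s.id = g.sample_id "
        "WHERE v.chrom = ? AND v.pos = ? "
        "ORDER BY s.name",
        (chrom, pos),
    )
    return rows.fetchall()
```

Keeping INFO as a single text column is the simplest choice, but it pushes parsing to query time; a separate key/value table would be another option.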
Not so easy: there can be more than one ALT per variant, an ALT can be a symbolic allele, a very large string, etc.
I wrote a vcf2sql: https://github.com/lindenb/jvarkit/wiki/VCF2SQL but in the end it was useless. I found it easier to store only the paths to the tabix-indexed VCF files.
See also: Vcf And Mongodb
Is MongoDB a better alternative than PostgreSQL? What's your opinion? My goal is to perform set operations like: listVariantA - (ListVariantB | ListVariantC). I'm not sure a NoSQL database can do this job faster.
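For what it's worth, this kind of set operation maps directly onto SQL's EXCEPT and UNION. The sketch below uses Python's built-in sqlite3 module rather than PostgreSQL, and assumes a hypothetical table `variant_list` in which each variant is tagged with the name of the list it belongs to ('A', 'B' or 'C'). It says nothing about speed on real data.

```python
import sqlite3


def build_lists_db():
    """Create an in-memory table holding three small variant lists."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variant_list ("
        " list_name TEXT NOT NULL,"
        " chrom TEXT NOT NULL, pos INTEGER NOT NULL,"
        " ref TEXT NOT NULL, alt TEXT NOT NULL)"
    )
    rows = [
        ("A", "chr1", 100, "A", "T"),
        ("A", "chr1", 200, "G", "C"),
        ("A", "chr2", 300, "C", "G"),
        ("B", "chr1", 200, "G", "C"),
        ("C", "chr2", 300, "C", "G"),
        ("C", "chr3", 400, "T", "A"),
    ]
    conn.executemany("INSERT INTO variant_list VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def a_minus_b_or_c(conn):
    """Return variants in list A that are in neither list B nor list C."""
    query = """
        SELECT chrom, pos, ref, alt FROM variant_list WHERE list_name = 'A'
        EXCEPT
        SELECT chrom, pos, ref, alt FROM (
            SELECT chrom, pos, ref, alt FROM variant_list WHERE list_name = 'B'
            UNION
            SELECT chrom, pos, ref, alt FROM variant_list WHERE list_name = 'C'
        )
        ORDER BY chrom, pos
    """
    return conn.execute(query).fetchall()
```

The union is wrapped in a subquery because SQLite does not accept parentheses around the operands of a compound SELECT.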
GATK SelectVariants with "concordance" or "discordance" https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php
I know! But thanks!
What about duplicating lines? A > T,C becomes two lines: A > T and A > C.
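As a rough sketch of that idea in Python: the function below (a hypothetical helper, not part of any existing tool) splits the comma-separated ALT field into one tuple per allele. It does not normalize or left-align alleles and does not touch genotypes or INFO values, which would need their own handling.

```python
def split_multiallelic(chrom, pos, ref, alt_field):
    """Turn one multi-allelic record into one (chrom, pos, ref, alt) per ALT.

    alt_field is the raw VCF ALT column, e.g. "T,C".
    """
    return [(chrom, pos, ref, alt) for alt in alt_field.split(",")]
```

For example, `split_multiallelic("chr1", 100, "A", "T,C")` yields one tuple for A > T and one for A > C.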
What about het variants? How would you store a genotype like "1/2", i.e. "T/C"?
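One possible way around this, sketched here in Python, is to resolve the allele indices against the record's REF and full ALT list and store the resulting allele string. The function name `decode_genotype` is hypothetical, and the sketch assumes a GT string that uses a single separator ("/" or "|") throughout.

```python
def decode_genotype(gt, ref, alts):
    """Translate a GT string such as "1/2" into alleles such as "T/C".

    Index 0 is REF, index i >= 1 is the i-th ALT allele;
    "." (missing call) is kept as-is.
    """
    alleles = [ref] + list(alts)
    sep = "|" if "|" in gt else "/"  # phased vs unphased separator
    decoded = []
    for idx in gt.split(sep):
        decoded.append("." if idx == "." else alleles[int(idx)])
    return sep.join(decoded)
```

With REF "A" and ALT alleles "T" and "C", the genotype "1/2" decodes to "T/C", so it no longer depends on the order of the ALT column.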
Dear Pierre,
Could you kindly help me with this question: Filtering multisample VCF based on genotype using SnpSift filter
I've seen this asked many times and have never seen a good, compelling answer, which makes me suspect that many people think SQL is, for some reason, not a good solution and prefer tools that work with the flat file.