While I generally agree with Heng's points, we have found Tabix and vcftools very useful for some analyses, but a bit cumbersome for more intricate explorations of genetic variation. We find this to be especially true when one wants to put those variants in a broader context through comparisons to many different genome annotations.
Our solution, while not yet final, is a new tool we have been developing called gemini. Our hope is for gemini to become a standard framework for exploring genetic variation, both in family-based studies of disease and in broader population-genetic studies. The basic gist is that you load a VCF file into a gemini database (SQLite is the backend, for portability and flexibility). As each variant is read from the VCF, it is annotated via comparisons to many different genome annotation files (e.g., ENCODE, dbSNP, ESP, UCSC, ClinVar, KEGG); Tabix is used to expedite these comparisons. The variants and their associated annotations are stored in a variants table. The attractive aspect to us is that the database framework itself is the API: once the data is loaded, one can ask quite complex questions of one's data. With the help of Brad Chapman and Rory Kirchner, we have parallelized the loading step, which is by far the slowest part. To use multiple cores, one would do:
gemini load -v my.vcf --cores 20 my.db
One can also farm the work out to LSF, SGE, and Torque clusters (a sketch of that follows the query example below). Once loaded, one can query the database via the command line:
gemini query -q "select chrom, start, end, ref, alt, \
aaf, hwe, in_dbsnp, is_lof, impact, num_het \
from variants" \
my.db
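As for farming the loading out to a cluster: a minimal sketch, assuming the --scheduler and --queue options that gemini load exposes for this (the exact flags may differ by version, so check the docs; "my_queue" is just a placeholder queue name):

# submit the loading work to an LSF queue (sge and torque work similarly)
gemini load -v my.vcf --cores 20 --scheduler lsf --queue my_queue my.db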
Also, we represent sample genotype information as compressed numpy arrays stored in binary BLOB columns, both to minimize storage and to allow scalability (as opposed to a separate genotypes _table_ whose row count would be N variants * M samples: bad). As such, we have extended the SQL framework to allow struct-like access to the genotype information. For example, the following query finds rare LoF variants meeting an autosomal recessive inheritance model (note that the --gt-filter option uses Python, not SQL, syntax):
gemini query -q "select chrom, start, end, \
                 ref, alt, gene, \
                 impact, aaf, gts.kid \
                 from variants \
                 where in_dbsnp = 0 \
                 and aaf < 0.01 \
                 and in_omim = 1 \
                 and is_lof = 1" \
             --gt-filter "gt_types.mom == HET and \
                          gt_types.dad == HET and \
                          gt_types.kid == HOM_ALT" \
             my.db
There is also a Python interface that allows one to write custom Python scripts against the gemini DB. An example script can be found at: http://gist.github.com/arq5x/5236411.
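Roughly, such a script looks like this (a minimal sketch based on the GeminiQuery class used in that gist; treat the details as approximate and defer to the gist and the docs):

from gemini import GeminiQuery

# connect to a database previously built with `gemini load`
gq = GeminiQuery("my.db")

# submit any SQL against the variants table
gq.run("select chrom, start, end, ref, alt, gene from variants where is_lof = 1")

# each row supports dict-style access by column name
for row in gq:
    print row['chrom'], row['start'], row['end'], row['gene']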
There are also many built-in tools for things such as finding compound hets, de novo mutations, protein-protein interactions, pathway analysis, etc. For example:
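(The subcommand names below are from the gemini docs; the pathways options in particular are from memory and worth verifying.)

# screen for candidate compound heterozygotes
gemini comp_hets my.db

# screen for putative de novo mutations
# (family structure comes from a PED file supplied at load time)
gemini de_novo my.db

# map variants and genes to known pathways
gemini pathways -v 68 my.db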
Gemini scales rather well. We recently loaded the entire 1000 Genomes VCF (1,092 samples) in 26 hours using 30 cores, and the queries are surprisingly fast. While this scale is not our current focus (we are more focused on medical genetics), it is encouraging to see it work well with hundreds of individuals.
If interested, check out the documentation. Comments welcome.
related:
Do people import VCF files into databases?
Somewhat similar, but that question deals with whether or not to import VCF at all. I am saying that I am importing it into a DBMS and was wondering which system would be optimal.
But isn't that still a very similar question? It may have the word "import" in it, but importing also implies making use of a database system to represent VCF files. Your question as you pose it feels more nebulous. The evaluation of a database system depends solely on the queries that one wishes to perform, and from this question we don't know what your needs are.
True, but that is why I am trying to see what other people's needs have been in the past, and which database systems worked for them and why. I am in a new division and have been asked to build out a VCF database for queries, so the questions that will be asked are not yet well known.
I think the answer here depends a lot on what sorts of questions you are asking. What do you want to get out of the data? VCF records have always seemed appropriate for a document store (e.g., Cassandra or the interesting-but-new HyperDex), but that depends entirely on the questions you want to ask of the data.