VCF is ok as a suitcase for small-scale variation and to a lesser extent, annotation. But you can't live out of your suitcase forever.
VCF isn't a database, and will never support region and sample queries at scale or at "web-speed" in the era of national biobanks. Even its usefulness in transmitting variants is not sustainable past a few thousand samples. Annotation can also be problematic given that everything needs to be serialized into the INFO field. The shift away from joint genotyping and toward single sample gVCFs as the preferred currency further muddies the waters.
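To make the INFO-serialization point concrete, here is a minimal pure-Python sketch (not a spec-complete parser) of the unpacking dance every VCF consumer ends up re-implementing. Note that everything arrives as text; the real types (Integer/Float/Flag, Number=A/R/G) live in the header, which the INFO string alone does not carry:

```python
def parse_info(info: str) -> dict:
    """Unpack a VCF INFO string like 'DP=100;AF=0.5,0.25;DB' into a dict.
    Values stay as strings: typing requires consulting the header lines."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            # multi-valued entries (e.g. Number=A) are just comma-joined text
            fields[key] = value.split(",") if "," in value else value
        else:
            fields[item] = True  # a presence-only Flag, like 'DB'
    return fields

print(parse_info("DP=100;AF=0.5,0.25;DB"))
# {'DP': '100', 'AF': ['0.5', '0.25'], 'DB': True}
```

A columnar or array-backed store avoids this entirely by giving each annotation its own typed, queryable column.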
There are four major categories of successors to VCF as a variant warehouse that are worth mentioning.
Spark-based (requires a Spark cluster to scale):
- hail.MatrixTable - based on Parquet. Hail powers a number of analyses on gnomAD, UK Biobank, and other large genomic datasets.
- Glow Spark Dataframe - based on Spark and DeltaLake, Glow offers GloWGR, a distributed version of the regenie GWAS package. Provides user-defined functions (UDFs) and variant normalization functions.
Cloud-vendor managed solutions:
Distributed SQL & NoSQL:
- OpenCGA - open-source project for storing variant data and associated metadata in MongoDB
- Snowflake - closed-source distributed SQL engine
Multidimensional array based:
- SciDB - closed-source platform. Hosts large datasets including UK Biobank.
- TileDB-VCF (requires a TileDB-Cloud account to scale) - an open-source Python package that uses serverless TileDB multidimensional arrays indexed on chr, pos, and sample. TileDB-VCF on TileDB-Cloud powers real-time queries for variant browsers as well as large notebook-based analyses that use task graphs in conjunction with UDFs. Disclaimer: I am the product manager for TileDB-VCF.
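As a rough illustration of what the array-based category buys you (this is a toy in-memory sketch of the indexing idea, not any product's actual API), the key move is indexing on (contig, position, sample) so that a region-plus-sample query becomes a slice rather than a full-file scan:

```python
import bisect
from collections import defaultdict

class VariantStore:
    """Toy store indexed on (contig, pos, sample) to illustrate sliced queries."""
    def __init__(self):
        self._pos = defaultdict(list)    # contig -> sorted list of positions
        self._data = defaultdict(dict)   # (contig, pos) -> {sample: genotype}

    def ingest(self, contig, pos, sample, genotype):
        positions = self._pos[contig]
        i = bisect.bisect_left(positions, pos)
        if i == len(positions) or positions[i] != pos:
            positions.insert(i, pos)     # keep positions sorted for binary search
        self._data[(contig, pos)][sample] = genotype

    def query(self, contig, start, end, samples=None):
        """Yield calls in [start, end], optionally restricted to a sample set."""
        positions = self._pos[contig]
        lo = bisect.bisect_left(positions, start)
        hi = bisect.bisect_right(positions, end)
        for pos in positions[lo:hi]:     # only touch the sliced region
            for sample, gt in self._data[(contig, pos)].items():
                if samples is None or sample in samples:
                    yield contig, pos, sample, gt

store = VariantStore()
store.ingest("chr1", 1000, "NA12878", "0/1")
store.ingest("chr1", 2000, "NA12878", "1/1")
store.ingest("chr1", 1000, "NA12891", "0/0")
print(list(store.query("chr1", 900, 1500, samples={"NA12878"})))
# [('chr1', 1000, 'NA12878', '0/1')]
```

The real systems above layer compression, tiling, and distributed execution on top of this access pattern, but the query shape is the same.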
These solutions have vastly different performance, flexibility, and portability characteristics, as well as different cost structures, infrastructure needs, and varying levels of support for gVCF ref/no-call ranges (the n+1 problem), SVs, and pangenomic graph-based representations. It seems likely the growing interest in multi-omics - combining analyses of genomic variation with transcriptomics, proteomics, cytomics, and imaging - will also shape the future of variant warehouses.
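To spell out the n+1 problem mentioned above: merging sample n+1 means resolving its state at every site already in the warehouse, and only a gVCF's ref/no-call ranges can answer that (a plain VCF records variant lines only). A hedged, simplified sketch of that resolution step:

```python
def resolve_sites(sites, gvcf_blocks):
    """Resolve a new sample's call at each existing cohort site.
    sites: positions already present in the warehouse.
    gvcf_blocks: [(start, end, state)] ranges from the new sample's gVCF,
    where state is a genotype like '0/0' for a reference block."""
    calls = {}
    for pos in sites:
        calls[pos] = "./."  # not covered by any block: unknown, not hom-ref
        for start, end, state in gvcf_blocks:
            if start <= pos <= end:
                calls[pos] = state
                break
    return calls

blocks = [(1, 999, "0/0"), (1000, 1000, "0/1"), (1001, 5000, "0/0")]
print(resolve_sites([500, 1000, 7000], blocks))
# {500: '0/0', 1000: '0/1', 7000: './.'}
```

Without the ref blocks, positions 500 and 7000 would be indistinguishable, which is exactly why warehouses that preserve ref/no-call ranges can add samples incrementally while flat VCF merges cannot.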
shameless LLM : https://ai.tinybio.cloud/chat
no, tinyliar, it isn't - ambiguous representations of equivalent sequence variation are a major headache for anyone who uses a VCF.
GREAT example of the superficiality of an LLM-based answer, here (which I have been looking for).
I know what it meant to say, but in reality it's one of the main problems with the format, both for the reason you state and because errors in the reference, and the non-universality of any single reference, limit understanding of the immediate context of a variant.
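To make the ambiguity concrete, here is a toy sketch of the trimming step of variant normalization, showing two spellings of the same 2-bp deletion converging to one canonical form. (Full normalization, as done by tools like bcftools norm, also left-aligns indels against the reference sequence, which this sketch omits.)

```python
def trim(pos, ref, alt):
    """Trim shared suffix then shared prefix from a REF/ALT pair,
    keeping at least one base on each side, per normalization convention."""
    # trim common suffix
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # trim common prefix, advancing the position as bases drop off the front
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Two different spellings of the same deletion of 'CA' from GCACA:
print(trim(100, "GCACA", "GCA"))  # (100, 'GCA', 'G')
print(trim(100, "GCA", "G"))      # (100, 'GCA', 'G') - same canonical form
```

Nothing in the VCF spec forces writers to emit the trimmed form, which is why downstream matching (databases, annotation, dedup) breaks unless everyone normalizes first.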
Says who? If anyone uses "ideal" in the context of the VCF format, or dares to say "no better file format can be conceived", they had better have clairvoyance powers.
Ram - this was a bit of a dirty tactic on my part: I stated it in the extreme to provoke disagreement. I agree with you - the irony is, despite VCF being the de facto standard, I don't think anyone would actually agree it is ideal, and to be honest, I was hoping to elicit responses like yours.
Personally, I do not think the .vcf format is ideal (honestly, I do not even think it is good)! But I wanted to avoid giving my own opinion till others have replied.