Forum:Is there a file format better suited for the era of pangenomics than the .vcf? What are its attributes?
4
6
17 months ago
LauferVA 4.5k

Consider the following (deliberately extreme) statement about the .vcf file format (.pdf of vcf specs at https://samtools.github.io).

.vcf is the ideal file format for genomic information; no better file format can be conceived whether in the era of singular linear reference genomes, or in the era of pangenomics to come.

1) Most simply, agree or disagree, and why?

2) If you do disagree, does that disagreement have anything to do with the shift from a singular, linear reference to a pangenome reference, or not?

3) Assuming you disagree, what would you recommend be used in place of a .vcf file?

4) Finally, whether you agree or not, what examples would you furnish to support your view?

VCF file-format genomics • 5.7k views
ADD COMMENT
1

shameless LLM : https://ai.tinybio.cloud/chat

The statement "VCF is the ideal file format for genomic information; no better file format can be conceived whether in the era of singular linear reference genomes, or in the era of pangenomics to come" is not entirely true. While VCF (Variant Call Format) is a widely used and standardized format for representing SNP, indel, and structural variation calls, it has some limitations, especially when it comes to storing pangenomic data.

Pros of using VCF:

  • Standardized format: VCF is widely adopted and supported by many bioinformatics tools, making it easy to integrate into data processing pipelines.
  • Explicit representation: VCF is very explicit about the exact type and sequence of variation, as well as the genotypes of multiple samples for this variation.

Cons of using VCF for pangenomics data:

  • Scalability: VCF was designed for linear reference genomes and may not scale well for pangenomic data, which can include multiple reference genomes and graph-based representations.
  • Limited representation: VCF may not be able to represent complex genomic variations and rearrangements that can be found in pangenomic data.

As an alternative to VCF, you can consider using more advanced file formats like GFA (Graphical Fragment Assembly) or VG (Variant Graph). These formats are designed to handle pangenomic data and can represent complex genomic variations and graph-based structures.

For example, a simple GFA file representing a small genomic variation graph can look like this:

S 1 ACGT
S 2 TGCA
S 3 CAGT
L 1 + 2 + 2M
L 1 + 3 + 2M
L 2 + 3 + 2M

In this example, there are three segments (S lines) representing genomic sequences and three links (L lines) representing connections between the segments. This format can be extended to represent more complex pangenomic data.
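To make the structure of that example concrete, here is a minimal Python sketch that reads the same S and L lines into a dictionary of segments and an adjacency list. It assumes GFA1 conventions (tab-delimited fields, whereas the listing above is shown with spaces), and `parse_gfa` is a hypothetical helper written for illustration, not part of any GFA library.

```python
from collections import defaultdict

# The example graph from above, written with the tab delimiters GFA1 uses.
GFA_TEXT = """\
S\t1\tACGT
S\t2\tTGCA
S\t3\tCAGT
L\t1\t+\t2\t+\t2M
L\t1\t+\t3\t+\t2M
L\t2\t+\t3\t+\t2M
"""


def parse_gfa(text):
    """Return (segments, links) built from the S and L lines of GFA1 text."""
    segments = {}              # segment name -> sequence
    links = defaultdict(list)  # source segment -> list of outgoing links
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            src, src_orient, dst, dst_orient, overlap = fields[1:6]
            links[src].append((dst, src_orient, dst_orient, overlap))
        # Other record types (H, P, ...) are ignored in this sketch.
    return segments, dict(links)


segments, links = parse_gfa(GFA_TEXT)
```

Segment 1 ends up with two outgoing links (to 2 and to 3), which is the branch point that a single linear reference cannot express directly.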

In conclusion, while VCF is a widely used and standardized format for genomic information, it may not be the best choice for pangenomic data. More advanced file formats like GFA or VG can provide better support for complex genomic variations and graph-based representations.

ADD REPLY
2

"Explicit representation: VCF is very explicit about the exact type and sequence of variation"

no, tinyliar, it isn't - ambiguous representations of equivalent sequence variation are a major headache for anyone who uses a VCF.
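A small sketch of that headache, assuming a made-up 10 bp reference containing a CA repeat: three different POS/REF/ALT records all delete one CA unit and yield the same haplotype. The `normalize` function below is a simplified, biallelic-only take on the usual left-align-and-trim idea (the kind of thing `bcftools norm` does); it is illustrative, not a reimplementation of any tool.

```python
REFERENCE = "GGCACACATT"  # hypothetical contig; positions are 1-based


def apply_variant(reference, pos, ref, alt):
    """Return the haplotype obtained by substituting ref with alt at pos."""
    assert reference[pos - 1:pos - 1 + len(ref)] == ref
    return reference[:pos - 1] + alt + reference[pos - 1 + len(ref):]


def normalize(pos, ref, alt, reference):
    """Left-align and trim a biallelic variant (simplified sketch)."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Three records that describe the same deletion of one "CA" unit.
records = [(2, "GCA", "G"), (3, "CAC", "C"), (6, "ACA", "A")]
haplotypes = {apply_variant(REFERENCE, *r) for r in records}
normalized = {normalize(*r, REFERENCE) for r in records}
```

Here `haplotypes` contains the single sequence `GGCACATT`, and all three records normalize to `(2, "GCA", "G")`. Until records are normalized this way, a plain text comparison of two VCFs would treat them as three different variants.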

ADD REPLY
0

GREAT example of the superficiality of an LLM-based answer here (which I have been looking for).

i know what it meant to say, but in reality it's one of the main problems with the format, both for the reason you state and because errors in the reference and the non-universality of any single reference limit understanding of the immediate context of a variant.

ADD REPLY
0

ChatGPT said

VCF files can be extended to represent pangenome variant data by including information specific to each haplotype or genome sequence. This can be done using the sample or individual column headers to indicate different haplotypes or genomes in the pangenome.

ADD REPLY
0

.vcf is the ideal file format for genomic information; no better file format can be conceived whether in the era of singular linear reference genomes, or in the era of pangenomics to come.

Says who? If anyone uses "ideal" in the context of the VCF format, or dares to say "no better file format can be conceived", they had better have clairvoyant powers.

ADD REPLY
0

Ram - this was a bit of a dirty tactic on my part: I stated it in the extreme to provoke disagreement. I agree with you - the irony is, despite VCF being the de facto standard, I don't think anyone would actually agree it is ideal, and to be honest, I was hoping to elicit responses like yours.

Personally, I do not think the .vcf format is ideal (honestly, I do not even think it is good)! But I wanted to avoid giving my own opinion till others have replied.

ADD REPLY
8
17 months ago

VCF is ok as a suitcase for small-scale variation and to a lesser extent, annotation. But you can't live out of your suitcase forever.

VCF isn't a database, and will never support region and sample queries at scale or at "web-speed" in the era of national biobanks. Even its usefulness in transmitting variants is not sustainable past a few thousand samples. Annotation can also be problematic given that everything needs to be serialized into the INFO field. The shift away from joint genotyping and toward single sample gVCFs as the preferred currency further muddies the waters.
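To illustrate the INFO point, here is a minimal sketch of what a consumer has to do with that column: every annotation arrives as one semicolon-delimited string and must be split back into keys, comma-separated values, and bare flags. `parse_info` is a hypothetical helper, and the keys in the example string are only illustrative; real parsers also use the header to recover value types.

```python
def parse_info(info):
    """Split a VCF INFO column into a dict; flag keys map to True."""
    if info == ".":  # "." means the field is empty
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Values stay as strings here; their types live in the VCF header.
        parsed[key] = value.split(",") if sep else True
    return parsed


example = parse_info("DP=14;AF=0.5,0.25;DB")
```

Anything richer than flat key/value lists (nested annotations, per-transcript consequences) has to be squeezed through this same string, usually with a second ad hoc delimiter inside the value.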

There are three or four major types of successors to VCFs as variant warehouses that are worth mentioning.

Spark-based (requires a Spark cluster to scale):

  • hail.MatrixTable - based on Parquet. Hail powers a number of analyses on gnomAD, UK Biobank, and other large genomic datasets.
  • Glow Spark Dataframe - based on Spark and DeltaLake, Glow offers GloWGR, a distributed version of the regenie GWAS package. Provides user-defined functions (UDFs) and variant normalization functions.

Cloud-vendor managed solutions

Distributed SQL & NoSQL

  • OpenCGA - open-source project for storing variant data and associated metadata in MongoDB
  • Snowflake - closed-source distributed SQL engine

Multidimensional array based

  • SciDB - closed-source platform. Hosts large datasets including UK Biobank.

  • TileDB-VCF (requires a TileDB-Cloud account to scale) - an open source python package that uses serverless TileDB multidimensional arrays indexed on chr, pos, and sample. TileDB-VCF on TileDB-Cloud powers real-time queries for variant browsers as well as large notebook-based analyses that use task graphs in conjunction with UDFs. Disclaimer: I am the product manager for TileDB-VCF.

These solutions have vastly different performance, flexibility, and portability characteristics, as well as different cost structures, infrastructure needs, and varying levels of support for gVCF ref/no-call ranges (the n+1 problem), SVs, and pangenomic graph-based representations. It seems likely the growing interest in multi-omics - combining analyses of genomic variation with transcriptomics, proteomics, cytomics, and imaging - will also shape the future of variant warehouses.

ADD COMMENT
1

But you can't live out of your suitcase forever.

I'd compare it to a local department store vs Walmart - I think cloud/distributed storage based solutions only come into play when the number of samples is at least in the thousands and the data is frozen. For example, (at least as of 2017) hail needs to create a variant database from the VCF, after which querying becomes easy. The initial loading takes quite a while no matter how small the change.

ADD REPLY
0

OK, hospitals in particular definitely don't want to regenerate anything as individual new samples are added to the variant store. Even the allele frequencies need to be updated rapidly and automatically.
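As a toy illustration of updating allele frequencies without regenerating anything, the sketch below keeps running alternate-allele counts and touches only the variants carried by each new sample. It assumes every sample is diploid and confidently called at every site, which sidesteps the ref/no-call question entirely; `AlleleFrequencyStore` is a hypothetical class, not the API of any warehouse mentioned in this thread.

```python
from collections import Counter


class AlleleFrequencyStore:
    """Running allele counts that are updated one sample at a time."""

    def __init__(self):
        self.n_samples = 0
        self.alt_counts = Counter()  # (chrom, pos, ref, alt) -> alt allele count

    def add_sample(self, genotypes):
        """genotypes maps (chrom, pos, ref, alt) to the alt allele count (1 or 2)."""
        self.n_samples += 1
        for variant, alt_alleles in genotypes.items():
            self.alt_counts[variant] += alt_alleles

    def frequency(self, variant):
        """Alt allele frequency, assuming diploid calls at every site."""
        if self.n_samples == 0:
            return 0.0
        return self.alt_counts[variant] / (2 * self.n_samples)


store = AlleleFrequencyStore()
snv = ("chr1", 100, "A", "G")
store.add_sample({snv: 1})  # heterozygous carrier
store.add_sample({snv: 2})  # homozygous carrier
af_after_two = store.frequency(snv)
store.add_sample({})        # non-carrier: only the denominator changes
af_after_three = store.frequency(snv)
```

The cost of adding a sample scales with that sample's variants rather than with the size of the cohort, which is the property a frozen, rebuild-on-change store lacks.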

ADD REPLY
1

Exactly. Even when we were trying to leverage hail at my previous job, we ran into troubles constantly with data freezes. IMO academia is not well suited for WORM type data stores

ADD REPLY
1

Do any of these variant warehouses support complex queries like those that can be done via e.g. bcftools (view, norm, filter (expression), annotate, consensus) on VCF/BCF files? Any successor of VCF/BCF needs to offer both the storage and the query/analysis functionality that works out of the box on top of VCF/BCF.

Last time I checked (years ago), running queries via BCF(Tools) was much faster than a solution on Spark. Also note that BCF (binary) is much faster than VCF (text).

ADD REPLY
1

Great question. Yes, they all support those queries with varying levels of elegance and speed. Without naming names, I will say some of these warehouses are taking lossy shortcuts. Some are scalable but not really performant, as you mentioned. Others are just a headache to set up, or lock you in to a vendor. I can show you how to perform all of those queries and post-hoc annotations on TileDB if you are genuinely curious. Just PM me.

Going back to this legendary thread you started, I agree that for some of these Spark platforms VCF is something of an afterthought, because at the time they figured secondary analysis was a more pressing need. As it turned out, most of the current advances in rapid secondary analysis are now in GPU- and FPGA-accelerated software.

No one wants to deal with a 585M variant x 500k sample VCF/BCF. In that sense bcftools is no longer relevant in the biobank-era discussion of variant analysis.

ADD REPLY
0

I don't think this is a fair question. Though the solutions Jeremy describes may be robust in these ways, I think we as a community need to separate a file format from the infrastructure supporting it if we are to innovate/advance. That is, the ecosystem of tools built around a data file format is a stronger predictor of its continued use than the quality of the format itself, for any given set of metrics used to measure its efficacy.

One should not hold up-and-coming, but conceptually cleaner, approaches to the same rubric as the de facto standard adopted by the field (no matter how good or bad).

Rather, the entire problem is that a new form of representation has to overcome the massive hurdle that is not having any well-known, well-distributed tools available to analyze it!!! If any of the options Jeremy Leipzig mentioned had the same kind of ecosystem behind it, the .vcf would disappear faster than Sanger in 2004.

On the flip side, there are many problems that the .vcf file not only does not, but CANNOT EVER, solve. For instance, the question "does this SNV or indel lie near enough to another population-specific (that is to say, currently non-reference, and therefore less likely to be detected) variant that a TAD (topologically associating domain) is disrupted?" Without genomic context - which is regarded as a given by the .vcf format but will increasingly be shown not to be - one cannot answer these questions! The best one can do is GATK's FastaAlternateReferenceMaker, which mis-assigns reference genomic structure to all people alike at a given locus, regardless of their local ancestry.

Whatever the correct answer to this question is, it presupposes at minimum the primary sequence of not only the variant positions, but also the genomic context, which differs between human populations, etc.

ADD REPLY
7
17 months ago
LChart 4.5k

As a bit of history: the VCF (and to an extent even the .bam format) came out of the research groups working on the 1000G project because, internally, alignments, variants, and positions needed to be compared between methods, and it simply became too much of a headache to parse several different formats. The use of tags within the .bam spec (and INFO/FORMAT fields within VCF) derives in part from needing a flexible standard to accommodate method-specific statistics that certain technologies or pipelines require (e.g., color space from the old SOLiD machines, alignment scores from particular aligners). As such, the primary design consideration was interoperability for all small-variant discovery use cases. This meant that every method coming out of the consortium (which included samtools, bwa, freebayes, gatk, and others) used these formats, and it certainly helped drive adoption that some of the major sequencing centers at the time were involved in the project, and subsequently "committed" to the .bam/.vcf formats.

I expect something similar to happen for pangenomes: a consortium effort will decide on a "good enough" format that covers the use cases specific to that project, and the availability of large datasets and high-quality tools specific to that format will drive adoption for most cases. Edge cases will have to be "stuffed into" that format, or converted to a secondary format (as is the case for .bgen/.pgen).

ADD COMMENT
0

interesting answer, do you have any source that discusses this where i could read more?

ADD REPLY
0

Not in the form of text, but I can point you to the people (besides myself) who were involved in the low-level discussions at that time, if you're interested in writing a history of the genomics format wars.

ADD REPLY
0

sure if you think they would be willing to discuss. thanks!

VAL

ADD REPLY
0

Sure! let me know how to contact you off-forum.

ADD REPLY
0

Apologies - my email address and twitter are listed on my profile page.

ADD REPLY
3
17 months ago

I can't help but be reminded of this XKCD strip.

  1. Disagree. In my humble opinion, there is no such thing as an ideal data format, because one will inevitably struggle with conflicting objectives (e.g. compression, enabling efficient queries, flexibility, interoperability). But for a plain text file format, I think it is pretty well conceived. One should also keep in mind that it is supported by many tools and that any new format will have a hard time, at first, until it has been broadly adopted.
  2. Rather not, but I have too little knowledge about the pangenome tools and references to be able to judge. I presume that the problems with present-day multi-sample VCFs and with comparing variation between two cohorts are not fundamentally different from those of comparing against a pangenome reference.
  3. Depending on the queries that need to be run, storing the information in one of the many general-purpose graph databases / data formats out there might work (maybe as a weighted but undirected graph?). I assume the developers of TileDB have already racked their brains over this.
  4. There is no example that I can think of right now.
ADD COMMENT
0

interesting answer and i like the reference to TileDB.

with regard to 2. - this is the only place you and i might differ .. i think the problem stems not so much from comparing 2,3,4 or n cohorts as from the absence of a singular, linear reference in the first place. related is how to consider the problem of genomic coordinates, which also don't fit as neatly into a pangenomic world as they did in a singular linear one..

ADD REPLY
0

I see how the importance of genomic coordinates and fixed-length scaffolds decreases if pangenomic references become the norm, yet I think they will still be relevant for single individuals or samples.

After all, the pangenomic complexity is hardly relevant for e.g. expressing aberrations within a tumour sample (except chromothripsis cases). During analysis, the pangenomic information from large cancer sequencing consortia will certainly be used for scoring and filtering variants and eliminating false positives, but I posit the final result will still be expressed with respect to a linear reference genome for the clinicians?

ADD REPLY
3
17 months ago
d-cameron ★ 2.9k

1) Most simply, agree or disagree, and why?

Disagree

2) If you do disagree, does that disagreement have anything to do with the shift from a singular, linear reference to a pangenome reference, or not?

VCF wasn't the best file format even for a linear reference genome. It's particularly poor for structural variation and one can clearly see by the design that it was built for SNVs and not SVs.

3) Assuming you disagree, what would you recommend be used in place of a .vcf file?

A format that has widespread support and adoption (that's what VCF has going for it now and is one of the most important features of a standard):

  • Native binary format (instead of text+gzip)
  • Good compression and sub-linear scaling (i.e. usable with 1M+ samples)
  • No redundant information (c.f. VCF breakpoint records)
  • Haplotype-centric format, not variant-centric
  • Single unique representation of haplotypes
  • Support for both simple events as well as STRs, VNTRs, and complex rearrangements
  • Support for ambiguously placed variants (i.e. I know there's a SNP, but I don't know if it's in SMN1 or SMN2)
  • Graph-centric format
  • no-call vs ref-call separation (e.g. gVCF)

I have not yet come across a format that satisfies these requirements.
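To illustrate just the last requirement in that list, here is a minimal sketch of why the separation matters: with gVCF-style reference blocks, a position with no variant record can be told apart as "confidently reference" versus "not assessed", whereas a variants-only file cannot make that distinction. The function name and the inputs are hypothetical.

```python
def call_status(pos, variant_positions, ref_blocks):
    """Classify a 1-based position for one sample.

    variant_positions: set of positions carrying a variant call
    ref_blocks: inclusive (start, end) ranges confidently called as reference
    """
    if pos in variant_positions:
        return "variant"
    if any(start <= pos <= end for start, end in ref_blocks):
        return "ref-call"
    return "no-call"


variant_positions = {150}
ref_blocks = [(100, 149), (151, 200)]
statuses = [call_status(p, variant_positions, ref_blocks) for p in (120, 150, 250)]
```

Position 250 falls outside every block, so it is reported as a no-call rather than silently counted as reference.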

4) Finally, whether you agree or not, what examples would you furnish to support your view?

I'm a VCF specifications maintainer - I'm well aware of how terrible it is/can be as a file format. Have you ever tried to work with complex SVs in VCF? Other poorly performing edge cases are scaling to 100k+ samples and populating the GL field for a strawberry (or any other high-ploidy sample).
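A quick way to see the high-ploidy problem: the number of unordered genotypes, and hence of GL values per sample per site, is the number of multisets of size ploidy drawn from the alleles, i.e. C(alleles + ploidy - 1, ploidy). The sketch below just evaluates that count; the octoploid case assumes cultivated strawberry, and `n_genotype_likelihoods` is a hypothetical helper name.

```python
from math import comb


def n_genotype_likelihoods(ploidy, n_alleles):
    """Number of unordered genotypes for a given ploidy and allele count."""
    return comb(n_alleles + ploidy - 1, ploidy)


diploid_biallelic = n_genotype_likelihoods(2, 2)
octoploid_biallelic = n_genotype_likelihoods(8, 2)
octoploid_four_alleles = n_genotype_likelihoods(8, 4)
```

A diploid biallelic site needs 3 values, an octoploid biallelic site needs 9, and an octoploid site with four alleles already needs 165 per sample.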

ADD COMMENT
0

@d-cameron: how close does GFA2 come to meeting these requirements?

I think pretty close, right?

ADD REPLY
