VCF to human friendly form?
2.3 years ago
wormball ▴ 10

Hello!

As we all know, the VCF format is human readable. But in my opinion it is not human friendly. MAF is slightly better, but not by much. For example, if I want to see which mutations are present in which samples, I have to browse tons of text in the VCFs for all my samples, because if I merge those VCFs, most of the information will be lost. And if I want to merge these VCFs or process them in some other way, I have to use command-line tools (or even programming), which requires some skill and effort.

Are there any tools that can, e.g., produce a table from several samples with mutations colored by some expression? Or even annotate which mutation is possibly carcinogenic? Or do other human friendly things with vcfs.

Which tools do you use to not get lost in vcfs?

Thanks in advance.

vcf visualization • 1.8k views

Whilst I appreciate VCFs can be quite daunting, the issue with these kinds of GUI-based browsers is that if they don't have the exact functionality you need, there's nothing you can do about it. Learning a tiny bit of bash and how to use bcftools query (I promise it's easy and occasionally possibly fun) is hugely more flexible and will save you a bunch of time in the long run.
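
To make the idea concrete: with bcftools installed, something like `bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' in.vcf.gz` prints one row per variant with one genotype column per sample. For anyone who would rather see the logic spelled out, here is a rough Python sketch of the same flattening using only the standard library. The VCF text, the sample names `S1`/`S2`, and the function name `genotype_table` are made up for illustration, and the parser assumes a well-formed, tab-separated VCF whose FORMAT column contains GT.

```python
import io

# Tiny made-up multi-sample VCF, used only for illustration.
VCF_TEXT = """\
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2
chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0
chr2\t200\t.\tC\tT\t60\tPASS\t.\tGT\t1/1\t0/1
"""


def genotype_table(handle):
    """Return (header, rows): one row per variant, one GT column per sample."""
    samples, rows = [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at column 10
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        # Locate GT within the colon-separated FORMAT keys
        # (raises ValueError if the record has no GT key).
        gt_index = fields[8].split(":").index("GT")
        genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
        rows.append([chrom, pos, ref, alt] + genotypes)
    return ["CHROM", "POS", "REF", "ALT"] + samples, rows


header, rows = genotype_table(io.StringIO(VCF_TEXT))
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

The resulting table opens directly in a spreadsheet, which is often all the "human friendly" view that is needed. For real files (compressed, indexed, multi-allelic), bcftools or a proper VCF library is the safer choice.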


There are likely commercial tools that address this, but you are probably looking for free software.

Limited version is free: https://www.goldenhelix.com/products/VarSeq/viewer-download.html

So is: https://www.goldenhelix.com/products/GenomeBrowse/index.html

2.3 years ago

I wrote vcf2table: http://lindenb.github.io/jvarkit/VcfToTable.html

I also wrote https://lindenb.github.io/jvarkit/SwingVcfView.html


Sounds very interesting! However, your image links seem to be broken. Are there alternative image links or a link to an exemplary HTML report of the VcfToTable tool?


your image links seem to be broken

Yeah, I left Twitter.


2.3 years ago
d-cameron ★ 2.9k

Which tools do you use to not get lost in vcfs?

For most analyses, determining which variants are present in the sample is merely the start. What you do with this information depends entirely on what you want to get out of your analysis. There are many tools that perform a variety of analyses based on a VCF; which ones you should run, and what analysis you need to do yourself, depends entirely on your question and why you sequenced your samples in the first place. Is this a clinical test? If it is research, what question are you trying to answer? It all depends.

Or even annotate which mutation is possibly carcinogenic?

There's a class of tools known as Variant Effect Predictor (VEP) tools that predict the protein impact of each mutation. These typically add additional annotation fields to the VCF, so you're still in VCF-land.
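
As a rough illustration of what "still in VCF-land" looks like: Ensembl VEP, for one, writes its predictions into a `CSQ` INFO field and lists the pipe-separated subfield names after `Format:` in the matching header line. The Python sketch below unpacks that into one record per annotation. The VCF text, gene names, and the function name `parse_csq` are made up for illustration, the subfield list is deliberately shortened, and other annotators use different field names and layouts, so treat this as a sketch rather than a general parser.

```python
# Made-up, heavily shortened example of a VEP-style annotated VCF.
VCF_TEXT = """\
##fileformat=VCFv4.2
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t100\t.\tG\tA\t.\tPASS\tDP=30;CSQ=A|stop_gained|HIGH|GENE1,A|downstream_gene_variant|MODIFIER|GENE2
"""


def parse_csq(text):
    """Return one dict per CSQ entry, keyed by the subfield names in the header."""
    csq_fields, records = [], []
    for line in text.splitlines():
        if line.startswith("##INFO=<ID=CSQ"):
            # Subfield names follow "Format: " in the Description string.
            csq_fields = line.split("Format: ")[1].rstrip('">').split("|")
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            # INFO is a semicolon-separated list of KEY=VALUE pairs (flags have no "=").
            info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
            # One comma-separated entry per (allele, transcript/feature) annotation.
            for entry in info.get("CSQ", "").split(","):
                if entry:
                    record = {"CHROM": cols[0], "POS": cols[1]}
                    record.update(zip(csq_fields, entry.split("|")))
                    records.append(record)
    return records


records = parse_csq(VCF_TEXT)
for record in records:
    print(record["CHROM"], record["POS"], record["SYMBOL"],
          record["Consequence"], record["IMPACT"], sep="\t")
```

Once the annotations are in a list of dicts like this, filtering for, say, `IMPACT == "HIGH"` is a one-line list comprehension.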

There are a bunch of oncology papers and databases that contain information about mutations that predispose carriers to cancer (e.g. BRCA mutations).

There are many, many papers about mutation X being associated with (or driving) phenotype/disease Y. Genome-wide association studies (GWAS) identify genes associated with a particular disease/trait and take VCFs as input.

Again, it all depends on what you want to know.

Or do other human friendly things with vcfs.

There are plenty of systems that use VCF as their variant representation format and produce human-readable output. I'll use a system I was involved in as an example. https://oncoact.nl/how-does-oncoact-work/?lang=en#patientrapport goes from tumour/normal sequencing to an oncology patient report that summarises cancer treatment options based on the mutations found in the sample. Generating the VCF is only a small part (albeit the most computationally expensive one) of this process. Generating the report requires identifying the clinically actionable variants, looking up multiple databases, variant effect prediction, multiple intermediate VCFs (SNV/indel, SV/CNV, plus additional tools such as MSI status and viral integration detection), variant prioritisation, and a bunch of other things I'm sure I've missed.

But in my opinion it is not human friendly.

It's meant to be human-readable, but it's never going to be human friendly because there's so much information in there.

VCF is designed to be a variant interchange format so tools work together. As a contributor to the VCF specifications, I honestly don't care if VCF is difficult to read - the only time you should be looking at a VCF file for analysis purposes is to see what fields it contains. If you're looking deep into VCF files then you're missing the analysis part of your analysis pipeline and wasting your time doing something manually that could be done so much faster by a program. What matters for VCF is the universal interop of tools that read/write VCF.
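
Checking which fields a file contains doesn't have to mean scrolling through it either. A minimal Python sketch that lists the INFO and FORMAT fields declared in a header might look like this. The header text and the function name `declared_fields` are made up for illustration, and the regular expression assumes the usual `ID, Number, Type, Description` key order, which real files generally follow but which this sketch does not validate.

```python
import re

# Made-up header, for illustration only.
HEADER = """\
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
"""

# Assumes the keys appear in the order ID, Number, Type, Description.
FIELD_RE = re.compile(
    r'##(INFO|FORMAT)=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="([^"]*)"'
)


def declared_fields(text):
    """Return (kind, id, number, type, description) for each INFO/FORMAT header line."""
    matches = (FIELD_RE.match(line) for line in text.splitlines())
    return [m.groups() for m in matches if m]


fields = declared_fields(HEADER)
for kind, field_id, number, field_type, description in fields:
    print(kind, field_id, number, field_type, description, sep="\t")
```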

For example, if I want to see which mutations are present in which samples, I have to browse tons of text in the VCFs for all my samples, because if I merge those VCFs, most of the information will be lost.

All major bioinformatics languages have VCF parsing libraries (e.g. htslib (C/C++), htsjdk (Java), PyVCF (Python), VariantAnnotation (R/Bioconductor), noodles (Rust)). You definitely shouldn't be doing this by hand/manual inspection. If you're doing bioinformatics analysis, I strongly recommend learning some programming: the above operations could be done in R in around 5 lines of code. Alternatively, you might be able to use vcftools or bcftools (different programs) to convert your single-sample VCFs into a multi-sample VCF.
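
To give a feel for how little code "which mutations are in which samples" takes, here is a sketch in Python rather than R, using only the standard library rather than any of the libraries listed above. It builds a variant-by-sample presence table from several single-sample VCFs. The sample names, VCF text, and the function names `variant_keys` and `presence_matrix` are made up for illustration. It also makes some loud assumptions: any record in a sample's file counts as "present" (so no gVCF-style reference records), FILTER and genotype are ignored, and variants are matched on exact CHROM/POS/REF/ALT without normalisation.

```python
import io


def variant_keys(handle):
    """Yield (chrom, pos, ref, alt) for each ALT allele of each record."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # split multi-allelic records
            yield chrom, int(pos), ref, alt


def presence_matrix(vcfs):
    """vcfs maps a sample name to an open text handle on that sample's VCF."""
    calls = {name: set(variant_keys(handle)) for name, handle in vcfs.items()}
    variants = sorted(set().union(*calls.values()))
    return {v: {name: v in found for name, found in calls.items()} for v in variants}


# Two made-up single-sample VCFs.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS\n"
sample_a = HEADER + "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\n" + "chr2\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\n"
sample_b = HEADER + "chr2\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\n"

matrix = presence_matrix({"A": io.StringIO(sample_a), "B": io.StringIO(sample_b)})
print("CHROM", "POS", "REF", "ALT", "A", "B", sep="\t")
for (chrom, pos, ref, alt), row in matrix.items():
    print(chrom, pos, ref, alt, *("x" if row[s] else "." for s in row), sep="\t")
```

On the command-line side, `bcftools merge` is the usual way to combine bgzipped, indexed single-sample VCFs into one multi-sample VCF. A sketch like the one above is mainly useful when all you want is a quick presence/absence table.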

TLDR: you're missing the analysis part of your analysis pipeline. What tools/program/code you need depends entirely on what that analysis actually is.
