What is currently the best user-friendly (visual and interactive) VCF/BCF mining tool in 2021, for VCF/BCF files similar in size to, or even larger than, the 1000 Genomes VCF?
I guess most organizations do not have a visual and interactive VCF mining tool but use either:
- A website front-end + batch query system back-end: submit your query and wait a few minutes to hours to get back results. Maybe you get no results back, too many results, or wrong results. And then you repeat.
- A (junior) bioinformatician who runs one or a few queries on the command line every time a biologist without Linux/programming experience has a question.
I already asked this question around 5 years ago, and wonder what the situation is now.
The scale I have in mind: 100M+ variants, 1000+ samples, compressed BCF file size 500G+, uncompressed VCF several TB+.
One requirement is that it should do all kinds of filtering that bcftools view does:
http://www.htslib.org/doc/bcftools.html#view
But bcftools does not meet the interactive and visual requirements. It is only interactive for small VCF files, or when you use the tabix index to limit the query to a small region.
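To make the region-limited case concrete, here is a minimal sketch in Python that only assembles a `bcftools view` command line (it does not run it). The helper name `build_bcftools_view_cmd`, the file name `cohort.bcf`, the sample names and the example region are hypothetical; `-r`, `-s` and `-i` are the regions, samples and include-expression options of `bcftools view`.

```python
import shlex


def build_bcftools_view_cmd(path, region=None, include_expr=None, samples=None):
    """Assemble a `bcftools view` command line as a list of arguments."""
    cmd = ["bcftools", "view"]
    if region:
        # -r needs an indexed file; only the requested region is read.
        cmd += ["-r", region]
    if samples:
        # -s takes a comma-separated list of sample names.
        cmd += ["-s", ",".join(samples)]
    if include_expr:
        # -i keeps only the sites for which the expression is true.
        cmd += ["-i", include_expr]
    cmd.append(path)
    return cmd


cmd = build_bcftools_view_cmd(
    "cohort.bcf",                      # hypothetical file name
    region="chr1:1000000-2000000",     # hypothetical region of interest
    include_expr="QUAL>=30",
    samples=["sampleA", "sampleB"],    # hypothetical sample names
)
# Print a shell-quoted version that could be pasted into a terminal.
print(shlex.join(cmd))
```

With a region the index lets bcftools skip most of the file, which is why this case feels interactive. Without `-r`, the same command has to stream through every record.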
Another requirement is that the filtering is visual and interactive, like for example with a small genotype matrix in Excel. (I know, bad idea, but at least Excel is interactive, visual and biologist friendly.)
By interactive I mean that a filter criterion can be adjusted and you get back your updated result genotype matrix in semi real-time (a few seconds to 1 minute). The tool should stay interactive even for complex queries where all 100M+ variants for all 1000+ samples have to be scanned.
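As a rough back-of-envelope check of what that latency target implies, the sketch below computes the scan rate needed to touch every genotype call once. The numbers come from the question; the 5 s lower bound for "a few seconds", the reading of "500G" as 500 GB, and the function name are my own assumptions.

```python
def required_rate(total_units, seconds):
    """Units per second needed to process `total_units` within `seconds`."""
    return total_units / seconds


N_VARIANTS = 100_000_000      # "100M+ variants" from the question
N_SAMPLES = 1_000             # "1000+ samples" from the question
BCF_BYTES = 500 * 10**9       # assumption: "500G" means 500 GB compressed

total_calls = N_VARIANTS * N_SAMPLES   # one genotype call per variant per sample

for budget in (5, 60):        # assumed bounds for "a few seconds to 1 minute"
    calls_per_s = required_rate(total_calls, budget)
    bytes_per_s = required_rate(BCF_BYTES, budget)
    print(f"{budget:>2} s budget: {calls_per_s:.2e} genotype calls/s, "
          f"{bytes_per_s / 1e9:.1f} GB/s of compressed BCF")
```

So a full scan means about 10^11 genotype calls. Doing that in one minute needs roughly 1.7 billion calls per second, and in five seconds about 20 billion per second. That is why I expect a single streaming pass over the file not to be enough, and some form of indexing, pre-aggregation or parallel scanning to be needed.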
Does something like this already exist? If so which tools?
I am mostly curious about which open source solutions there are, but also whether there are any commercial solutions.
See also this older question and answers:
Which Type Of Database Systems Are More Appropriate For Storing Information Extracted From Vcf Files
I am/was hoping that nowadays something like the following exists:
- a scalable database (cluster) (e.g. MongoDB, Spark, etc.) that stores the content of a large VCF/BCF: variants and genotypes
- a bcftools view-like domain-specific language for running queries
- results reported (in full, paginated, or summarized) in a website or fat GUI.
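To illustrate the shape of such a setup, here is a toy sketch of the three parts above. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for a scalable store. The schema, the table and column names, the data and the `query_page` helper are all made up for illustration, and this says nothing about how it would perform at 100M+ variants.

```python
import sqlite3

# In-memory stand-in for the "scalable database" (assumption: toy schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants (
        variant_id INTEGER PRIMARY KEY,
        chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, qual REAL
    );
    CREATE TABLE genotypes (
        variant_id INTEGER REFERENCES variants(variant_id),
        sample TEXT,
        alt_count INTEGER  -- 0 = hom-ref, 1 = het, 2 = hom-alt
    );
""")

# Made-up example data: 4 variants x 3 samples.
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "chr1", 1000, "A", "G", 50.0),
    (2, "chr1", 2000, "C", "T", 12.0),
    (3, "chr2", 1500, "G", "A", 80.0),
    (4, "chr2", 3000, "T", "C", 45.0),
])
conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?)", [
    (1, "s1", 0), (1, "s2", 1), (1, "s3", 2),
    (2, "s1", 1), (2, "s2", 0), (2, "s3", 0),
    (3, "s1", 0), (3, "s2", 0), (3, "s3", 0),
    (4, "s1", 2), (4, "s2", 2), (4, "s3", 1),
])


def query_page(min_qual, min_alt_alleles, page, page_size=1):
    """Return one page of variants passing both filters, ordered by position."""
    sql = """
        SELECT v.chrom, v.pos, v.ref, v.alt, SUM(g.alt_count)
        FROM variants v
        JOIN genotypes g ON v.variant_id = g.variant_id
        WHERE v.qual >= ?
        GROUP BY v.variant_id
        HAVING SUM(g.alt_count) >= ?
        ORDER BY v.chrom, v.pos
        LIMIT ? OFFSET ?
    """
    args = (min_qual, min_alt_alleles, page_size, page * page_size)
    return conn.execute(sql, args).fetchall()


# Adjusting min_qual or min_alt_alleles and re-running is the "interactive" part.
first_page = query_page(min_qual=30, min_alt_alleles=1, page=0)
second_page = query_page(min_qual=30, min_alt_alleles=1, page=1)
print(first_page)
print(second_page)
```

The filter on `qual` plays the role of a site-level bcftools expression, the `HAVING` clause is a genotype-derived filter (alternate allele count across samples), and `LIMIT`/`OFFSET` gives the paginated reporting a web front-end would need.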
I believe that many people use Excel or some other software to analyze already annotated VCFs. I know it doesn't really apply to the same type of events, but I recently used JBrowse, which has a visual and interactive interface, to analyze a structural variant VCF.
Genome browsers like JBrowse/IGV work fine, but only for a few samples and a few variants/regions of interest. That is fine if you are already at that level, but not if you still need to get to "small data" (= a few regions of interest/a few samples of interest).
Okay, maybe working with Hail on Databricks on AWS could be an option in this case.
I have looked at Hail in the past and found that it can do GWAS and PCA on large VCF files quickly, but not (as far as I know) filter a large VCF file like bcftools view does.
I just found the BGT tool (by Heng Li): https://github.com/lh3/bgt. It's not visual, but seems to allow for very flexible queries. Edit: actually it has a web interface.
Of the visual tools I've only found VIVA (written in Julia), which I mentioned below. It is still under development, but it looks promising.