Which Types Of Database Systems Are More Appropriate For Storing Information Extracted From VCF Files?
11.7 years ago
alex ▴ 250

Hi all, I am looking for a DBMS to store data from our analyzed VCF files.

http://www.1000genomes.org/node/101

When I say analyzed, I mean data that has been through alignment and some downstream processing steps but is still in VCF format.

My question is: does anyone have experience with a database system that works well for this? Our problem is that an index chosen to optimize one kind of query may not be optimal for the next question you want to ask. We are looking at key/value stores, RDBMS, graph databases, and BigTable-style systems. I only have experience with RDBMS, so I would be interested in hearing any and all experience with this problem. Thanks!

ngs warehousing vcf genomics • 15k views

Somewhat similar, but that question deals with whether or not to import VCF at all. I am saying that I am importing it into a DBMS and was wondering what system would be optimal.


But isn't that still a very similar question? It may have the word "import" in it, but importing also implies making use of a database system to represent VCF files. Your question as you pose it feels more nebulous. The evaluation of a database system depends solely on the queries that one wishes to perform, and from this question we don't know what your needs are.


True, but that is why I am trying to learn what other people's needs have been, which database systems worked for them, and why. I am in a new division and have been asked to build out a VCF database for querying, so the questions that will be asked are not yet well known.


I think the answer here depends a lot on what sorts of questions you are asking. What do you want to get out of the data? VCF records have always seemed like a good fit for a document store (e.g., Cassandra or the interesting-but-new HyperDex), but that depends entirely on the questions you want to ask of the data.
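
For instance, a single VCF record mapped into a JSON-style document might look roughly like this (a minimal sketch; the example line and the field handling are purely illustrative):

# Minimal sketch of one VCF record as a "document"; field names are illustrative.
import json

vcf_line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5"
chrom, pos, vid, ref, alt, qual, flt, info = vcf_line.split("\t")[:8]

doc = {
    "chrom": chrom,
    "pos": int(pos),
    "id": vid,
    "ref": ref,
    "alt": alt.split(","),          # multi-allelic sites become a list
    "qual": float(qual),
    "filter": flt,
    # INFO key=value pairs become nested fields
    "info": dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv),
}
print(json.dumps(doc, indent=2))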

11.7 years ago
lh3 33k

I agree with Pierre and Istvan that your question is essentially the same as the old question asked two years ago. Nonetheless, at that time tabix had just arrived, there was no BCF2, and 1000g VCF files were much smaller than they are now. I will update my comments here.

Neither VCF/BCF2 nor databases are good for everything. VCF/BCF2 is not flexible: the only query you can perform on a VCF, as of now, is to retrieve data from a genomic region. If you want a list of rs# with allele frequency below a threshold, you are out of luck - you have to parse the entire VCF, which is slow. On the other hand, generic RDBMS are usually not as efficient as specialized read-only binary formats. With a database, you would need dedicated hardware and non-trivial configuration to retrieve alignments/genotypes in a region, something NGS tools can do on a mediocre PC. That is why almost no one loads read sequences into an RDBMS for typical use cases. As for VCF, some preliminary tests done by my colleagues suggest that interval queries on a huge BCF2 on a standard computing node are much faster (a couple of orders of magnitude, as I remember) than the same queries with MongoDB on a dedicated server with huge memory.

In my opinion, you should keep the static primary data in BCF2 or in tabix'd VCF, and keep meta information and summary statistics (e.g. position, allele frequency and rs#) in an RDBMS. When you do not need individual genotypes, query the database to get what you want; when you need to look up genotypes at a position or in a region, query the BCF2/VCF. This way you get the performance of BCF2 and the flexibility of a database at the same time. A more efficient implementation would also store the VCF/BCF2 virtual file offset in the database, but that is harder unless you are fairly familiar with the format and its index.
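
To make that concrete, here is a minimal sketch of the split, assuming a tabix-indexed VCF queried through pysam and an SQLite table of per-site summary statistics; the file, table and column names are all illustrative:

# Minimal sketch of the hybrid layout: summary statistics live in a small
# relational table, genotypes stay in a tabix-indexed VCF.
# Assumes sites.db (with a site_summary table) and calls.vcf.gz(.tbi) already exist.
import sqlite3
import pysam

db = sqlite3.connect("sites.db")

# 1) Flexible query on the summary table: e.g. rare sites with an rs#.
rows = db.execute(
    "SELECT chrom, pos, rsid FROM site_summary "
    "WHERE allele_freq < 0.01 AND rsid IS NOT NULL"
).fetchall()
print(len(rows), "candidate sites")

# 2) Fast interval query on the primary data for one region of interest.
vcf = pysam.TabixFile("calls.vcf.gz")
for record in vcf.fetch("20", 1000000, 1010000):
    fields = record.split("\t")
    print(fields[0], fields[1], fields[3], fields[4])  # chrom, pos, ref, alt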

On BCF2 vs. tabix'd VCF: BCF2 is by far faster to read because you save the time spent parsing VCF, which is the bottleneck. BCF2 also comes with a better index that performs better when a region contains many records. The downside of BCF2 is that it is not well supported: the C implementation is largely complete, but GATK only supports uncompressed BCF2, and no Perl/Python/Ruby bindings exist so far. Tabix'd VCF works fine for data at the scale of 1000g. If you do not retrieve and parse many records per query, tabix'd VCF may be the better choice for its wider support.

Note that even BCF2 is not the right solution for up to 1 million samples. New and better solutions will emerge as we approach that scale.

11.7 years ago

While, in general, I agree with Heng's points, we have found tabix and vcftools very useful for some analyses but a bit cumbersome for more intricate explorations of genetic variation. We find this to be especially true when one wants to put those variants in a broader context through comparisons to many different genome annotations.

Our solution, while not yet final, is a new tool that we have been developing called gemini. Our hope is for gemini to be used as a standard framework for exploring genetic variation in both family-based studies of disease and broader population genetic studies. The basic gist is that you load a VCF file into a Gemini database (SQLite is the backend, for portability and flexibility). As each variant is read from the VCF, it is annotated via comparisons to many different genome annotation files (e.g., ENCODE, dbSNP, ESP, UCSC, ClinVar, KEGG). Tabix is used to expedite the comparisons. The variants and the associated annotations are stored in a variants table. The attractive aspect to us is that the database framework itself is the API: once the data is loaded, one can ask quite complex questions of one's data. With the help of Brad Chapman and Rory Kirchner, we have parallelized the loading step, which is, by far, the slowest aspect. To use multiple cores, one would do:

gemini load -v my.vcf --cores 20 my.db

One can also farm the work out to LSF, SGE, and Torque clusters. Once loaded, one can query the database via the command line:

gemini query -q "select chrom, start, end, ref, alt, \
                         aaf, hwe, in_dbsnp, is_lof, impact, num_het \
                         from variants" \
             my.db

Also, we represent sample genotype information as compressed numpy arrays stored in binary BLOB columns, to minimize storage and allow scalability (as opposed to a genotypes table with N variants * M samples rows, which does not scale). As such, we have extended the SQL framework to allow struct-like access to the genotype info; a rough sketch of the BLOB idea follows the query below. For example, the following query finds rare, LoF variants meeting an autosomal recessive inheritance model (note that the --gt-filter option uses Python, not SQL, syntax).

gemini query -q "select chrom, start, end,
                        ref, alt, gene,
                        impact, aaf, gts.kid
                 from variants
                 where in_dbsnp = 0
                 and   aaf < 0.01
                 and   in_omim = 1
                 and   is_lof  = 1" \
             --gt-filter "gt_types.mom = HET
                          and
                          gt_types.dad = HET
                          and
                          gt_types.kid = HOM_ALT" \
             my.db
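
A rough sketch of that BLOB idea (this is not Gemini's actual serialization format; it only illustrates packing one genotype vector per variant into a single compressed column):

# Illustrates storing a per-variant genotype vector as one compressed BLOB,
# rather than one row per (variant, sample). Encoding values are an example.
import sqlite3
import zlib
import numpy as np

gt_types = np.array([0, 1, 1, 3], dtype=np.uint8)   # e.g. HOM_REF/HET/HET/HOM_ALT

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE variants (chrom TEXT, start INT, gt_types BLOB)")
db.execute("INSERT INTO variants VALUES (?, ?, ?)",
           ("chr1", 12345, sqlite3.Binary(zlib.compress(gt_types.tobytes()))))

blob, = db.execute("SELECT gt_types FROM variants").fetchone()
restored = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
print(restored)   # [0 1 1 3]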

There is also a Python interface allowing one to write custom Python scripts that interface with the Gemini DB. An example script can be found at: http://gist.github.com/arq5x/5236411.
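
As a flavor of what such a script can look like, something along these lines should work, assuming the GeminiQuery interface described in the Gemini documentation (the query and column names are just an example):

# Indicative example of the Python interface; assumes the GeminiQuery class
# from the Gemini docs and the variants columns used in the queries above.
from gemini import GeminiQuery

gq = GeminiQuery("my.db")
gq.run("select chrom, start, end, ref, alt, impact, aaf "
       "from variants where is_lof = 1 and aaf < 0.01")
for row in gq:
    print(row)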

There are many built-in tools for things such as finding compound hets, de novo mutations, protein-protein interactions, pathway analysis, etc.

Gemini scales rather well. We recently loaded the entire 1000 Genomes VCF (1,092 samples) in 26 hours using 30 cores, and the queries are surprisingly fast. While this scale is not our current focus (we are more focused on medical genetics), it is encouraging to see it work well with hundreds of individuals.

If interested, check out the documentation. Comments welcome.

11.7 years ago

I wrote a tool to put a VCF in a SQLite db: http://code.google.com/p/variationtoolkit/wiki/Vcf2Sqlite

I never used it, because it's always faster and easier to parse the VCF from scratch with a command line, with a workflow engine (biologists use KNIME here), etc. The only database I would build is one holding the path to the tabix-indexed VCF.gz on the server and some links to the associated sample/project.
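
A minimal sketch of that "database of pointers" idea, with invented table and column names and placeholder paths:

# Store only the path to each tabix-indexed VCF.gz plus its sample/project,
# and open the file on demand. Paths below are placeholders.
import sqlite3
import pysam

db = sqlite3.connect("vcf_catalog.db")
db.execute("""CREATE TABLE IF NOT EXISTS vcf_files (
                  sample  TEXT,
                  project TEXT,
                  path    TEXT
              )""")
db.execute("INSERT INTO vcf_files VALUES (?, ?, ?)",
           ("NA12878", "1000g", "/data/NA12878.vcf.gz"))

# Look up the file for one sample, then let tabix do the region query.
path, = db.execute("SELECT path FROM vcf_files WHERE sample = ?",
                   ("NA12878",)).fetchone()
for record in pysam.TabixFile(path).fetch("20", 1000000, 1010000):
    print(record)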

In my bookmarks: http://www.lovd.nl/2.0/ "Leiden Open (source) Variation Database."

11.7 years ago

What Heng describes in his answer is a hybrid, layered solution. Note that such hybrid solutions allow you to support more complex and varied queries (compared to a single technology) at the expense of potentially increased complexity and data redundancy. For example, you might use a graph database to relate genes to diseases and pathways, MongoDB for storing gene information in a "document", a relational database over the VCF/BCF2 content to allow arbitrary (but slow) queries, and an HDF5 file to hold processed results as a data cube. In the end, the design is driven entirely by the queries and performance constraints.
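
As one concrete illustration of the layering, caching a processed genotype matrix as an HDF5 data cube might look roughly like this (using h5py; the array shape and dataset name are made up):

# Illustrative only: dump a processed variants-by-samples matrix into an
# HDF5 "cube" for fast slicing later.
import h5py
import numpy as np

genotypes = np.random.randint(0, 3, size=(1000, 100), dtype=np.int8)  # variants x samples

with h5py.File("genotype_cube.h5", "w") as f:
    f.create_dataset("gt", data=genotypes, compression="gzip", chunks=True)

with h5py.File("genotype_cube.h5", "r") as f:
    region_slice = f["gt"][100:200, :]   # slice a row range without loading everything
print(region_slice.shape)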

So, to answer your original question, there is no absolute optimal storage solution for VCF data.

11.7 years ago
Chris Cole ▴ 800

I think this is a fair question. It's unfair to dismiss it so quickly.

The issue (for me) with plain VCF files is that there's no information regarding the downstream consequence of a variant. Adding this information via e.g. VEP is critical for understanding the biological significance of the important variants, and querying it across multiple text files is a PITA.

I currently keep my data in MySQL to allow querying against amino acid changes, PolyPhen predictions, etc. It's extremely simple (two tables) at the moment, while we get an idea of the kinds of questions we want to ask of the data. With that information in hand I can create a better-suited schema, if necessary.
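
A two-table layout of this kind might look roughly like the sketch below (sqlite3 is used only to keep the example self-contained; the actual setup here is MySQL, and all table and column names are invented):

# Guess at a minimal variants + consequences layout; not the real schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE variants (
    variant_id INTEGER PRIMARY KEY,
    chrom TEXT, pos INT, ref TEXT, alt TEXT
);
CREATE TABLE consequences (           -- one row per VEP consequence
    variant_id INT REFERENCES variants(variant_id),
    gene TEXT,
    aa_change TEXT,                   -- e.g. p.Arg97Cys
    polyphen TEXT                     -- e.g. probably_damaging
);
""")

# e.g. all variants predicted damaging by PolyPhen
rows = db.execute("""SELECT v.chrom, v.pos, c.gene, c.aa_change
                     FROM variants v JOIN consequences c USING (variant_id)
                     WHERE c.polyphen = 'probably_damaging'""").fetchall()
print(rows)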

The only reason I've used MySQL is that I have experience with it. I believe that a NoSQL solution might work well, but don't really have time to investigate that as an option.

Update

There is a new project called gemini which seems to be a good solution: http://gemini.readthedocs.org/en/latest/

10.7 years ago
Amos ▴ 50

My general inclination has been to follow the same workflow lh3 suggests in the accepted answer, but I stumbled on SciDB recently (fast selects, in-database calculations) and was looking around to see whether anyone has tried it. There is a VCF loader available at https://github.com/slottad/scidb-genotypes
