I am working on a prototype for the abstract requirements below:
- Clients upload *.vcf files into our portal
- The system will process each VCF, parse the data, and load it into a database (yet to be decided)
- Clients will use APIs to request filtered variant data across patients or for a given patient (a rough endpoint sketch follows this list)
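To make the last point concrete, here is a rough sketch of what the query API might look like. It assumes a FastAPI service backed by MongoDB via pymongo; that stack is only my working assumption, and the endpoint path, field names, and the "variants" collection are placeholders rather than a committed design.

```python
# Hypothetical variant-query endpoint: FastAPI + pymongo are assumptions, not a
# committed stack, and the field/collection names are illustrative only.
from fastapi import FastAPI, Query
from pymongo import MongoClient

app = FastAPI()
variants = MongoClient("mongodb://localhost:27017")["genomics"]["variants"]

@app.get("/variants")
def get_variants(
    patient_id: str | None = None,   # omit to search across all patients
    gene: str | None = None,         # e.g. filter by annotated gene symbol
    chrom: str | None = None,
    min_qual: float = 0.0,
    limit: int = Query(100, le=1000),
):
    """Return variant documents matching the given filters."""
    query: dict = {"qual": {"$gte": min_qual}}
    if patient_id:
        query["patient_id"] = patient_id
    if gene:
        query["annotations.gene"] = gene
    if chrom:
        query["chrom"] = chrom
    return list(variants.find(query, {"_id": 0}).limit(limit))
```

A request like `GET /variants?gene=BRCA1&min_qual=30` would then cover the cross-patient case, and adding `patient_id` would narrow it to a single patient.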
We are evaluating best practices for storing genomic data and how current vendors handle it. I am considering MongoDB, since there will be a lot of querying on the variants.
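This is the kind of ingest step I have in mind, as a sketch only. It assumes pysam for parsing and pymongo for storage, with one document per variant record per sample; the "genomics.variants" collection and the field names are illustrative.

```python
# Sketch of a VCF-to-MongoDB ingest step. pysam and pymongo are assumptions;
# the "genomics.variants" collection and field names are illustrative only.
import pysam
from pymongo import MongoClient

def ingest_vcf(vcf_path: str, patient_id: str) -> int:
    """Load each variant record of a VCF into one MongoDB document."""
    collection = MongoClient("mongodb://localhost:27017")["genomics"]["variants"]
    docs = []
    with pysam.VariantFile(vcf_path) as vcf:
        for rec in vcf:
            docs.append({
                "patient_id": patient_id,
                "chrom": rec.chrom,
                "pos": rec.pos,
                "ref": rec.ref,
                "alt": list(rec.alts or []),
                "qual": rec.qual,
                "filter": list(rec.filter.keys()),
                # INFO values can be tuples, so normalize them to lists for BSON
                "info": {k: list(v) if isinstance(v, tuple) else v
                         for k, v in rec.info.items()},
            })
    if docs:
        collection.insert_many(docs)
    return len(docs)
```

One document per variant per sample keeps the per-patient queries simple; the other common layout is one document per site with genotypes embedded, which makes cross-patient queries cheaper at the cost of more complex updates.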
I did some research on exome sequencing, GATK, and VCF file formats, but before we start I wanted to get expert advice. I am also not sure how complex the clients' queries will become, or how useful this would be to research in practice. These are the two questions I am trying to gather more information on before we start the prototype. Any help or suggestions would be of great value to us. I have little knowledge of genomics, but I am willing to learn whatever is required to accomplish this project.
Thanks a lot for your response. I am going through the application. Below are a few more questions on this topic. The volume of data will also grow every year: we are looking at processing 10,000+ exome samples, and the resulting VCFs will be handed to us by the clients through upload portals. Our work starts from there.
1. We are looking for a good-quality, reusable VCF parser that is already available.
2. My struggle is choosing the right tech stack to handle transactions at this scale (a sketch of the indexing I have in mind follows below).
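On point 2, this is roughly the indexing I imagine would be needed to keep per-patient and cross-patient variant queries fast at that scale, assuming the MongoDB document layout sketched above; the actual indexes would have to follow the real query patterns once we know them.

```python
# Illustrative compound indexes for the hypothetical "variants" collection above;
# the real indexes should follow the actual query patterns once they are known.
from pymongo import ASCENDING, MongoClient

variants = MongoClient("mongodb://localhost:27017")["genomics"]["variants"]

# Per-patient queries: "all variants for patient X, optionally within a region".
variants.create_index([("patient_id", ASCENDING), ("chrom", ASCENDING), ("pos", ASCENDING)])

# Cross-patient queries: "which patients carry a variant at this locus".
variants.create_index([("chrom", ASCENDING), ("pos", ASCENDING), ("alt", ASCENDING)])

# Gene-level queries, assuming annotation adds a gene symbol to each document.
variants.create_index([("annotations.gene", ASCENDING)])
```

At 10,000+ exomes (tens of thousands of variants each, so hundreds of millions of documents) I assume we would also need to look at sharding, probably on patient_id or chromosome.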
Just to add to the edit I made to the answer: the VCFs don't have to be processed straight away. You could set up a scheduler to process the files later, and, probably just as importantly, re-annotate the variants on an ongoing basis (keeping the original version).
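As a minimal sketch of what that could look like, assuming APScheduler for the scheduling and the MongoDB layout from the question: `ingest_vcf()` refers to the ingest sketch above, and `annotate()` is a placeholder for whatever annotation tooling you choose (VEP, SnpEff, etc.).

```python
# Minimal sketch of deferred processing plus periodic re-annotation. APScheduler is
# an assumption; ingest_vcf() comes from the ingest sketch above and annotate() is a
# placeholder for whatever annotation tool is chosen (VEP, SnpEff, ...).
from datetime import datetime, timezone
from apscheduler.schedulers.blocking import BlockingScheduler
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["genomics"]

def process_pending_uploads():
    """Pick up uploads recorded by the portal and ingest them; the raw VCF file is kept as-is."""
    for upload in db.uploads.find({"status": "pending"}):
        ingest_vcf(upload["path"], upload["patient_id"])  # hypothetical helper from the ingest sketch
        db.uploads.update_one({"_id": upload["_id"]}, {"$set": {"status": "processed"}})

def reannotate_variants():
    """Refresh annotations against current sources, keeping dated copies of earlier versions."""
    version = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    for variant in db.variants.find({}):
        new_ann = annotate(variant)  # placeholder annotation call
        db.variants.update_one(
            {"_id": variant["_id"]},
            # keep the latest annotations queryable, archive them under a dated key,
            # and never touch the fields parsed from the original VCF
            {"$set": {"annotations": new_ann, f"annotation_history.{version}": new_ann}},
        )

scheduler = BlockingScheduler()
scheduler.add_job(process_pending_uploads, "interval", minutes=15)
scheduler.add_job(reannotate_variants, "cron", day=1)  # e.g. monthly re-annotation
scheduler.start()
```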