Storing Variant data from VCF
5.7 years ago
mvkotekar ▴ 20

I am working on a prototype for the abstract requirements below:

  1. Clients upload *.vcf files to our portal
  2. The system processes each VCF and parses the data into some database (yet to be decided)
  3. Clients use APIs to request filtered variant data across patients or for a given patient

We are evaluating best practices for storing genomic data and how current vendors handle it. I am thinking of using MongoDB, as there will be a lot of querying on the variants.

I did some research on exome sequencing, GATK, and the VCF file format, but before we start I wanted to get expert advice. I am also not sure how complex the clients' queries will become, or how useful this would be to research. These are the two questions I am trying to answer before we start on the prototype. Any help or suggestions would be of great value to us. I have little knowledge of genomics, but I am willing to learn whatever is required to accomplish this project.
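As a concrete sketch of step 2 (parsing), here is a minimal, illustrative parse of a single VCF data line into a record ready for database insertion. The column layout follows the VCF 4.x spec; a production system should use a tested parser library rather than hand-rolled splitting.

```python
# Minimal sketch: parse one data line of a VCF into a dict.
# Fixed columns per the VCF 4.x spec: CHROM POS ID REF ALT QUAL FILTER INFO ...
def parse_vcf_line(line):
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),                               # 1-based position per the spec
        "id": None if vid == "." else vid,
        "ref": ref,
        "alt": alt.split(","),                         # ALT may list several alleles
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info,
    }

record = parse_vcf_line("1\t12345\trs99\tA\tG,T\t50\tPASS\tDP=100;AF=0.5")
```

This ignores the per-sample genotype columns entirely; a real pipeline would also need to handle those, plus the header lines.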

variant ngs techstack database
5.7 years ago

I have been using a PostgreSQL database for storing sample metadata and variant calls. This works well for datasets with tens of millions of rows and range intersections are blazing fast using Postgres's int4range type.

I'm starting to notice serious performance degradation (transaction times no longer scale linearly with the number of rows) on larger tables with billions of rows. I haven't been able to pinpoint the problem yet, but at this point I suspect insufficient system specs.
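To illustrate why int4range works well for this: Postgres ranges are half-open by default, and the `&&` overlap operator reduces to two integer comparisons, which a GiST index can accelerate. A toy Python version of the same overlap test (the table and column names in the SQL comment are hypothetical, not from the answer above):

```python
# Sketch of the range-overlap test that Postgres's int4range `&&` operator
# performs: int4range(a, b) is half-open [a, b), and two ranges overlap
# when each starts before the other ends.
def ranges_overlap(start1, stop1, start2, stop2):
    return start1 < stop2 and start2 < stop1

# Toy variant table: (chrom, start, stop) with half-open coordinates.
variants = [
    ("1", 100, 101),   # SNV at position 100
    ("1", 150, 175),   # 25 bp deletion
    ("2", 100, 101),
]

def query(chrom, start, stop):
    # In SQL this is roughly (hypothetical schema):
    #   SELECT * FROM variants
    #   WHERE chrom = %s AND span && int4range(%s, %s);
    return [v for v in variants
            if v[0] == chrom and ranges_overlap(v[1], v[2], start, stop)]
```

For example, `query("1", 90, 160)` returns both chromosome-1 variants, since each range starts before the query window ends and vice versa.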

5.7 years ago

gnomAD uses MongoDB: https://github.com/macarthur-lab/gnomad_browser

some (crazy) colleagues are trying to use an RDF database (it would be huge, but nice)

see also: Which Type Of Database Systems Are More Appropriate For Storing Information Extracted From Vcf Files; Do People Import Vcf Files Into Databases?; Vcf And Mongodb; ...

5.7 years ago
Garan ▴ 690

I developed something similar a long time ago and you're welcome to have a look https://github.com/pasted/clinical_variant_database

It can probably be improved a lot. Things to note:

  • Variants can be of any size, so you probably need start / stop fields for the genomic coordinates
  • The coordinates are specific to a given reference genome, so include that relationship in the tables
  • I used a hash store to hold non-relational data (instead of going the MongoDB route) and found it useful
  • You can integrate a lot of external resources via their APIs, caching the results locally for speed. If this is clinical data then OMIM / ClinVar results could be useful, or cancer-specific databases if that is the focus instead.
  • If this is patient data you will have to look at encryption, even if it is anonymized
  • Run and quality metrics are useful; a patient may have multiple analysis runs, and therefore more than one variant at the same position in the database
  • Remember user accounts and security levels for access to the data: clinical staff should not see details of patients assigned to other staff; however, you do want to be able to see summary stats for each variant
  • Integrate gnomAD / ExAC for population-level frequency (it differs by "ethnicity" and even within an "ethnicity"; best not to rely on self-reported ancestry, possibly cluster via 1000 Genomes etc.)
  • Pull in new data from large deep-phenotyping studies such as UK Biobank, if possible
  • Other resources include GTEx / COSMIC / GWAS Catalog

Good luck!

Best not to use my code directly, as it now has a number of security issues and needs a complete rewrite.
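To make the first two bullet points concrete, here is a hypothetical record shape (the names are illustrative, not taken from the linked repo) with explicit start/stop coordinates, an explicit reference-genome build, and a run identifier so the same position can legitimately appear once per analysis run:

```python
from dataclasses import dataclass

# Hypothetical record type reflecting the schema advice above. Field names
# are illustrative only.
@dataclass(frozen=True)
class VariantCall:
    genome_build: str   # e.g. "GRCh38"; coordinates are meaningless without it
    chrom: str
    start: int          # start coordinate
    stop: int           # end coordinate; start/stop (not a single POS) handles indels/SVs
    ref: str
    alt: str
    run_id: str         # which analysis run produced this call

call = VariantCall("GRCh38", "7", 117559592, 117559595, "ATCT", "A", "run-001")
```

In a relational schema, `genome_build` and `run_id` would be foreign keys into reference-genome and analysis-run tables rather than bare strings.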

edit: I used Sidekiq as the scheduler for the app (https://github.com/mperham/sidekiq), although there are a lot of other options depending on the platform / language used. Python has a number of VCF parsers, including PyVCF and vcf_parser; Clojure has Bioclojure; Java has htsjdk's VCFFileReader; Ruby has Bioruby and bioruby-vcf (although the GitHub build appears to be failing and I'm not sure if it is still maintained). For speed, maybe something like Rust (biorust) or C (htslib). There's also Julia (biojulia), Nim (htsnim), plus lots more.
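Even with a library parser, the INFO column usually needs custom handling before it can be queried. As a small illustration of its layout (semicolon-separated `key=value` pairs, with bare flags mapping to True per the VCF spec):

```python
# Small helper for the parsing step: split a VCF INFO column into a dict.
# Flag entries (no "=") map to True, per the VCF spec.
def parse_info(info):
    out = {}
    for entry in info.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            out[key] = value
        else:
            out[entry] = True
    return out

info = parse_info("DP=100;AF=0.5;DB")
```

Note the values stay as strings here; the header's `##INFO` lines declare the real types (Integer, Float, Flag, ...), which a fuller parser would use for conversion.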

I used the Rails framework due to previous experience, but Python has Django, and there are many other web frameworks available depending on what you feel comfortable with.


Thanks a lot for your response; I am going through the application. Below are a few more questions on this topic. The volume of data will also grow every year: we are looking at processing 10,000+ exome samples, whose resulting VCFs will be given to us by clients via upload portals etc. Our work starts from there. 1. We are looking for a good-quality, reusable VCF parser on the market. 2. My struggle is choosing the right tech stack to manage this kind of transaction volume.


Just to add to the edit I made to the answer: the processing of the VCFs doesn't have to happen straight away; you could set up a scheduler to process the files later. It is probably just as important to re-annotate the variants on an ongoing basis (keeping the original version).
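The deferred-processing idea can be sketched with a plain in-process queue: the upload handler only enqueues a file path and returns, while a worker drains the queue in the background. A real deployment would use a job system such as Sidekiq (Ruby) or Celery (Python) rather than a bare thread; this is only a minimal sketch of the pattern.

```python
import queue
import threading

jobs = queue.Queue()
processed = []

def worker():
    # Drain the queue until the shutdown sentinel (None) arrives.
    while True:
        path = jobs.get()
        if path is None:
            break
        processed.append(f"annotated:{path}")   # stand-in for real VCF processing
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
for p in ["sample1.vcf", "sample2.vcf"]:
    jobs.put(p)            # the upload handler returns immediately after enqueueing
jobs.put(None)             # signal shutdown
t.join()
```

The same shape supports the re-annotation point: a periodic job can re-enqueue already-stored variants against updated annotation sources while the originals are kept.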

