I am working on a prototype for the abstract requirements below:
- Clients upload *.vcf files into our portal
- The system will process each VCF, parse the data, and load it into a database (yet to be decided)
- Clients will use APIs to request filtered variant data across patients or for a given patient (a rough endpoint sketch follows this list)
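To make the last point concrete, here is a rough sketch of what the query API might look like. It assumes a FastAPI service backed by MongoDB via pymongo; that stack is only my working assumption, and the endpoint path, field names, and the "variants" collection are placeholders rather than a committed design.

```python
# Hypothetical variant-query endpoint: FastAPI + pymongo are assumptions, not a
# committed stack, and the field/collection names are illustrative only.
from fastapi import FastAPI, Query
from pymongo import MongoClient

app = FastAPI()
variants = MongoClient("mongodb://localhost:27017")["genomics"]["variants"]

@app.get("/variants")
def get_variants(
    patient_id: str | None = None,   # omit to search across all patients
    gene: str | None = None,         # e.g. filter by annotated gene symbol
    chrom: str | None = None,
    min_qual: float = 0.0,
    limit: int = Query(100, le=1000),
):
    """Return variant documents matching the given filters."""
    query: dict = {"qual": {"$gte": min_qual}}
    if patient_id:
        query["patient_id"] = patient_id
    if gene:
        query["annotations.gene"] = gene
    if chrom:
        query["chrom"] = chrom
    return list(variants.find(query, {"_id": 0}).limit(limit))
```

A request like `GET /variants?gene=BRCA1&min_qual=30` would then cover the cross-patient case, and adding `patient_id` would narrow it to a single patient.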
We are evaluating best practices for storing genomic data and how current vendors handle it. I am considering MongoDB, since there will be a lot of querying on the variants.
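This is the kind of ingest step I have in mind, as a sketch only. It assumes pysam for parsing and pymongo for storage, with one document per variant record per sample; the "genomics.variants" collection and the field names are illustrative.

```python
# Sketch of a VCF-to-MongoDB ingest step. pysam and pymongo are assumptions;
# the "genomics.variants" collection and field names are illustrative only.
import pysam
from pymongo import MongoClient

def ingest_vcf(vcf_path: str, patient_id: str) -> int:
    """Load each variant record of a VCF into one MongoDB document."""
    collection = MongoClient("mongodb://localhost:27017")["genomics"]["variants"]
    docs = []
    with pysam.VariantFile(vcf_path) as vcf:
        for rec in vcf:
            docs.append({
                "patient_id": patient_id,
                "chrom": rec.chrom,
                "pos": rec.pos,
                "ref": rec.ref,
                "alt": list(rec.alts or []),
                "qual": rec.qual,
                "filter": list(rec.filter.keys()),
                # INFO values can be tuples, so normalize them to lists for BSON
                "info": {k: list(v) if isinstance(v, tuple) else v
                         for k, v in rec.info.items()},
            })
    if docs:
        collection.insert_many(docs)
    return len(docs)
```

One document per variant per sample keeps the per-patient queries simple; the other common layout is one document per site with genotypes embedded, which makes cross-patient queries cheaper at the cost of more complex updates.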
I did some research on exome sequencing, GATK, and VCF file formats, but before we start I wanted to get expert advice. I am also not sure how complex the clients' queries will become, or how useful this would be to research in practice. These are the two questions I am trying to gather more information on before we start the prototype. Any help or suggestions would be of great value to us. I have little knowledge of genomics, but I am willing to learn whatever is required to accomplish this project.
Thanks a lot for your response. I am going through the application. Below are a few more questions on this topic. The volume of data will also grow every year: we are looking at processing 10,000+ exome samples, and the resulting VCFs will be handed to us by the clients through upload portals. Our work starts from there.
1. We are looking for a good-quality, reusable VCF parser that is already available.
2. My struggle is choosing the right tech stack to handle transactions at this scale (a sketch of the indexing I have in mind follows below).
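On point 2, this is roughly the indexing I imagine would be needed to keep per-patient and cross-patient variant queries fast at that scale, assuming the MongoDB document layout sketched above; the actual indexes would have to follow the real query patterns once we know them.

```python
# Illustrative compound indexes for the hypothetical "variants" collection above;
# the real indexes should follow the actual query patterns once they are known.
from pymongo import ASCENDING, MongoClient

variants = MongoClient("mongodb://localhost:27017")["genomics"]["variants"]

# Per-patient queries: "all variants for patient X, optionally within a region".
variants.create_index([("patient_id", ASCENDING), ("chrom", ASCENDING), ("pos", ASCENDING)])

# Cross-patient queries: "which patients carry a variant at this locus".
variants.create_index([("chrom", ASCENDING), ("pos", ASCENDING), ("alt", ASCENDING)])

# Gene-level queries, assuming annotation adds a gene symbol to each document.
variants.create_index([("annotations.gene", ASCENDING)])
```

At 10,000+ exomes (tens of thousands of variants each, so hundreds of millions of documents) I assume we would also need to look at sharding, probably on patient_id or chromosome.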
Just to add to the edit I made to the answer: the VCFs don't have to be processed straight away. You could set up a scheduler to process the files later, and, probably just as importantly, re-annotate the variants on an ongoing basis (keeping the original version).
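As a minimal sketch of what that could look like, assuming APScheduler for the scheduling and the MongoDB layout from the question: `ingest_vcf()` refers to the ingest sketch above, and `annotate()` is a placeholder for whatever annotation tooling you choose (VEP, SnpEff, etc.).

```python
# Minimal sketch of deferred processing plus periodic re-annotation. APScheduler is
# an assumption; ingest_vcf() comes from the ingest sketch above and annotate() is a
# placeholder for whatever annotation tool is chosen (VEP, SnpEff, ...).
from datetime import datetime, timezone
from apscheduler.schedulers.blocking import BlockingScheduler
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["genomics"]

def process_pending_uploads():
    """Pick up uploads recorded by the portal and ingest them; the raw VCF file is kept as-is."""
    for upload in db.uploads.find({"status": "pending"}):
        ingest_vcf(upload["path"], upload["patient_id"])  # hypothetical helper from the ingest sketch
        db.uploads.update_one({"_id": upload["_id"]}, {"$set": {"status": "processed"}})

def reannotate_variants():
    """Refresh annotations against current sources, keeping dated copies of earlier versions."""
    version = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    for variant in db.variants.find({}):
        new_ann = annotate(variant)  # placeholder annotation call
        db.variants.update_one(
            {"_id": variant["_id"]},
            # keep the latest annotations queryable, archive them under a dated key,
            # and never touch the fields parsed from the original VCF
            {"$set": {"annotations": new_ann, f"annotation_history.{version}": new_ann}},
        )

scheduler = BlockingScheduler()
scheduler.add_job(process_pending_uploads, "interval", minutes=15)
scheduler.add_job(reannotate_variants, "cron", day=1)  # e.g. monthly re-annotation
scheduler.start()
```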