Hello, I am working on a project to store and manage genome sequence data, specifically microbial genomes. The tool should store several thousand genomes and needs to find and extract sequences (contigs, genes, proteins) quickly. What is the best way to implement this?
I saw this post but it is 5 years old: Describe Your Architecture: Uniprot
My specific questions are: 1) Which database is suitable for storing sequence metadata (e.g. name, length, species)? It looks like UniProt used Berkeley DB as their main database and indexed some data in Lucene for searching. Is that correct? 2) Where should I store the sequence itself so that I can retrieve it quickly?
It looks like UniProt indexes the metadata in Lucene, but I am not sure how they store the sequences. Also do they use
Any help will be greatly appreciated.
It depends on the amount of metadata, the number of sequences, the complexity ...
would work...
Thanks for your response Pierre.
The number of metadata records is large, on the order of hundreds of millions. The complexity will be low: simple fields like name, length, start coordinate, end coordinate, and unique ID. The only complex item in the metadata may be the taxonomic lineage of the sequence, e.g. (Streptococcus -> Firmicutes -> Streptococcaceae -> ...)
Which database can handle this much data? When you say bases, are you suggesting that I store the sequence bases in the database itself (as opposed to a flat file)? Database systems are usually not tuned to store such large sequences, correct? Imagine a bacterial chromosome of ~5 MB in length: is it good practice to store it in a database?
Do not create an answer when replying to a comment or answer. This makes the questions appear as answered. Use the "add reply" button instead.
Three points to consider:
1- MySQL/MariaDB and PostgreSQL can both handle tables with hundreds of millions of rows; I have a MySQL table with ~400 million rows. To get good performance you may need to tune the configuration, and it may also help to split the data according to your access pattern(s). The key to making a database useful is to design and index it properly.
2- You can keep the sequences in files and store only the paths to those files in the database. For many applications this is more convenient, but it depends on what the downstream applications are.
3- If this is going to be a resource used regularly, you should consider building an API.
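To make points 1 and 2 concrete, here is a minimal sketch of the "metadata in a database, sequences in a file" layout. It uses SQLite from the Python standard library only so that it runs without a server; the same idea carries over to MySQL/MariaDB or PostgreSQL. The table name `sequence`, the columns, the functions `build_store` and `fetch`, and the example records are all assumptions made up for illustration, not something from this thread.

```python
import os
import sqlite3
import tempfile


def build_store(db_path, seq_path, records):
    """Append sequences to a flat file and index their metadata in SQLite.

    records: iterable of (seq_id, name, species, bases) tuples (hypothetical layout).
    """
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS sequence (
               seq_id      TEXT PRIMARY KEY,
               name        TEXT NOT NULL,
               species     TEXT NOT NULL,
               length      INTEGER NOT NULL,
               path        TEXT NOT NULL,
               byte_offset INTEGER NOT NULL
           )"""
    )
    # Index the column(s) you expect to filter on.
    con.execute("CREATE INDEX IF NOT EXISTS idx_species ON sequence(species)")
    with open(seq_path, "ab") as fh:
        for seq_id, name, species, bases in records:
            fh.seek(0, os.SEEK_END)
            offset = fh.tell()  # where this sequence starts in the file
            data = bases.encode("ascii")
            fh.write(data)
            con.execute(
                "INSERT INTO sequence VALUES (?, ?, ?, ?, ?, ?)",
                (seq_id, name, species, len(data), seq_path, offset),
            )
    con.commit()
    return con


def fetch(con, seq_id, start=0, end=None):
    """Return bases [start, end) of a sequence without reading the whole file."""
    row = con.execute(
        "SELECT path, byte_offset, length FROM sequence WHERE seq_id = ?",
        (seq_id,),
    ).fetchone()
    if row is None:
        raise KeyError(seq_id)
    path, offset, length = row
    start = max(0, start)
    end = length if end is None else min(end, length)
    if start >= end:
        return ""
    with open(path, "rb") as fh:
        fh.seek(offset + start)
        return fh.read(end - start).decode("ascii")


# Small demonstration with made-up records in a temporary directory.
workdir = tempfile.mkdtemp()
con = build_store(
    os.path.join(workdir, "meta.db"),
    os.path.join(workdir, "sequences.bin"),
    [
        ("c1", "contig_1", "Escherichia coli", "ACGTACGTAA"),
        ("c2", "contig_2", "Bacillus subtilis", "GGGCCCTTT"),
    ],
)
print(fetch(con, "c2", 3, 6))
```

Because only the offset and length live in the database, a 5 MB chromosome costs one small row, and extracting a gene is a single seek and read. For real data you would probably use an existing indexed format (for example a FASTA file with an index) rather than this ad hoc file.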
While it's not a database, I've had success using HDF5 for various projects.