Sequence Data Management
7.2 years ago
navela78 ▴ 70

Hello, I am working on a project to store and manage genome sequence data, specifically microbial genomes. The tool should store several thousand genomes and needs to find and extract sequences (contigs, genes, proteins) quickly. What is the best way to implement this?

I saw this post but it is 5 years old: Describe Your Architecture: Uniprot

My specific questions are: 1) Which database is suitable for storing sequence metadata (e.g. name, length, species)? It looks like UniProt used Berkeley DB as their main database and indexed some of the data in Lucene for searching. Is that correct? 2) Where should I store the sequences themselves so that I can retrieve them quickly?

It looks like UniProt indexes the metadata in Lucene, but I am not sure how they store the sequences. Also, do they use ...

Any help will be greatly appreciated.

genome

1) Which database is suitable to store sequence metadata (e.g. name, length, species, etc.)

It depends on the amount of metadata, the number of sequences, the complexity ...

CREATE TABLE species(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sequence(id INTEGER PRIMARY KEY, name TEXT, bases TEXT, species_id INTEGER REFERENCES species(id));

would work...


Thanks for your response Pierre.

The amount of metadata is large ... on the order of hundreds of millions of records. The complexity will be low: simple fields such as name, length, start coordinate, end coordinate, and a unique ID. The only complex item in the metadata may be the taxonomic lineage of each sequence, e.g. (Streptococcus -> Streptococcaceae -> ... -> Firmicutes).

Which database can handle data that large? When you say bases, are you suggesting that I store the sequence bases in the database itself (as opposed to a flat file)? Database systems are usually not tuned to store such large sequences, correct? Imagine a bacterial chromosome of ~5 Mb ... is it good practice to store that in a database?


Please do not create an answer when replying to a comment or answer; it makes the question appear to be answered. Use the "add reply" button instead.


Three points to consider:
1- MySQL/MariaDB and PostgreSQL can both handle tables with hundreds of millions of rows; I have a MySQL table with ~400 million rows. To get good performance you may need to tune the configuration, and it may also help to split the data according to your access pattern(s). The key to making a database useful is to design and index it properly.
2- You can keep the sequences in files and store only the paths to those files in the database (see the sketch below). For many applications this is more convenient, but it depends on what the downstream applications are.
3- If this is going to be a resource used regularly, you should consider building an API on top of it.
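To make point 2 concrete, here is a minimal sketch of that layout (metadata rows in a relational database, sequences left in indexed FASTA files on disk), assuming SQLite via Python's sqlite3 module and pysam for faidx-indexed FASTA access. All table, column and file names are placeholders, not a fixed design.

    # Sketch: metadata in SQLite, sequences in samtools-faidx-indexed FASTA files on disk.
    # Table/column/file names below are illustrative only.
    import sqlite3
    import pysam

    conn = sqlite3.connect("genomes.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS contig (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,      -- sequence name inside the FASTA file
            length     INTEGER,
            species    TEXT,
            fasta_path TEXT NOT NULL       -- where the sequence actually lives
        )""")
    # Point 1: index the columns you will actually query on.
    conn.execute("CREATE INDEX IF NOT EXISTS contig_name_idx ON contig(name)")
    conn.commit()

    def fetch_sequence(contig_name, start=None, end=None):
        """Look up the FASTA path in the database, then read the sequence (or a slice) from the file."""
        row = conn.execute(
            "SELECT name, fasta_path FROM contig WHERE name = ?", (contig_name,)
        ).fetchone()
        if row is None:
            raise KeyError(contig_name)
        name, path = row
        fa = pysam.FastaFile(path)          # needs a .fai index (samtools faidx)
        try:
            return fa.fetch(reference=name, start=start, end=end)
        finally:
            fa.close()

With the .fai index in place, pulling an arbitrary slice out of a ~5 Mb chromosome is just a couple of seeks into the file, so the database never has to hold the bases themselves.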


While it's not a database, I've had success using HDF5 for various projects.
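For what it's worth, here is a minimal sketch of one way sequences could be laid out in HDF5 using h5py (one group per genome, one byte dataset per contig); the accession and contig names are made up for illustration.

    # Sketch: one HDF5 group per genome, one uint8 dataset per contig (illustrative layout).
    import h5py
    import numpy as np

    with h5py.File("sequences.h5", "w") as f:
        genome = f.create_group("GCF_000005845")     # hypothetical genome accession
        seq = "ACGTACGTACGT"                         # stand-in for a real contig
        genome.create_dataset("contig_1",
                              data=np.frombuffer(seq.encode("ascii"), dtype=np.uint8),
                              compression="gzip")

    # Random access: read a slice of a contig without loading the whole genome.
    with h5py.File("sequences.h5", "r") as f:
        chunk = f["GCF_000005845/contig_1"][2:8]
        print(chunk.tobytes().decode("ascii"))       # prints "GTACGT"

Because HDF5 datasets are chunked and can be compressed, slicing a region out of a multi-megabase chromosome stays fast without loading the whole sequence into memory.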

