Question

Sequence Data Management

7.2 years ago

navela78 ▴ 70

Hello, I am working on a project to store and manage genome sequence data, specifically microbial genomes. The tool should store several thousand genomes and needs to find and extract sequences (contigs, genes, proteins) quickly. What is the best way to implement this?

I saw this post but it is 5 years old: Describe Your Architecture: Uniprot

My specific questions are: 1) Which database is suitable for storing sequence metadata (e.g. name, length, species, etc.)? It looks like UniProt used Berkeley DB as their main database and indexed some data in Lucene for searching. Is that correct? 2) Where should I store the sequence itself so that I can retrieve it quickly?

It looks like UniProt indexes the metadata in Lucene ... but I am not sure how they store sequences ... Also do they use

Any help will be greatly appreciated.

genome • 1.9k views

ADD COMMENT • link updated 7.0 years ago by Biostar 20 • written 7.2 years ago by navela78 ▴ 70

1) Which database is suitable to store sequence metadata (e.g. name, length, species, etc.)

It depends on the amount of metadata, the number of sequences, the complexity ...

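create table species(id integer primary key, name text);
create table sequence(id integer primary key, name text, bases text, species_id integer references species(id));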

would work...
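
A minimal sketch of this two-table layout, assuming SQLite through Python's built-in sqlite3 module. The table name `sequence`, the column types, the index, and the sample rows are assumptions made for illustration; they are not part of the original reply.

```python
import sqlite3

# Assumed schema: the table name "sequence" and the column types are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE sequence (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        bases      TEXT NOT NULL,
        species_id INTEGER NOT NULL REFERENCES species(id)
    );
    -- Index the foreign key used when looking up sequences by species.
    CREATE INDEX idx_sequence_species ON sequence(species_id);
""")

# Made-up example rows.
conn.execute("INSERT INTO species (id, name) VALUES (1, 'Streptococcus pneumoniae')")
conn.executemany(
    "INSERT INTO sequence (name, bases, species_id) VALUES (?, ?, ?)",
    [("contig_1", "ACGTACGT", 1), ("contig_2", "GGCC", 1)],
)

def sequences_for_species(conn, species_name):
    """Return (name, length) pairs for every sequence of one species."""
    rows = conn.execute(
        """
        SELECT s.name, LENGTH(s.bases)
        FROM sequence AS s
        JOIN species AS sp ON sp.id = s.species_id
        WHERE sp.name = ?
        ORDER BY s.name
        """,
        (species_name,),
    )
    return rows.fetchall()

result = sequences_for_species(conn, "Streptococcus pneumoniae")
```

Keeping species in its own table means each species name is stored once and referenced by ID from every sequence row.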

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

Thanks for your response Pierre.

The number of metadata records is large ... on the order of hundreds of millions. The complexity will be low. It will have simple things like name, length, start coordinate, end coordinate, unique ID. The only complex thing that may be in the metadata is the taxonomic lineage of that sequence ... e.g. (Streptococcus -> Firmicutes -> Streptococcaceae -> ...)

Which db can handle such large data? When you say bases, are you suggesting that I store the sequence bases in the database itself (as opposed to a flat file)? Usually database systems are not tuned to store such large sequences, correct? Imagine a bacterial chromosome of ~5MB length ... is it good practice to store it in a database?

ADD REPLY • link 7.2 years ago by navela78 ▴ 70

Do not create an answer when replying to a comment or answer. This makes the questions appear as answered. Use the "add reply" button instead.

ADD REPLY • link 7.2 years ago by Jean-Karim Heriche 27k

Three points to consider:
1- MySQL/MariaDB and PostgreSQL can both handle tables with millions of rows. I've a MySQL table with ~400 million rows. To get good performance you may need to tune the configuration and it may also be beneficial to split the data according to your access pattern(s). The key to making a database useful is to design and index it properly.
2- You can keep sequences in files and only store the paths to the files in the database. For many applications this is more convenient but this depends on what the downstream applications are.
3- If this is going to be a resource used regularly, you should consider building an API.
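
Points 2 and 3 can be sketched together: a small metadata table holds the file path for each sequence, and one function acts as a very small retrieval API. This is only a sketch under stated assumptions. SQLite stands in for MySQL/PostgreSQL, the sequences are assumed to live in FASTA files, and storing a byte offset and length alongside the path is an addition beyond what the reply describes. The names `seq_index`, `build_index`, and `fetch_sequence` are hypothetical.

```python
import os
import sqlite3
import tempfile

def build_index(conn, fasta_path):
    """Store, for each FASTA record, the file path and byte range of its sequence lines."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seq_index ("
        "name TEXT PRIMARY KEY, path TEXT, offset INTEGER, nbytes INTEGER)"
    )
    insert = "INSERT INTO seq_index VALUES (?, ?, ?, ?)"
    name, start, pos = None, 0, 0
    with open(fasta_path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                if name is not None:
                    conn.execute(insert, (name, fasta_path, start, pos - start))
                # Assumes every header carries an ID right after '>'.
                name = line[1:].split()[0].decode()
                start = pos + len(line)
            pos += len(line)
    if name is not None:
        conn.execute(insert, (name, fasta_path, start, pos - start))
    conn.commit()

def fetch_sequence(conn, name):
    """Look up the path and byte range in the database, then read only that slice of the file."""
    row = conn.execute(
        "SELECT path, offset, nbytes FROM seq_index WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    path, offset, nbytes = row
    with open(path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(nbytes)
    # Drop line breaks so the caller gets one contiguous string.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()

# Usage with a made-up two-record FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "genomes.fa")
    with open(fasta, "wb") as fh:
        fh.write(b">contig_1 example\nACGT\nTTGA\n>contig_2\nGGCC\n")
    conn = sqlite3.connect(":memory:")
    build_index(conn, fasta)
    seq1 = fetch_sequence(conn, "contig_1")
    seq2 = fetch_sequence(conn, "contig_2")
```

Because `name` is the primary key, the lookup is an indexed query, and the file read touches only the bytes of the requested record rather than scanning the whole genome file.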