Entering edit mode
3.7 years ago
Debut
▴
20
Hello I am looking to build a database to store bacterial genomes, I don't know if I should do it with a relational database management system (SQL) or a non-relational database management system (like NoSqL). Could someone please guide me?
it depends what kind of data you want to be stored
Thank you for your answer. genomic sequences and protein sequences organized by bacterial species
What kind of information you are planning to store?
Thank you for your answer. genomic sequences and protein sequences organized by bacterial species
without any other requirement, you only need one table.
what kind of language is it please
I'm sorry - how are you exploring SQL/NoSQL options when a basic SQL statement is foreign to you? How do you intend on understanding architectural pros and cons of using one approach versus the other, and how would you implement the one you pick?
It's a SQL statement that will create a table named
myDatabase
with three columns taxon, seqtype and seq.What exactly do you want to do? Be able to query the sequences/find specific ID's?
These sequence files generally are available as a single gene or protein multi-fasta file per genome and they could simply be put in species specific folders to segregate them if you just want to keep them organized.
I would like to make a database that stores genomic sequences and protein sequences of several species of bacteria
I want to make a graphical interface, when a person presses on the name of the species in the dropdown list and then on a button, this person will be able to retrieve directly the fasta files with the genomic sequences. or he will be able to choose also to retrieve the protein sequences.
This obviously does not require a database if you are going to expect the person to get the full sequence file.
What you are describing is a web application (that can have a database behind it). You could simply have deep links to the relevant files to NCBI sites (logical) or you could download the files locally (not trivial would need space).
NCBI already makes such a genome browser available here.
the goal is to have the sequences already uploaded on the server and to retrieve them directly.
So, download from NCBI, upload to your local server and re-download them on demand? Why?
no just download them locally and put them in a database and then make the GUI
Will the GUI allow the user to "retrieve" the sequence? If so, where from? In a broader context, why are you building this GUI application. Do bacterial genome browsers not exist?
when we choose the species, behind it we get the sequences downloaded for this species
for example for klebsiella pneumoniae he asks me to have asks to have the 10 703 sequences in "https://www.ncbi.nlm.nih.gov/genome/browse/#!/prokaryotes/815/" at the level of "Genome Assembly and Annotation report (10703)".
Is this an exercise or assignment? If not I don't get the point of replicating GenBank.
It's for an internship. They want to make a graphical interface for biologists that is easier than on NCBI
Do you have to build from scratch? There are existing solutions like Tripal that could be used.
For reasons of learning and didactics, I recommend sticking with (simple) SQL, it has many applications, is standardized, has a solid mathematical foundation, and you learn some transferable concepts for life. With noSQL, which is not really the "opposite of SQL" you can only use that exact same noSQL system you trained on.
I'd try SQLite for starters, that's good enough to study some basic concepts of SQL without complex client server architecture. Then develop a simplistic data model, and after that a simplistic web interface as a prototype. Discuss your solution with your stakeholders often and be agile in incorporating changes.
Expect a somewhat steep learning curve, but nothing (and in particular not noSQL solutions) can spare you this.
I made this database in SQL. What do you think?
create table Genome ( ID interger NOT NULL AUTO_INCREMENT, name varchar (300) NOT NULL, sequence text, numero_accesion text);
create table gene ( ID integer NOT NULL AUTO_INCREMENT, name text, sequence text FOREIGN KEY (ID_genome) REFERENCES genome (ID));
create table proteine ( ID integer NOT NULL AUTO_INCREMENT, name text, sequence text FOREIGN KEY (ID_gene) REFERENCES gene (ID));
create table Espece( ID integer NOT NULL AUTO_INCREMENT PRIMARY KEY, name_latin text, FOREIGN KEY (ID_Genome) REFERENCES genome (ID));
how could I implement it with several fasta files please? thanks
Google "bulk upload to SQL". We cannot walk you through your internship.
Might work for starters. I would:
species
fromgenome
not the other way aroundYou can import csv formatted data into SQLite tables directly if the csv file has the same format as the table, you need to re-format fasta files using a script or you have to write your own importer using a SQLite database backend library.
After that it’s time to test your database and find out if and how you can make the queries you need.