Question

Webframework For Biological Data

5

14.1 years ago

Snowflake ▴ 50

Hello!

We are doing RNAi-based screenings. There, we generate a lot of data of different types for GenXY (many microscopy images, many continuously growing datasets from different experiments...). In addition, there is a lot of sequence, expression, and functional data from other sources, which needs to be linked to our tested genes. At the moment, all these data are stored in hundreds of Excel files. Now we would like to create a database out of the files and display all information in a proper way. Since I know Python very well, I thought about using Django. Do you think that Django is suitable for this? Has anybody used Django successfully for such a purpose? We would also like to include some applications, such as BLAST, ClustalW...

Any tips are very appreciated! Stefanie

Edit:

Thanks for all the answers! I'll try to make things more clear:

We cloned about 1000 RNAi constructs. Each of these constructs has a unique ID. Already at this step there is a lot of information (primer sequence, RNAi sequence, FL sequence, contig memberships, BLASTX...). After this, the constructs are tested in plants by particle bombardment. Each experiment consists of about 16 constructs plus some controls. For each of them, we get a value plus many images. If a construct shows some effect, it will be repeated at least 4 more times and we'll get at least 4 more values (the final number of values is unknown). Once we have at least 5 values, we calculate some statistics. If the result is significant, we make transgenic plants. Of course, we want to add as much information as possible to each construct (NCBI, proteomics, whatever; this is always growing).

We have ALL this information in different Excel files and we would like to show it in a web interface. I have already started to combine all this information into a few Excel files. I agree, this is the most laborious part. The advantage is that I'm the technician who produced the data, so I know very well what is what. I created 2 Excel files, one with all the experimental data per unique ID and one with all other information per unique ID. I think this is a good basis for loading it into a database.
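To make the idea of "one record per unique ID, many values and images per record" concrete, here is a minimal sketch of a relational layout using Python's built-in `sqlite3` module. All table names, column names, and the helper functions are assumptions for illustration only; the real columns would come from the Excel files. The threshold of 5 values mirrors the rule described above.

```python
import sqlite3

# Hypothetical schema: every name below is an illustrative assumption.
SCHEMA = """
CREATE TABLE construct (
    construct_id    TEXT PRIMARY KEY,   -- the unique ID of an RNAi construct
    primer_sequence TEXT,
    rnai_sequence   TEXT
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    performed_on  TEXT                  -- ISO date of the bombardment
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    construct_id   TEXT NOT NULL REFERENCES construct(construct_id),
    experiment_id  INTEGER NOT NULL REFERENCES experiment(experiment_id),
    value          REAL NOT NULL
);
CREATE TABLE image (
    image_id       INTEGER PRIMARY KEY,
    measurement_id INTEGER NOT NULL REFERENCES measurement(measurement_id),
    path           TEXT NOT NULL        -- many images per measurement
);
"""


def create_database(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses
    conn.executescript(SCHEMA)
    return conn


def constructs_ready_for_statistics(conn, minimum=5):
    """Return (construct_id, number_of_values) for constructs with enough values."""
    rows = conn.execute(
        "SELECT construct_id, COUNT(*) FROM measurement "
        "GROUP BY construct_id HAVING COUNT(*) >= ? "
        "ORDER BY construct_id",
        (minimum,),
    )
    return [(construct_id, count) for construct_id, count in rows]
```

In Django, each of these tables would correspond to one model class, with the `REFERENCES` columns expressed as foreign key fields. Because the number of values per construct is open-ended, the values live in their own table rather than in fixed columns.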

What I would like to have: Paste anything (the unique ID, all values<50%, an image link, a function, a TCA...) and get ALL linked information. For instance, an overview of the corresponding unique ID(s), plus some tabs which contain all the other information (a tab for experimental data, a tab for sequence information, a tab for images and so on). Maybe something like this: http://www.gabipd.org/database/cgi-bin/GreenCards.pl.cgi?BioObjectId=664184&Mode=ShowBioObject&QueryKey=fb2a0d954e12a90b7810b5d445957eb0&Start=1&Rows=50
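A rough sketch of how such a "paste anything" lookup could work on top of a relational store, again using Python's built-in `sqlite3`. The two tables, their columns, the demo rows, and all function names are made up for illustration; a real search would cover more fields and would need to escape `%` and `_` in the pasted term.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with invented example rows."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allow access to columns by name
    conn.executescript("""
        CREATE TABLE construct (construct_id TEXT PRIMARY KEY, function TEXT);
        CREATE TABLE measurement (construct_id TEXT, value REAL, image_link TEXT);
        INSERT INTO construct VALUES ('RNAi-0001', 'putative kinase');
        INSERT INTO construct VALUES ('RNAi-0002', 'unknown');
        INSERT INTO measurement VALUES ('RNAi-0001', 42.0, 'img/0001_a.png');
        INSERT INTO measurement VALUES ('RNAi-0001', 61.5, 'img/0001_b.png');
        INSERT INTO measurement VALUES ('RNAi-0002', 95.0, 'img/0002_a.png');
    """)
    return conn


def find_construct_ids(conn, term):
    """Return unique IDs whose ID, function, or image link matches the term."""
    like = f"%{term}%"
    rows = conn.execute(
        """
        SELECT DISTINCT c.construct_id AS construct_id
        FROM construct AS c
        LEFT JOIN measurement AS m ON m.construct_id = c.construct_id
        WHERE c.construct_id = ? OR c.function LIKE ? OR m.image_link LIKE ?
        ORDER BY c.construct_id
        """,
        (term, like, like),
    )
    return [row["construct_id"] for row in rows]


def find_ids_with_value_below(conn, threshold):
    """Return unique IDs that have at least one value below the threshold."""
    rows = conn.execute(
        "SELECT DISTINCT construct_id FROM measurement "
        "WHERE value < ? ORDER BY construct_id",
        (threshold,),
    )
    return [row["construct_id"] for row in rows]


def linked_information(conn, construct_id):
    """Collect everything for one ID, grouped the way the tabs would be."""
    overview = conn.execute(
        "SELECT * FROM construct WHERE construct_id = ?", (construct_id,)
    ).fetchone()
    measurements = conn.execute(
        "SELECT value, image_link FROM measurement "
        "WHERE construct_id = ? ORDER BY rowid",
        (construct_id,),
    ).fetchall()
    return {
        "overview": dict(overview) if overview else None,
        "experimental_data": [m["value"] for m in measurements],
        "images": [m["image_link"] for m in measurements],
    }
```

The point of the sketch is that every kind of query first resolves to a list of unique IDs, and the page for each ID is then assembled from the linked tables, one dictionary key per tab.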

In addition, it would be great if the data could be easily edited or commented on.

I hope I made things more clear. Thanks again! Stefanie

web database python • 6.0k views

updated 14.1 years ago by Chris Evelo 10k • written 14.1 years ago by Snowflake ▴ 50

4

Am I correct in understanding that you want to build an application that includes a data management system and a web-based front end for displaying results and allowing others to interact with the data?

written 14.1 years ago by Mndoci ★ 1.2k

2

See also this somewhat-similar question: What Programming Language Is Best To Learn For Getting Into Web-Based Bioinformatics?

updated 5.6 years ago by Ram 45k • written 14.1 years ago by Neilfws 49k

score 9 · Answer 1 · 2011-03-27

I waited a bit since people might come up with an answer that would fit your purpose entirely and that would not require developing new code. But since that so far is not the case...

What you want to do is very close to the purpose of our open source systems biology database dbNP. You can find the development information for dbNP here. Note that this really is a large development project, already involving many communities. It will not fit your purpose right away. But you could add your own modules and benefit from the general work that others have done and are still doing.

dbNP was initiated as a nutritional phenotype database, really a systems biology database for nutrigenomics research, but since the approach is generic it is useful in other fields as well.

The concept for dbNP was published in Genes and Nutrition, and we recently published a second paper about the (intended) query approaches on a biological level. The approach was recently featured in Lucas Laursen's perspective in the Nature Supplement about Nutrigenomics.

dbNP is designed as a modular data structure which will contain data processing pipelines for many different types of data, both from genomics and from non-genomics approaches. The word database thus is not very accurate.

One central module captures the study design. A first version of this generic study capture framework (GSCF) is already available. You can also access a demo. The study capture module captures the design of the study, the samples collected and the assays performed, and links to the actual experimental data. It follows the ISA-Tab philosophy, and we are in fact working with Susanna Sansone's group in Oxford to implement ISA-Tab input and output itself. This will also allow you to upload complex multi-omics studies to the EBI data repositories in one ISA archive. The study capturing uses NCBO ontologies from the BioPortal to get unique descriptors for many things. GSCF was developed by the Nutrigenomics Organisation (NuGO) and the Netherlands Bioinformatics Center (NBIC). GSCF was tested with data from the nutrigenomics field and (by members of the Netherlands Toxicogenomic Centre) for toxicogenomics experiments. This showed the need to support more complex study descriptions, which is the main reason ISA-Tab has not been implemented already. We need to carefully extend the study description standard itself. On this and some other aspects the dbNP team now collaborates with the SysMO-DB team, which works on a comparable initiative for data relating to microorganisms.

The other central module that is in an advanced state of development is the so-called "simple assay module" (SAM). This module captures descriptive data like length, weight, and gender, but also clinical chemistry data etc. Of course this module needed to be user-extensible itself.

Currently the most advanced genomics module is the one for metabolomics data, which is being developed collaboratively by the Netherlands Metabolomics Center and the EU program for micronutrient recommendations (EURRECA).

Still, the approach used can probably be best understood using the microarray module design as an example. Conceptually, each of the genomics modules contains four data levels plus the pipelines to get from one to the other. For Affymetrix microarrays the first level simply consists of the raw data (.cel) files. These are then quality controlled, filtered and normalized to yield clean data. We develop the processing pipeline for this as a separate open web portal that is available at arrayanalysis.org. In the next step the data is statistically evaluated using both standardized statistical approaches and the ones selected by the original research team (the first is done to make data as comparable between studies as possible, an approach copied from ArrayExpress-Atlas). This evaluation follows the study design, but the idea is that you can also recalculate, comparing groups other than those originally intended. Finally, the fourth data level will store "biological profiles", the outcome of pathway and gene ontology analyses, and thus in principle allows you to answer questions like "what studies found the same type of larger overall biological effect as I did". In principle that could mean that you did a transcriptomics study of, say, high and low fat diets in liver tissue in mice, and you could find for instance a proteomics result in the brain of Alzheimer patients that would show the same overall profile (I made up that example, I am not saying that it is biologically real).

Other work that has already been done relates to ChIP and DNA methylation array data, where a lot of R code was already developed. The Genetics module is more in a conceptual state. The latter is US-led, with participation of people from FDA (division of personal medicine) and UC Davis. I think Larry Parnell has an advisory role in that part.

The core project is developed in Grails, but many of the pipelines are in R and some of the query tools are in Java. For new modules, like an RNAi module, you could probably use Python just as well. But I would discuss that with the core developer team before starting on it.

score 7 · Answer 2 · 2011-03-27

7

14.1 years ago

Istvan Albert 102k

An important aspect to keep in mind is that databases work best for reasonably simple and well-defined data types.

The more diverse the data types (the experiments that you want to describe), the more difficult it becomes to model them in a way that allows you to cross-reference and query them in a unified manner. In fact, it could even happen that integrating a previously unknown data type would require a major redesign of all the existing schemas.

This is not Django-specific but more a characteristic of relational object models. So I think your biggest challenge is less about Python or Django and more about modeling your data so that it fits the relational database paradigm.
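To illustrate the trade-off, one common workaround for open-ended, ever-growing annotation types is a generic "entity, attribute, value" table. This is not something proposed in this answer; it is a swapped-in technique shown only as a sketch, and the table, columns, and function names are assumptions.

```python
import sqlite3


def create_annotation_store():
    """Create a single generic table that can hold any kind of annotation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE annotation (
            construct_id TEXT NOT NULL,  -- which construct is described
            source       TEXT NOT NULL,  -- where the information came from
            attribute    TEXT NOT NULL,  -- what is described
            value        TEXT NOT NULL   -- stored as text, whatever the type
        )
        """
    )
    return conn


def add_annotation(conn, construct_id, source, attribute, value):
    """Store one annotation; a new data type needs no schema change."""
    conn.execute(
        "INSERT INTO annotation VALUES (?, ?, ?, ?)",
        (construct_id, source, attribute, str(value)),
    )


def annotations_for(conn, construct_id):
    """Return all (source, attribute, value) rows for one construct."""
    rows = conn.execute(
        "SELECT source, attribute, value FROM annotation "
        "WHERE construct_id = ? ORDER BY source, attribute",
        (construct_id,),
    )
    return rows.fetchall()
```

A new kind of annotation can be added without touching the schema, but every value becomes untyped text, so numeric comparisons, constraints, and cross-referencing get harder. That is the same tension described above, just moved from the schema into the queries.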

written 14.1 years ago by Istvan Albert 102k

1

Any chance that you have a link or reference for where you read this?

written 14.1 years ago by biobot 0.0.77.a.1099 6.2k

1

@Keith James: Sure... :-) ...BOOK: Beautiful Data; CHAPTER: Life in Data: The Story of DNA; BY: Matt Wood, Ben Blackburne http://oreilly.com/catalog/9780596157128