Web Framework For Biological Data
5
13.8 years ago
Snowflake ▴ 50

Hello!

We are doing RNAi-based screenings. There we generate a lot of data of different types for GenXY (many microscopy images, continuously growing data from different experiments, ...). In addition, there is a lot of existing sequence, expression, and functional information from other sources that needs to be linked to our tested genes. At the moment, all these data are stored in hundreds of Excel files. Now we would like to turn the files into a database and display all the information in a proper way. Since I know Python very well, I thought about using Django. Do you think that Django is suitable for this? Has anybody used Django successfully for such a purpose? We would also like to include some applications, such as BLAST, ClustalW, ...

Any tips are much appreciated! Stefanie

Edit:

Thanks for all the answers! I'll try to make things clearer:

We cloned about 1000 RNAi constructs. Each of these constructs has a unique ID. Already at this step there is a lot of information (primer sequence, RNAi sequence, FL sequence, contig memberships, BLASTX, ...). After this, the constructs are tested in plants by particle bombardment. Each experiment consists of about 16 constructs plus some controls. For each of them we get a value plus many images. If a construct shows some effect, it will be repeated at least 4 more times and we'll get at least 4 more values (the final number of values is unknown). Once we have at least 5 values, we calculate some statistics. If the result is significant, we make transgenic plants. Of course, for each construct we want to add as much information as possible (NCBI, proteomics, whatever; this is always growing).
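
For illustration, a minimal sketch of how this structure (constructs with a unique ID, an open-ended list of values, and images) might map onto Django models; all model and field names here are just my placeholders, not a finished schema:

```python
# models.py -- hypothetical sketch, not a finished schema
from django.db import models


class Construct(models.Model):
    unique_id = models.CharField(max_length=32, unique=True)
    primer_sequence = models.TextField(blank=True)
    rnai_sequence = models.TextField(blank=True)
    contig = models.CharField(max_length=64, blank=True)


class Measurement(models.Model):
    # a construct accumulates an open-ended number of values over repeats
    construct = models.ForeignKey(Construct, on_delete=models.CASCADE,
                                  related_name="measurements")
    value = models.FloatField()
    experiment_date = models.DateField()


class ScreenImage(models.Model):
    construct = models.ForeignKey(Construct, on_delete=models.CASCADE,
                                  related_name="images")
    image = models.ImageField(upload_to="screens/")  # requires Pillow
```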

ALL this information we currently have in different Excel files, and we would like to show it in a web interface. I have already started to combine all this information into a few Excel files. I agree, this is the most laborious work. The advantage is that I'm the technician who produced the data, so I know very well what is what. I created 2 Excel files, one with all the experimental data per unique ID and one with all the other information per unique ID. I think this is a good basis for loading it into a database.

What I would like to have: paste anything (a unique ID, all values < 50%, an image link, a function, a TCA, ...) and get ALL linked information back. For instance, an overview of the corresponding unique ID(s), plus some tabs containing all the other information (a tab for experimental data, a tab for sequence information, a tab for images, and so on). Maybe something like this: http://www.gabipd.org/database/cgi-bin/GreenCards.pl.cgi?BioObjectId=664184&Mode=ShowBioObject&QueryKey=fb2a0d954e12a90b7810b5d445957eb0&Start=1&Rows=50
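
A rough sketch of the kind of lookup I have in mind, assuming the placeholder models above (a template would then render one tab per data type):

```python
# views.py -- hypothetical sketch of the "paste anything" lookup
from django.db.models import Q
from django.shortcuts import render

from .models import Construct


def lookup(request):
    term = request.GET.get("q", "")
    hits = Construct.objects.filter(
        Q(unique_id__icontains=term)
        | Q(contig__icontains=term)
        | Q(rnai_sequence__icontains=term)
    )
    # each hit's tabs would pull from hit.measurements.all(),
    # hit.images.all(), and so on
    return render(request, "lookup.html", {"hits": hits})
```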

In addition, it would be great if the data could be easily edited or commented on.

I hope this makes things clearer. Thanks again! Stefanie

web database python • 5.5k views
4

Am I correct in understanding that you want to build an application that includes a data management system and a web-based front end for displaying results and allowing others to interact with the data?

9
13.8 years ago

I waited a bit, since people might have come up with an answer that fits your purpose entirely and does not require developing new code. But since that so far is not the case...

What you want to do is very close to the purpose of our open source systems biology database, dbNP. You can find the development information for dbNP here. Note that this really is a large development project, already involving many communities. It will not fit your purpose right away, but you could add your own modules and benefit from the general work that others have done and are still doing.

dbNP was initiated as a nutritional phenotype database, really a systems biology database for nutrigenomics research, but since the approach is generic it is useful in other fields as well.

The concept for dbNP was published in Genes and Nutrition, and we recently published a second paper about the (intended) query approaches on a biological level. The approach was recently featured in Lucas Laursen's perspective in the Nature supplement about nutrigenomics.

dbNP is designed as a modular data structure that will contain data processing pipelines for many different types of data, both from genomics and from non-genomics approaches. The word "database" is thus not very accurate.

One central module captures the study design. A first version of this generic study capture framework (GSCF) is already available, and you can also access a demo. The study capture module records the design of the study, the samples collected, and the assays performed, and links to the actual experimental data. It follows the ISA-Tab philosophy, and we are in fact working with Susanna Sansone's group in Oxford to implement ISA-Tab input and output itself. This will also allow you to upload complex multi-omics studies to the EBI data repositories in one ISA archive. The study capturing uses NCBO ontologies from the BioPortal to get unique descriptors for many things.

GSCF was developed by the Nutrigenomics Organisation (NuGO) and the Netherlands Bioinformatics Center (NBIC). It was tested with data from the nutrigenomics field and (by members of the Netherlands Toxicogenomics Centre) for toxicogenomics experiments. This showed the need to describe more complex study designs, which is the main reason ISA-Tab has not been implemented already: we need to carefully extend the study description standard itself. On this and some other aspects, the dbNP team now collaborates with the SysMO-DB team, which works on a comparable initiative for data relating to microorganisms.

The other central module in an advanced state of development is the so-called simple assay module (SAM). This module captures descriptive data like length, weight, and gender, but also clinical chemistry data, etc. Of course, this module needed to be user-extensible itself.

Currently the most advanced genomics module is the one for metabolomics data, which is being developed collaboratively by the Netherlands Metabolomics Center and EURRECA, the EU program for micronutrient recommendations.

Still, the approach used can probably best be understood using the microarray module design as an example. Conceptually, each of the genomics modules contains four data levels plus the pipelines to get from one to the other. For Affymetrix microarrays, the first level simply consists of the raw data (.cel) files. These are then quality controlled, filtered, and normalized to yield clean data; we develop the processing pipeline for this as a separate open web portal that is available at arrayanalysis.org. In the next step the data is statistically evaluated, using both standardized statistical approaches and the ones selected by the original research team (the first is done to make data as comparable between studies as possible, an approach copied from the ArrayExpress Atlas). This evaluation follows the study design, but the idea is that you can also recalculate, comparing other groups than originally intended. Finally, the fourth data level will store "biological profiles", the outcome of pathway and gene ontology analyses, and thus in principle allows you to answer questions like "which studies found the same type of larger overall biological effect as I did?". That could mean, for instance, that you did a transcriptomics study of high- versus low-fat diets in mouse liver tissue, and you find a proteomics result from the brains of Alzheimer patients that shows the same overall profile (I made up that example; I am not saying it is biologically real).
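
Purely as a toy illustration of those four levels (this is not dbNP code; the core project is written in Grails, see below):

```python
# toy illustration of the four data levels, not actual dbNP code
from dataclasses import dataclass, field


@dataclass
class MicroarrayDataset:
    raw_cel_files: list = field(default_factory=list)  # level 1: raw .cel files
    clean_data: dict = field(default_factory=dict)     # level 2: QC'd, filtered, normalized
    statistics: dict = field(default_factory=dict)     # level 3: standardized + custom stats
    profiles: dict = field(default_factory=dict)       # level 4: pathway/GO "biological profiles"
```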

Other work that has already been done relates to ChIP and DNA methylation array data, where a lot of R code had already been developed. The genetics module is more in a conceptual state. The latter is US-led, with participation of people from the FDA (division of personal medicine) and UC Davis. I think Larry Parnell has an advisory role in that part.

The core project is developed in Grails, but many of the pipelines are in R and some of the query tools are in Java. For new modules, like an RNAi module, you could probably use Python just as well. But I would discuss that with the core developer team before starting on it.

0

Thanks for all this information! These are very nice examples; I definitely will have a look at them. But I fear it would take a lot of time to adapt a ready-made solution, and the problem is that I have quite limited time and resources for that.

7
13.8 years ago

An important aspect to keep in mind is that databases work best for reasonably simple and well-defined data types.

The more diverse the data types (the experiments that you want to describe), the more difficult it becomes to model them in a way that allows you to cross-reference and query them in a unified manner. In fact, it could even happen that integrating a previously unknown data type would require a major redesign of all the existing schemas.

This is not Django-specific but more a characteristic of object-relational models. So I think your biggest challenge is less about Python or Django and more about modeling your data so that it fits the relational database paradigm.
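
To make that concrete: the usual workaround for ever-changing data types is an entity-attribute-value layout, which avoids schema redesign but gives up typed columns and easy cross-type queries. A hedged Django sketch (the names are mine):

```python
# entity-attribute-value layout: flexible, but weakly typed
from django.db import models


class Entity(models.Model):
    name = models.CharField(max_length=100)


class Observation(models.Model):
    entity = models.ForeignKey(Entity, on_delete=models.CASCADE)
    attribute = models.CharField(max_length=100)  # new data types need no schema change...
    value = models.TextField()                    # ...but every value degrades to text
```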

1

Any chance that you have a link or reference for where you read this?

1

@Keith James: Sure... :-) BOOK: Beautiful Data; CHAPTER: Life in Data: The Story of DNA; BY: Matt Wood, Ben Blackburne. http://oreilly.com/catalog/9780596157128 -- I might be wrong, but it appears Matt is no longer at Sanger, or at least this broken user page leads me to believe that: http://www.sanger.ac.uk/Users/mw4/

ADD REPLY
0

@Istvan Albert: I was reading about how Sanger dynamically maps schema changes on load (meaning that the system dynamically accounts for any changes within the system as it maps to the database and manages those dependencies up- and downstream), but I have been unable to find out how they're doing this. Have you ever heard of anything like this?

0

The application in question is https://github.com/sanger/sequencescape

4
13.8 years ago
Tim Webster ▴ 60

Django is very, very cool but I suspect it is not a great fit. It depends on your data and exactly what you want to do.

Django is very good at ploughing through a database, mapping the tables onto classes, and building web pages to browse all that. However, a lot of the power of Django (and Ruby on Rails, and many other similar kits) comes from "convention over configuration": it makes some assumptions about what you want to do. In the case of your spreadsheet data, its assumptions are probably wrong. For example, even though Django works with database tables, it does not really treat them as tabular data; rather, a table is a collection of rows, and it gives you a lot of tools for finding, displaying, and editing individual rows. That's great when the rows represent things like website visitors, or products in a catalog, or CDs in your record collection, but it is probably not very useful for the kind of experimental data that I ordinarily see in spreadsheets. Then again, maybe in your situation you do want to do a lot of zooming in on rows, in which case Django would be nice.
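
To illustrate what I mean by row-centric (app, model, and field names are all hypothetical):

```python
# Django treats a table as a collection of row objects
from myapp.models import Construct  # hypothetical app and model

construct = Construct.objects.get(unique_id="RNAi-0042")  # fetch one row
construct.contig = "contig_17"                            # edit it in place
construct.save()

# row-oriented filtering across a foreign key is equally cheap
# (assumes a related Measurement model named "measurements"):
weak_hits = Construct.objects.filter(measurements__value__lt=50)
```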

Even though Django is very automagical about building stuff from databases (and building databases from Python classes), it will not be able to automagically clean up spreadsheet files. A lot of the magic is done by parsing out foreign key constraints between tables, so a lot depends on how easily you can take your large collection of Excel spreadsheets and put them into a normalized database. My experience has been that such endeavors are usually very labor intensive, because the spreadsheets rarely follow consistent schemas. If you need to put all your Excel data into MySQL anyway, Django will do useful things with the resulting database, but I am not sure it is worth a migration simply to use Django.
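
For what it's worth, the mechanical part of such a migration can be short once a schema exists; it is the cleanup and normalization that eats the time. A hedged sketch using openpyxl, assuming a fixed column order in the sheet and the hypothetical models above:

```python
# load one spreadsheet into normalized tables; the layout is assumed
import openpyxl

from myapp.models import Construct, Measurement  # hypothetical models

wb = openpyxl.load_workbook("experimental_data.xlsx")
for row in wb.active.iter_rows(min_row=2, values_only=True):
    unique_id, value, date = row[:3]  # assumed column order
    construct, _ = Construct.objects.get_or_create(unique_id=unique_id)
    Measurement.objects.create(construct=construct, value=value,
                               experiment_date=date)
```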

If you do have a well-designed database, Django is very good at building web pages that let you browse through the data and follow the foreign key relations between tables. It also provides many nice infrastructure things (like user management, content management, and an administrative interface) for free.

If you are planning to serve up the spreadsheets as they are, it sounds to me like you might want to consider a free off-the-shelf content management system like Drupal (PHP) or Magnolia (Java). You can still customize such things (for example, write some kind of module to wrap BLAST or whatever) and also just dump the spreadsheets in (with some effort, and good tagging to help people find them).

--a recovering web developer

0

For a Python-based CMS, Zope/Plone would also be a possibility.

0

I think once we know what we want, the data will not change dramatically; rather, we'll be adding data all the time. That's why I first want to design the pages, to be sure what we actually need. Once I know this, I'll try to put all my data into a hopefully well-designed database. This I have to do anyway; I have to get rid of Excel, otherwise it's complete chaos...

3
13.8 years ago
Blunders ★ 1.1k

You might check out Galaxy; I believe it's open source and written in Python.

I agree with @mndoci's comment though: it's not all that clear what the objective of your project would be. I'd focus on clearly defining that, and who the audience for it would be, before selecting a platform, database type, etc.

I'm very interested in hearing more about your project, and I hope you'll update the description of your question to better reflect the system you're looking to build.

Cheers!

3
13.8 years ago

Have a look at the GMOD project; there are many tools you could use. Sequence and expression data can very easily be stored and browsed with GBrowse (a main part of GMOD). Examples of what can be done with it are FlyBase and the Sol Genomics Network (the source code of the Sol Genomics database is available on GitHub), where they store both sequence and image data. GMOD tools are written mostly in Perl. Chado is a relational database schema used very often in GMOD databases; it is based on sequence ontologies. GMOD has a very big, active community and organizes workshops for developers.

Best,

