Hello all,
This question might be a bit odd, but I am working on transforming an in-house database into an analysis-friendly structure, and I'd like to get some suggestions from the community.
Currently, each patient sequenced is a record, and each gene sequenced for that patient is a linked record. For each gene, the pathogenic mutation identified is stored in one field, and a comma-separated list of non-pathogenic polymorphisms is stored in another field, with the zygosity noted alongside each polymorphism.
The free-form nature of the polymorphisms field makes it difficult to run queries that drill down on the zygosity of specific polymorphisms across patients. I could normalize the database so that each polymorphism is a separate record, with tables linking each gene to each variant found and its zygosity, but that is a level of complexity others on the team would not be able to work with.
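Just to make the idea concrete, the normalized shape I have in mind is roughly the following, sketched with SQLite from Python purely to show the structure (our actual database is FileMaker, and every table and column name here is made up):

```python
import sqlite3

# Hypothetical normalized layout -- all names are illustrative, not our schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    name       TEXT
);
CREATE TABLE gene_results (              -- one row per gene sequenced per patient
    gene_result_id      INTEGER PRIMARY KEY,
    patient_id          INTEGER REFERENCES patients(patient_id),
    gene                TEXT,
    pathogenic_mutation TEXT             -- NULL if none was found
);
CREATE TABLE polymorphism_calls (        -- one row per non-pathogenic polymorphism
    gene_result_id INTEGER REFERENCES gene_results(gene_result_id),
    polymorphism   TEXT,
    zygosity       TEXT                  -- e.g. 'het' or 'hom'
);
""")

# The drill-down that is hard against a comma-separated field:
# which patients carry a given polymorphism in a given zygosity?
rows = con.execute("""
    SELECT p.name, g.gene, c.polymorphism, c.zygosity
    FROM polymorphism_calls c
    JOIN gene_results g ON g.gene_result_id = c.gene_result_id
    JOIN patients     p ON p.patient_id     = g.patient_id
    WHERE c.polymorphism = ? AND c.zygosity = ?
""", ("c.123A>G", "het")).fetchall()
```

That is the kind of structure I am hesitant to impose on the rest of the team.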
Have you encountered such challenges? How would you address this problem?
I welcome your inputs - have a great day!
--
Ram
That does not sound like a good idea.
While there may be differing opinions on just how "normalized" data should be, what you describe above is not even that - it sounds like data that is not stored in a basic representation the database can work with (unless your database knows how to query on a comma-separated field). That can cause a lot of trouble later on when the data model needs to change.
Fortunately, FileMaker copes well with CSV fields. It seems to keep a full-text index on all text fields, so as long as a query matches the beginning of a word, the record is retrieved. We can also use wildcards (somewhat like regex) to be more specific, so the CSV itself is not a problem for the database.
This is not actually true. I have them in this format:
I know this is as near craziness as it gets, but the one thing I am happy about is that it's not a spreadsheet :)
How big is the entire dataset? Do you need a database at all? Can you just flat-file it and load the entire thing into memory? Or organize your data in a way that lets you load things piecemeal into memory? I tend to see a lot of over-engineering when it comes to databases. If this is just for personal analysis (or a small-ish number of interactions) and not some enterprise-level application handling millions of queries a second, I would keep it simple and get on with the analysis.
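To give a sense of what I mean, assuming you can export something like one row per patient/gene/polymorphism to CSV (the file name and column names below are made up), the whole thing probably fits comfortably in memory:

```python
import csv
from collections import defaultdict

# Load the whole export into memory and index it by polymorphism.
# The file name and column names are assumptions about the export, not known facts.
by_polymorphism = defaultdict(list)
with open("variants_export.csv", newline="") as handle:
    for row in csv.DictReader(handle):
        by_polymorphism[row["polymorphism"]].append(row)

# Example drill-down: all heterozygous carriers of one polymorphism.
het_carriers = [row["patient_id"]
                for row in by_polymorphism["c.123A>G"]
                if row["zygosity"] == "het"]
```

Once it is in memory like that, filtering, grouping, and cross-tabulating are trivial.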
This is exactly what I'm doing. It's a kinda-sorta flat file, which keeps analysis easy and interface building quick, so I normalize as best I can and enforce data constraints to a reasonable degree, but the problem is that it still does not lend itself well to my analyses.
Damian made some good points here. In my experience it is quite common that data in LIMS systems (and the same seems to be true for ERP systems) is not adequately normalized. The data then has to be exported from the LIMS into another database for analysis. In this step you often transform the data and apply some heuristic rules, since the lab people have used certain conventions when they stored their primary data in the LIMS. These conventions often differ from project to project.
I call this the data warehouse approach: data is exported from the primary database (long-term archival) into a secondary database optimized for analysis. Data is only imported by a single controlled process, not by online users. Although this sounds like a huge effort, it can be implemented quite simply with an sqlite3 database and some (Python) scripts.
Most likely this can be accomplished solely in memory, as suggested by Damian, but I would recommend using a persistent database to decouple the tasks. You can have separate scripts for importing data, running analyses, and reporting. Coding and testing will be much easier with a persistent database.
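As a minimal sketch of the import step, assuming the export is a CSV with one row per gene per patient and that zygosity is written in parentheses after each polymorphism (the file name, field names, and that convention are all guesses on my part):

```python
import csv
import sqlite3

# Controlled import into the analysis database. The export file name, the
# column names, and the "variant (zygosity)" convention are assumptions
# about the primary data, not known facts.
con = sqlite3.connect("analysis.sqlite3")
con.execute("""CREATE TABLE IF NOT EXISTS polymorphism_calls (
                   patient_id TEXT, gene TEXT, polymorphism TEXT, zygosity TEXT)""")

def parse_polymorphisms(field):
    """Turn 'c.123A>G (het), c.456C>T (hom)' into (variant, zygosity) pairs."""
    for item in filter(None, (part.strip() for part in field.split(","))):
        variant, _, zygosity = item.partition("(")
        yield variant.strip(), zygosity.rstrip(")").strip() or None

with open("filemaker_export.csv", newline="") as handle:
    for row in csv.DictReader(handle):
        for variant, zygosity in parse_polymorphisms(row["polymorphisms"]):
            con.execute("INSERT INTO polymorphism_calls VALUES (?, ?, ?, ?)",
                        (row["patient_id"], row["gene"], variant, zygosity))
con.commit()
```

The analysis and reporting scripts then only ever read from the analysis database, so they stay completely independent of how the import was done.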
@piet: Please use ADD COMMENT/ADD REPLY when responding to existing posts. SUBMIT ANSWER should only be used for new answers to the original question.
Yes, mom. But in this case I believe that my posting has enough substance to be an answer in its own right.
Is it an answer to the question originally asked by @Ram or to the point made by @Damian? You clearly have significant expertise with databases, but I find these multiple answers confusing since the discussion is now split into pieces.
Thank you. It would seem that I have tried a lot of what people are recommending. I have separate export layouts, for example, that circumvent storage-vs-display discrepancies.
I thought I'd write a transformation pipeline that moves data between layers like you suggest, but the data notations are so idiosyncratic (there is a particularly obstinate person who makes my life difficult) that I ultimately have to transform the data manually. Fortunately, the dataset is small enough (~6-7K records) that I can incrementally clean historic records so that future analyses require less transformation, but it looks like small steps are all I can take.
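For anyone curious, the cleanup rules I apply by hand look roughly like this when scripted (the notation variants below are invented examples to show the shape of the problem, not the real entries):

```python
import re

# Invented examples of the kind of notation drift I keep running into;
# the real entries differ, but the heuristic cleanup has the same shape.
ZYGOSITY_SYNONYMS = {
    "het": "het", "heterozygous": "het", "htz": "het",
    "hom": "hom", "homozygous": "hom", "hmz": "hom",
}

def normalise_call(raw):
    """'c.123A>G(Heterozygous)' -> ('c.123A>G', 'het'); returns None if unparseable."""
    match = re.match(r"\s*(?P<variant>[^(]+?)\s*\(\s*(?P<zyg>[^)]+?)\s*\)\s*$", raw)
    if not match:
        return None  # falls back to manual review
    zygosity = ZYGOSITY_SYNONYMS.get(match.group("zyg").lower())
    return (match.group("variant"), zygosity) if zygosity else None

print(normalise_call("c.123A>G(Heterozygous)"))  # ('c.123A>G', 'het')
```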
I seem to have moved on to cribbing now, so I'll stop :)