Hello all,
This question might be a bit odd, but I am working on transforming an in-house database into an analysis-friendly structure, and I'd like to get some suggestions from the community.
Currently, each patient sequenced is a record, and each gene sequenced for that patient is a linked record. For each gene, the pathogenic mutation identified is stored in one field, and a comma-separated list of non-pathogenic polymorphisms is stored in another field, with the zygosity noted alongside each polymorphism.
The free-form nature of the polymorphisms field makes it difficult to run queries that drill down on the zygosity of specific polymorphisms across patients. I could normalize the database so that each polymorphism is a separate record, with tables linking each gene to each variant found and its zygosity, but that is a level of complexity others on the team would not be able to work with.
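Just to make the idea concrete, the normalized shape I have in mind is roughly the following, sketched with SQLite from Python purely to show the structure (our actual database is FileMaker, and every table and column name here is made up):

```python
import sqlite3

# Hypothetical normalized layout -- all names are illustrative, not our schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    name       TEXT
);
CREATE TABLE gene_results (              -- one row per gene sequenced per patient
    gene_result_id      INTEGER PRIMARY KEY,
    patient_id          INTEGER REFERENCES patients(patient_id),
    gene                TEXT,
    pathogenic_mutation TEXT             -- NULL if none was found
);
CREATE TABLE polymorphism_calls (        -- one row per non-pathogenic polymorphism
    gene_result_id INTEGER REFERENCES gene_results(gene_result_id),
    polymorphism   TEXT,
    zygosity       TEXT                  -- e.g. 'het' or 'hom'
);
""")

# The drill-down that is hard against a comma-separated field:
# which patients carry a given polymorphism in a given zygosity?
rows = con.execute("""
    SELECT p.name, g.gene, c.polymorphism, c.zygosity
    FROM polymorphism_calls c
    JOIN gene_results g ON g.gene_result_id = c.gene_result_id
    JOIN patients     p ON p.patient_id     = g.patient_id
    WHERE c.polymorphism = ? AND c.zygosity = ?
""", ("c.123A>G", "het")).fetchall()
```

That is the kind of structure I am hesitant to impose on the rest of the team.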
Have you encountered such challenges? How would you address this problem?
I welcome your inputs - have a great day!
--
Ram
That does not sound like a good idea.
While there may be differing opinions on just how "normalized" data should be, what you describe above is not even that - it sounds like data that is not stored in a basic representation the database can work with (unless your database knows how to query on a comma-separated field). That can cause a lot of trouble later on when the data model needs to change.
Fortunately, FileMaker copes well with CSV fields. It seems to keep a full-text index on all text fields, so as long as a query matches the beginning of a word, the record is retrieved. We can also use wildcards (somewhat like regex) to be more specific, so the CSV itself is not a problem for the database.
This is not actually true. I have them in this format:
I know this is as near craziness as it gets, but the one thing I am happy about is that it's not a spreadsheet :)
How big is the entire dataset? Do you need a database at all? Can you just flat-file it and load the entire thing into memory? Or organize your data in a way that lets you load things piecemeal into memory? I tend to see a lot of over-engineering when it comes to databases. If this is just for personal analysis (or a small-ish number of interactions) and not some enterprise-level application handling millions of queries a second, I would keep it simple and get on with the analysis.
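To give a sense of what I mean, assuming you can export something like one row per patient/gene/polymorphism to CSV (the file name and column names below are made up), the whole thing probably fits comfortably in memory:

```python
import csv
from collections import defaultdict

# Load the whole export into memory and index it by polymorphism.
# The file name and column names are assumptions about the export, not known facts.
by_polymorphism = defaultdict(list)
with open("variants_export.csv", newline="") as handle:
    for row in csv.DictReader(handle):
        by_polymorphism[row["polymorphism"]].append(row)

# Example drill-down: all heterozygous carriers of one polymorphism.
het_carriers = [row["patient_id"]
                for row in by_polymorphism["c.123A>G"]
                if row["zygosity"] == "het"]
```

Once it is in memory like that, filtering, grouping, and cross-tabulating are trivial.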
This is exactly what I'm doing. It's a kinda-sorta flat file, which keeps analysis easy and interface building quick, so I normalize as best I can and enforce data constraints to a reasonable degree, but the problem is that it still does not lend itself well to my analyses.
Damian made some good points here. In my experience it is quite common that data in LIMS systems (and the same seems to be true for ERP systems) is not adequately normalized. The data then has to be exported from the LIMS into another database for analysis. In this step you often transform the data and apply some heuristic rules, since the lab people have used certain conventions when they stored their primary data in the LIMS. These conventions often differ from project to project.
I call this the data warehouse approach: data is exported from the primary database (long-term archival) into a secondary database optimized for analysis. Data is only imported by a single controlled process, not by online users. Although this sounds like a huge effort, it can be implemented quite simply with an sqlite3 database and some (Python) scripts.
Most likely this can be accomplished solely in memory, as suggested by Damian, but I would recommend using a persistent database to decouple the tasks. You can have separate scripts for importing data, running analyses, and reporting. Coding and testing will be much easier with a persistent database.
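As a minimal sketch of the import step, assuming the export is a CSV with one row per gene per patient and that zygosity is written in parentheses after each polymorphism (the file name, field names, and that convention are all guesses on my part):

```python
import csv
import sqlite3

# Controlled import into the analysis database. The export file name, the
# column names, and the "variant (zygosity)" convention are assumptions
# about the primary data, not known facts.
con = sqlite3.connect("analysis.sqlite3")
con.execute("""CREATE TABLE IF NOT EXISTS polymorphism_calls (
                   patient_id TEXT, gene TEXT, polymorphism TEXT, zygosity TEXT)""")

def parse_polymorphisms(field):
    """Turn 'c.123A>G (het), c.456C>T (hom)' into (variant, zygosity) pairs."""
    for item in filter(None, (part.strip() for part in field.split(","))):
        variant, _, zygosity = item.partition("(")
        yield variant.strip(), zygosity.rstrip(")").strip() or None

with open("filemaker_export.csv", newline="") as handle:
    for row in csv.DictReader(handle):
        for variant, zygosity in parse_polymorphisms(row["polymorphisms"]):
            con.execute("INSERT INTO polymorphism_calls VALUES (?, ?, ?, ?)",
                        (row["patient_id"], row["gene"], variant, zygosity))
con.commit()
```

The analysis and reporting scripts then only ever read from the analysis database, so they stay completely independent of how the import was done.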
@piet: Please use ADD COMMENT/ADD REPLY when responding to existing posts. SUBMIT ANSWER should only be used for new answers to the original question.
Yes, mom. But in this case I believe that my posting has enough substance to be an answer in its own right.
Is it an answer to the question originally asked by @Ram or to the point made by @Damian? You clearly have significant expertise with databases, but I find these multiple answers confusing since the discussion is now split into pieces.
Thank you. It would seem that I have tried a lot of what people are recommending. I have separate export layouts, for example, that circumvent storage-vs-display discrepancies.
I thought I'd write a transformation pipeline that moves data between layers like you suggest, but the data notations are so idiosyncratic (there is a particularly obstinate person who makes my life difficult) that I ultimately have to transform the data manually. Fortunately, the dataset is small enough (~6-7K records) that I can incrementally clean historic records so that future analyses require less transformation, but it looks like small steps are all I can take.
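For anyone curious, the cleanup rules I apply by hand look roughly like this when scripted (the notation variants below are invented examples to show the shape of the problem, not the real entries):

```python
import re

# Invented examples of the kind of notation drift I keep running into;
# the real entries differ, but the heuristic cleanup has the same shape.
ZYGOSITY_SYNONYMS = {
    "het": "het", "heterozygous": "het", "htz": "het",
    "hom": "hom", "homozygous": "hom", "hmz": "hom",
}

def normalise_call(raw):
    """'c.123A>G(Heterozygous)' -> ('c.123A>G', 'het'); returns None if unparseable."""
    match = re.match(r"\s*(?P<variant>[^(]+?)\s*\(\s*(?P<zyg>[^)]+?)\s*\)\s*$", raw)
    if not match:
        return None  # falls back to manual review
    zygosity = ZYGOSITY_SYNONYMS.get(match.group("zyg").lower())
    return (match.group("variant"), zygosity) if zygosity else None

print(normalise_call("c.123A>G(Heterozygous)"))  # ('c.123A>G', 'het')
```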
I seem to have moved on to cribbing now, so I'll stop :)