I'm writing to ask if anyone is aware of database-backed programs for management of SNP data, besides the ones that I list below. To a first approximation, the purpose of such a tool is to import data (which often requires validation) from upstream source files into a database, and then export it in a form usable by analysis tools.
As discussed in my paper (see below), tools like PLINK expect the data to already be in a format they can use. However, getting data into that format can be a nightmare, especially if the data is dirty. So, such a system would be a supplement to existing tools. Of course, once the data is in the database, it can be used for other things.
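To make the import/validate/export idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not how SNPpy works; the table layout, the function names (`import_calls`, `export_ped`) and the validation rule (alleles must be one of A, C, G, T) are assumptions made up for illustration. The export writes PED-style lines with placeholder family, parent, sex and phenotype columns.

```python
import sqlite3

VALID_ALLELES = set("ACGT")  # assumption: only plain nucleotide calls are accepted


def import_calls(conn, rows):
    """Load (sample, snp, allele1, allele2) tuples; return the rows that were rejected."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS genotype ("
        "sample TEXT, snp TEXT, a1 TEXT, a2 TEXT, "
        "PRIMARY KEY (sample, snp))"
    )
    rejected = []
    for row in rows:
        ok = (
            len(row) == 4
            and all(row[:2])
            and row[2] in VALID_ALLELES
            and row[3] in VALID_ALLELES
        )
        if not ok:
            rejected.append(row)
            continue
        try:
            conn.execute("INSERT INTO genotype VALUES (?, ?, ?, ?)", row)
        except sqlite3.IntegrityError:
            # Duplicate (sample, snp) pair: treat as dirty data too.
            rejected.append(row)
    conn.commit()
    return rejected


def export_ped(conn, snp_order):
    """Return PED-style lines; calls missing from the database are written as '0 0'."""
    samples = [
        r[0]
        for r in conn.execute("SELECT DISTINCT sample FROM genotype ORDER BY sample")
    ]
    lines = []
    for sample in samples:
        calls = {
            snp: (a1, a2)
            for snp, a1, a2 in conn.execute(
                "SELECT snp, a1, a2 FROM genotype WHERE sample = ?", (sample,)
            )
        }
        # Family ID, individual ID, father, mother, sex, phenotype (all placeholders).
        fields = [sample, sample, "0", "0", "0", "-9"]
        for snp in snp_order:
            fields.extend(calls.get(snp, ("0", "0")))
        lines.append(" ".join(fields))
    return lines
```

For example, importing four calls where one has a bad allele leaves that row in the rejected list, and the exported line for that sample carries a missing genotype at that SNP.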
I started writing one of these in 2008 in desperation when dealing with some particularly dirty data. This program is called SNPpy. See the PLoS ONE paper online.
I have looked at other software that does this, but only found two, namely SNPLims: a data management system for genome wide association studies, and GWAS Analyzer: integrating genotype, phenotype and public annotation data for genome-wide association study analysis. However, the lead author of SNPLims told me the source code is unavailable, and GWAS Analyzer has (in my opinion) major usability issues. I'm using the source code available here.
I am not aware of any other systems. I find it hard to believe a system like this is not in standard use - perhaps I am missing something. It seems entirely possible that other systems have been created but are proprietary or have simply not been written about.
So, I'm writing to ask if anyone has written or is otherwise using a system like this, aside from those listed here, or if not, is aware of one. Thanks.
EDIT: Updated with the recently published PLoS ONE paper. Note: I'm also trying to upload my SNPpy paper to arXiv, but they have an annoying endorsement procedure: I need to be endorsed by someone who has recently uploaded papers (at least 2 in the last 5 years) to the Quantitative Biology section of arXiv. If you can help, please add a comment. Thanks.
What paper discusses PLINK? And what is wrong with storing your data in this format? No matter what system you use, you still need a format to input.
This is a great question and has inspired me to ask around to see how others approach the storage of SNP-based information, especially beyond the phenotype association stuff.
@Adrian: My paper talks about PLINK briefly. I'm not sure what you mean by "what is wrong with storing your data in this format?". What format is that?
@Larry: I would be interested in your comments on my paper, if you care to look at it.
What I meant to ask you was, what is wrong with storing your data in PLINK?
You mention that getting data into PLINK format is a nightmare. Wouldn't this be the case with any other storage format as well?
@Adrian: I'm not sure what you mean by storing your data in PLINK. Perhaps you mean MAP/PED format (and similar formats like the transposed/long-format/binary ones)? I don't think these are specific to PLINK, though. In any case, there is nothing wrong with these formats, and, yes, getting the data into any file format is problematic, especially if the data is dirty, and one is converting directly from source files. Hence the motivation for using the database (which can do validation) as an intermediate format. Have you looked at my paper? This goes into the motivation at length.
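As a rough illustration of what "the database can do validation" might look like, here is a small sketch where the checks live in the schema itself rather than in conversion scripts. It uses SQLite through Python's sqlite3 module; the table and column names and the helper functions `open_db` and `try_insert` are hypothetical and are not taken from SNPpy.

```python
import sqlite3

# Hypothetical schema: constraints reject bad rows at load time.
SCHEMA = """
CREATE TABLE snp (
    rsid       TEXT PRIMARY KEY,
    chromosome TEXT NOT NULL,
    position   INTEGER NOT NULL CHECK (position > 0)
);
CREATE TABLE genotype (
    sample  TEXT NOT NULL,
    rsid    TEXT NOT NULL REFERENCES snp(rsid),
    allele1 TEXT NOT NULL CHECK (allele1 IN ('A', 'C', 'G', 'T')),
    allele2 TEXT NOT NULL CHECK (allele2 IN ('A', 'C', 'G', 'T')),
    PRIMARY KEY (sample, rsid)
);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, row):
    """Insert one genotype row; return False if any constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO genotype VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, a duplicate call for the same sample and SNP, an allele outside A/C/G/T, or a call for an SNP that is not in the `snp` table all fail at insert time, before anything is exported to MAP/PED.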