In our current data set, we have phenotype data on nearly one million individuals, stored in a PostgreSQL database. For about 500 of the individuals, we also have 65K SNP chips, each stored in its own text file. We are quickly acquiring more chips and are looking for an efficient way to organize them, so that we can select individuals from the database that match certain phenotypic traits and build models from their genotypes.
Does anyone have experience with this? If we simply store the sequence of alleles in the database, one SNP per column is both impossible (Postgres cannot handle 65K columns) and unreasonable. We could have a table where each record corresponds to one SNP in one individual, but that table would be massive. Alternatively, we could use an array type and condense the allele sequence into something that fits there (each row would be an individual's ID plus an array of all their alleles). I have also come across a paper, http://www.sciencedirect.com/science/article/pii/S1476927107000059, in which each row corresponds to a specific SNP and holds an array of the genotypes across individuals. Do any of these methods make sense, or does someone have a better recommendation?
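To make the options concrete, here is a rough sketch of what I imagine the two in-database layouts would look like (the `individual` and `snp` tables and all names are just placeholders, not an actual design):

```sql
-- Option A: one row per (individual, SNP) -- the "long" / normalized layout.
-- The individual and snp tables are assumed to exist; names are placeholders.
CREATE TABLE genotype (
    individual_id integer NOT NULL REFERENCES individual(id),
    snp_id        integer NOT NULL REFERENCES snp(id),
    allele1       char(1) NOT NULL,
    allele2       char(1) NOT NULL,
    PRIMARY KEY (individual_id, snp_id)
);

-- Option B: one row per individual, all genotypes packed into one array.
-- Array position i corresponds to SNP i in some fixed, documented ordering.
CREATE TABLE genotype_array (
    individual_id integer PRIMARY KEY REFERENCES individual(id),
    genotypes     text[] NOT NULL    -- e.g. '{AA,AG,GG,...}', ~65K elements
);
```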
Should we not bother storing the genotypes in the database at all, and instead use the database to point to the text files stored elsewhere?
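The pointer-to-file alternative would presumably be nothing more than something like this (the path is made up):

```sql
-- Option C: keep the genotypes in the text files, store only a pointer.
CREATE TABLE genotype_file (
    individual_id integer PRIMARY KEY REFERENCES individual(id),
    file_path     text NOT NULL      -- e.g. '/data/chips/individual_0001.txt'
);
```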
Sorry if my terminology is confusing and incorrect, I am a computer scientist just getting acquainted with bioinformatics!
Thank you!
Hi- I agree with the idea of "a table where each record corresponds to a SNP and an individual" (i.e. the normalized format in database parlance, right?). I also agree about avoiding arrays.
However, I doubt Postgres or any RDBMS can cope with that situation. 65,000 chips with, say, 1 million SNPs each makes a table of 65 billion rows! With appropriate indexes you could quickly access a given SNP in a given individual, but range queries ("give me the SNPs on chr1 between positions x and y") or any join operation will take forever. (Of course, one could cut down the number of genotypes to store by excluding those that are equal to the reference.)
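To illustrate the kind of range query I mean, against the normalized layout sketched in the question, it would look roughly like this (assuming the snp table carries chromosome and position, which is my assumption):

```sql
-- Assumed annotation table: snp(id, chrom, pos).
CREATE INDEX snp_chrom_pos_idx ON snp (chrom, pos);
CREATE INDEX genotype_snp_idx  ON genotype (snp_id);

-- "Give me the SNPs on chr1 between positions x and y" for one individual.
SELECT s.chrom, s.pos, g.allele1, g.allele2
FROM   genotype g
JOIN   snp s ON s.id = g.snp_id
WHERE  g.individual_id = 42
  AND  s.chrom = 'chr1'
  AND  s.pos BETWEEN 1000000 AND 2000000;
```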
I agree that the size will be extremely large, and I am also not 100% sure the database will be able to manage it efficiently without careful database design. At least PostgreSQL states that there is no limit on the number of rows. ;-) Depending on the query, the runtime will certainly be long - but runtime will likely be long for any approach to accessing such a huge data set.
I also think that prior data reduction is a key step. For example, we do SNP6 analysis but do not store the individual probe intensities, only pre-calculated gene copy-number information. In particular, storing the raw SNP data in the database does not make sense if you want to process it further. I do not think that any available tool will be able to read from a custom database ;-)
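As a sketch of what I mean by storing pre-calculated results rather than raw data (table and column names are just an example, not how our system actually looks):

```sql
-- Store derived, per-gene copy-number calls instead of raw probe intensities.
CREATE TABLE gene_copy_number (
    individual_id integer NOT NULL,
    gene_symbol   text    NOT NULL,
    copy_number   numeric NOT NULL,   -- pre-calculated outside the database
    PRIMARY KEY (individual_id, gene_symbol)
);
```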
As mentioned, I am not sure about the actual data size. Perhaps the OP can clarify what "65k SNP chips" means here: 65,000 chips in total, or one chip per individual assaying 65,000 SNPs?
I am not sure it makes sense to have 65,000 chips for only 500 individuals. That would mean each individual was analyzed 130 times on average... but it's hard to know without more information from the OP, I guess.
PS: Yes - it is the normalized format :)