In our current data set, we have phenotype data on nearly one million individuals, stored in a PostgreSQL database. For about 500 of the individuals, we also have 65K SNP chips, each stored in its own text file. We are quickly acquiring more chips and are looking for an efficient way to organize them, so that we can select individuals from the database that match certain phenotypic traits and build models from their genotypes.
Does anyone have experience with this? If we simply store the sequence of alleles in the database, one SNP per column is both impossible (Postgres cannot handle 65K columns) and unreasonable. We could have a table where each record corresponds to one SNP in one individual, but that table would be massive. Alternatively, we could use an array type and condense the allele sequence into something that fits there (each row would be an individual's ID plus an array of all their alleles). I have also come across a paper, http://www.sciencedirect.com/science/article/pii/S1476927107000059, in which each row corresponds to a specific SNP and holds an array of the genotypes across individuals. Do any of these methods make sense, or does someone have a better recommendation?
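To make the options concrete, here is a rough sketch of what I imagine the two in-database layouts would look like (the `individual` and `snp` tables and all names are just placeholders, not an actual design):

```sql
-- Option A: one row per (individual, SNP) -- the "long" / normalized layout.
-- The individual and snp tables are assumed to exist; names are placeholders.
CREATE TABLE genotype (
    individual_id integer NOT NULL REFERENCES individual(id),
    snp_id        integer NOT NULL REFERENCES snp(id),
    allele1       char(1) NOT NULL,
    allele2       char(1) NOT NULL,
    PRIMARY KEY (individual_id, snp_id)
);

-- Option B: one row per individual, all genotypes packed into one array.
-- Array position i corresponds to SNP i in some fixed, documented ordering.
CREATE TABLE genotype_array (
    individual_id integer PRIMARY KEY REFERENCES individual(id),
    genotypes     text[] NOT NULL    -- e.g. '{AA,AG,GG,...}', ~65K elements
);
```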
Should we not bother storing the genotypes in the database at all, and instead use the database to point to the text files stored elsewhere?
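The pointer-to-file alternative would presumably be nothing more than something like this (the path is made up):

```sql
-- Option C: keep the genotypes in the text files, store only a pointer.
CREATE TABLE genotype_file (
    individual_id integer PRIMARY KEY REFERENCES individual(id),
    file_path     text NOT NULL      -- e.g. '/data/chips/individual_0001.txt'
);
```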
Sorry if my terminology is confusing and incorrect, I am a computer scientist just getting acquainted with bioinformatics!
Thank you!
Hi- I agree with the idea of "a table where each record corresponds to a SNP and an individual" (i.e. the normalized format in database parlance, right?). I also agree about avoiding arrays.
However, I doubt Postgres or any RDBMS can cope with that situation. 65,000 chips with, say, 1 million SNPs each makes a table of 65 billion rows! With appropriate indexes you could quickly access a given SNP in a given individual, but range queries ("give me the SNPs on chr1 between positions x and y") or any join operation will take forever. (Of course, one could cut down the number of genotypes to store by excluding those that are equal to the reference.)
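To illustrate the kind of range query I mean, against the normalized layout sketched in the question, it would look roughly like this (assuming the snp table carries chromosome and position, which is my assumption):

```sql
-- Assumed annotation table: snp(id, chrom, pos).
CREATE INDEX snp_chrom_pos_idx ON snp (chrom, pos);
CREATE INDEX genotype_snp_idx  ON genotype (snp_id);

-- "Give me the SNPs on chr1 between positions x and y" for one individual.
SELECT s.chrom, s.pos, g.allele1, g.allele2
FROM   genotype g
JOIN   snp s ON s.id = g.snp_id
WHERE  g.individual_id = 42
  AND  s.chrom = 'chr1'
  AND  s.pos BETWEEN 1000000 AND 2000000;
```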
I agree that the size will be extremely large, and I am also not 100% sure the database will be able to manage it efficiently without careful database design. At least PostgreSQL states that there is no limit on the number of rows. ;-) Depending on the query, the runtime will certainly be long - but runtime will likely be long for any approach to accessing such a huge data set.
I also think that prior data reduction is a key step. For example, we do SNP6 analysis but do not store the individual probe intensities, only pre-calculated gene copy-number information. In particular, storing the raw SNP data in the database does not make sense if you want to process it further. I do not think that any available tool will be able to read from a custom database ;-)
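As a sketch of what I mean by storing pre-calculated results rather than raw data (table and column names are just an example, not how our system actually looks):

```sql
-- Store derived, per-gene copy-number calls instead of raw probe intensities.
CREATE TABLE gene_copy_number (
    individual_id integer NOT NULL,
    gene_symbol   text    NOT NULL,
    copy_number   numeric NOT NULL,   -- pre-calculated outside the database
    PRIMARY KEY (individual_id, gene_symbol)
);
```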
As mentioned, I am not sure about the actual data size. Perhaps the OP can clarify what "65k SNP chips" means here: 65,000 chips in total, or one chip per individual assaying 65,000 SNPs?
I am not sure it makes sense to have 65,000 chips for only 500 individuals. That would mean each individual was analyzed 130 times on average... but it's hard to know without more information from the OP, I guess.
PS: Yes - it is the normalized format :)