How does 23andme store and analyze SNP array data of 12M humans?
1
0
Entering edit mode
2.8 years ago
William ★ 5.3k

According to their website they have 12M customers.

The number of SNPs on a SNP array is smaller than in WGS, but still ca. 600,000 according to their website.

12,000,000 × 600,000 = 7,200,000,000,000 = 7.2 trillion data points
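To put that count in storage terms, here is a rough back-of-the-envelope sketch. The 2-bits-per-genotype figure is the density used by PLINK's binary .bed format; the 1-byte figure is just a naive comparison point. The function names are made up for illustration, and nothing here says how 23andMe actually stores the data.

```python
# Back-of-the-envelope size of a full samples x SNPs genotype matrix.
# Figures from the question: 12M customers, ca. 600,000 SNPs per array.
N_SAMPLES = 12_000_000
N_SNPS = 600_000


def genotype_calls(n_samples, n_snps):
    """Total number of genotype calls in the joint matrix."""
    return n_samples * n_snps


def storage_bytes(n_calls, bits_per_call):
    """Bytes needed to hold n_calls at a fixed width, rounded up."""
    return (n_calls * bits_per_call + 7) // 8


calls = genotype_calls(N_SAMPLES, N_SNPS)
two_bit = storage_bytes(calls, 2)   # assumption: 2 bits per call, as in PLINK .bed
one_byte = storage_bytes(calls, 8)  # assumption: naive 1 byte per call

print(f"calls:           {calls:,}")
print(f"2 bits per call: {two_bit / 1e12:.1f} TB")
print(f"1 byte per call: {one_byte / 1e12:.1f} TB")
```

So the raw hard calls, without any metadata or compression, would be on the order of a few terabytes.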

The easiest way to store that would be as separate files per sample or per plate.

But for joint analysis (e.g. GWAS), a single table is needed.

Do they store that table as:

  1. A VCF file
  2. A relational database
  3. Some sort of NoSQL database
  4. PLINK files
  5. Some dedicated SNP array storage format
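For a feel of what a compact binary option (4 or 5) could look like, here is a minimal sketch that packs four genotype calls into each byte. The 0/1/2/3 coding (alt-allele count, 3 = missing) and the function names are assumptions made for illustration; this is not PLINK's actual bit layout and not a known 23andMe format.

```python
def pack_genotypes(calls):
    """Pack genotype codes into 2 bits each, four calls per byte.

    Assumed coding: 0, 1, 2 = number of alt alleles, 3 = missing call.
    """
    packed = bytearray((len(calls) + 3) // 4)
    for i, g in enumerate(calls):
        if g not in (0, 1, 2, 3):
            raise ValueError(f"unexpected genotype code: {g!r}")
        # Call i lives in byte i // 4, at bit offset 2 * (i % 4).
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n_calls):
    """Recover the first n_calls genotype codes from a packed buffer."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n_calls)]


example_calls = [0, 1, 2, 3, 2, 0]
example_packed = pack_genotypes(example_calls)
example_roundtrip = unpack_genotypes(example_packed, len(example_calls))
```

With one such packed row per SNP, reading a single SNP across all samples is a contiguous read, which suits per-SNP association tests.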
2.8 years ago
zx8754 12k

Analyses are done per SNP, region, gene, or pathway. I doubt anyone looks at all 600K SNPs in one go. Plus, most would be excluded as monomorphic or as highly correlated with other SNPs. SNPs can be split per chromosome, or into even smaller 5 Mb chunks. Samples could be split per population or phenotype. These steps would make the data manageable.
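A minimal sketch of two of those steps, dropping monomorphic SNPs and binning the rest into 5 Mb windows per chromosome. The record layout (ID, chromosome, position, list of alt-allele counts with None for a missing call) and all names are made up for illustration; pruning of highly correlated SNPs is not shown.

```python
from collections import defaultdict

CHUNK_BP = 5_000_000  # 5 Mb window size, as suggested above


def is_monomorphic(genotypes):
    """True if only one allele is observed among the called genotypes.

    Genotypes are alt-allele counts (0, 1, 2); None marks a missing call.
    A SNP with no called genotypes is also treated as uninformative.
    """
    called = [g for g in genotypes if g is not None]
    alt_alleles = sum(called)
    return alt_alleles == 0 or alt_alleles == 2 * len(called)


def chunk_snps(snps, chunk_bp=CHUNK_BP):
    """Group polymorphic SNP IDs by (chromosome, window index)."""
    chunks = defaultdict(list)
    for snp_id, chrom, pos, genotypes in snps:
        if is_monomorphic(genotypes):
            continue  # no variation in this sample set, nothing to test
        chunks[(chrom, pos // chunk_bp)].append(snp_id)
    return dict(chunks)


# Toy data: (SNP ID, chromosome, position in bp, genotypes for three samples)
toy_snps = [
    ("rs1", "1", 1_200_000, [0, 1, 2]),
    ("rs2", "1", 4_900_000, [0, 0, 0]),     # monomorphic, dropped
    ("rs3", "1", 5_100_000, [1, None, 2]),
    ("rs4", "2", 300_000, [2, 2, 2]),       # monomorphic, dropped
]
toy_chunks = chunk_snps(toy_snps)
```

Each (chromosome, window) key can then be processed as an independent job, so no single step needs the full 600K-SNP table in memory.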

But, yes, it would be good to know how they do it.

Their publications give some idea of how they do it: filter on "Methods" at https://research.23andme.com/publications/. Also, from a quick look at their GitHub organization (https://github.com/23andMe), it looks like they use Python and work with VCFs.
