create plink files from 23andMe JSON files

Hi, I have several 23andMe files in JSON format that I want to merge in order to create a PED and MAP file set for my PLINK analyses.

Is there any existing tool that does this job?

(I'm trying to avoid the 4-column 23andMe format because some individuals are missing some of the SNPs).

All the best,

Yorgos

Tags: plink, JSON, SNP, 23andMe

Hi Yorgos,

Can you post a short example of a unit of data from the JSON file, and a brief example of the output that you would like to get from that sample data unit?


Hi Deedee,

The JSON files are really huge, but they look like this:

{"id": "some_id_label", "genome":"__AAGGAAAAAAAAAA__AA__GGAAAA__AAAAAAAA__AAAAAA__AAAAAAAAAAAA__AAAA__AAAA____AA__AAAACCTTTT__CC__CC__CCCCAA____CCTT____TTCC__CC________CC____CCCCCCCCCC______GGGG__GGGGGGGG__GGGGAA__GGAAGGGGGGGGGGGGGG____GG____GGGGAAGGGGGG__GGGGGGGGGGGGGGGGGGGGGGGG__AAGG__TTTTTTTTTTTTTT__CCTTTTTTTTTTTT__CCTTTTTTTTTTTTTTTT__TTTT__TTCC__TTTT__TTTT__TTTTTTTTTT______TT____TTAG__GTGGGTTTTTCTAG__GGAG__AA____CTGG__CCCC__TTTTCCTTAGCCAG__CTCTCCTTAA__CTGTCCCTCCAGCCGGAA__CTGGCCTT__CCAATTCCCT__GGTTAGAATTAATTGGACGGGGCC__GGTTTT__GGCTGG____AAAACTCTTTGGCC____AAGGCCAAAGTTCT__AACC__AACCTTCT__AA__TTAAAAGTAACTAGAGAGAGCCTTCT______CC__AA__ACGG__TTGGGG__GGTTCC__AAACGGTTGG____GGCTGGCCTT__AACTAGAATTCCTTGG__AG__GGTTCCCCCCAGAACTAATTAGAG__GGCC__GGACCTAAAACTGGTTTTCTCCCTCC__CTAACTTTAG______TTAA__AAAGGG__CC__GGGGAA__AAAA__GGCT____GG__AA____AAAGCTAATT

Each pair of letters corresponds to one locus (mostly SNPs, but sometimes also indels), and a double underscore corresponds to a missing genotype. We need a MAP file to interpret the JSON files correctly.
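Splitting the genome string into two-letter genotypes is easy enough; here is a rough sketch of what I have so far in R (sample.json is just a made-up file name):

library(jsonlite)

person <- fromJSON("sample.json")          # one individual's JSON file
g <- person$genome
starts <- seq(1, nchar(g), by = 2)
pairs <- substring(g, starts, starts + 1)  # one two-letter genotype per locus
pairs[pairs == "__"] <- "00"               # recode missing genotypes to PLINK's "0" alleles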

PED files include the following fields (one line per individual):

Family_ID Subject_ID Father_ID Mother_ID Sex Disease_Status SNP1_allele1 SNP1_allele2 SNP2_allele1 SNP2_allele2 etc...

MAP files include the following fields:

Chromosome SNP_ID Genetic_distance BP_position

(genetic distance is irrelevant and can be set to 0).
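If I had an annotation table for the genome string (say key, with one row per locus in string order and columns chrom, snp_id and bp, which is exactly the part I am missing), continuing the sketch above would look roughly like this:

map <- data.frame(chrom = key$chrom, snp_id = key$snp_id, genetic_dist = 0, bp = key$bp)
write.table(map, "out.map", quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)

alleles <- as.vector(rbind(substr(pairs, 1, 1), substr(pairs, 2, 2)))  # interleave allele1/allele2 per locus
ped_line <- c("FAM1", person$id, "0", "0", "0", "-9", alleles)         # unknown parents/sex, missing phenotype
write(paste(ped_line, collapse = " "), "out.ped", append = TRUE)       # one line appended per individual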

I was hoping that there might be some statistical package or tool that already does this job, rather than having to write and debug all of this from scratch.

All the best,

Yorgos


I see. So if "id" and "genome" are the only two properties for each data unit, then it's obviously not a translation of key-value pairs to a flat table.

I don't know of any tool that can do the processing work, but I'll check around as soon as I have time. Thanks for posting that!

Kizuna:

Hi,

I do not know what you mean by the 4-column 23andMe format, but here is something you can do with R to go from .JSON files to a merged data frame (merged.file in the example below) that you can then use to construct your .ped and .map files (I think .ped and .map are tab-delimited text files):

install.packages("jsonlite")
install.packages("plyr")
library(jsonlite); library(plyr)

file1<- fromJSON("/../.JSON") # you can do a for loop here to not enter all your files manually (file2, file3,..)
dfList= list(file1,file2,....) # make all your files a list named dfList
merged.file=join_all(dfList) # merge them all based on common lines.

Once joined, you can manipulate the merged data frame to create the .ped and .map files.
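For example, something like this writes a tab-delimited file without headers or quotes (mydata.ped is just a placeholder name, and the column selection and order will depend on how your merged data frame ends up):

write.table(merged.file, "mydata.ped", quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)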

Hope this helps!

Kiz


Thank you Kizuna! I will try this!

Yorgos


If you don't have accompanying key files for the JSONs, you'll probably need to re-grab the genomic data. See here; that includes a link to the current genome-string-index-to-variant-info file, but it's updated every once in a while. Since you mention that some of your JSONs are missing some SNPs, it sounds like they aren't all compatible with the current key.

If you do re-grab the data, choose 4-column format if at all possible since PLINK 1.9 explicitly supports it: https://www.cog-genomics.org/plink2/input#23file
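For example (the file name, family/individual IDs, and output prefix here are placeholders):

plink --23file genome_data.txt FAM001 IND001 --recode --out sample1

--recode writes the .ped + .map pair; --make-bed is the binary-fileset alternative.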
