I've downloaded the MobiDB human data in JSON format. It contains disorder information for regions of each protein. The issue I'm having is converting the nested JSON structure into a table.
The file can be downloaded here: click on "json" next to human.
Here's an example of two proteins:
{ "acc" : "O43760", "sequence" : "MESGAYGAAKAGGSFDLRRFLTQPQVVARAVCLVFALIVFSCIYGEGYSNAHESKQMYCVFNRNEDACRYGSAIGVLAFLASAFFLVVDAYFPQISNATDRKYLVIGDLLFSALWTFLWFVGFCFLTNQWAVTNPKDVLVGADSVRAAITFSFFSIFSWGVLASLAYQRYKAGVDDFIQNYVDPTPDPNTAYASYPGASVDNYQQPPFTQNAETTEGYQPPPVY", "ncbi_taxon_id" : 9606, "organism" : "Homo sapiens (Human)", "mobidb_consensus" : { "disorder" : { "predictors" : [ { "regions" : [ [ 1, 19, "D" ], [ 46, 54, "D" ], [ 96, 100, "D" ], [ 178, 224, "D" ] ], "method" : "simple" }, { "regions" : [ ], "dc" : 0, "method" : "mobidb-lite", "scores" : [ 0.625, 0.75, 0.75, 0.5, 0.625, 0.625, 0.5, 0.375, 0.5, 0.5, 0.375, 0.375, 0.375, 0.25, 0.25, 0.25, 0.25, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.25, 0.375, 0.375, 0.375, 0.375, 0.5, 0.5, 0.5, 0.375, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.625, 0.5, 0.5, 0.625, 0.75, 0.75, 0.75, 0.75, 0.875, 0.75, 0.875, 0.625, 0.625, 0.875, 0.875, 0.875, 0.75, 0.75, 0.75 ] } ] } } }
{ "acc" : "Q92728", "sequence" : "MPPKTPRKTAATAAAAAAEPPAPPPPPPPEEDPEQDSGPEDLPLVRGSITNGR", "ncbi_taxon_id" : 9606, "organism" : "Homo sapiens (Human)", "mobidb_consensus" : { "disorder" : { "predictors" : [ { "regions" : [ [ 1, 53, "D" ] ], "method" : "simple" }, { "regions" : [ [ 1, 53, "D_WC" ] ], "dc" : 1, "method" : "mobidb-lite", "scores" : [ 1, 1, 0.875, 1, 1, 0.875, 0.75, 0.875, 0.75, 0.75, 0.75, 0.75, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.75, 0.75, 0.875, 1, 0.875, 0.875, 1, 1, 0.875, 0.875 ] } ] } } }
This type of structure is repeated thousands of times to capture information across many proteins.
I would like to format the data as a simple table with columns for the acc ID and the disorder predictor regions:
ID, region_start, region_end, disorder_type
O43760, 1, 19, "D"
O43760, 46, 54, "D"
O43760, 96, 100, "D"
O43760, 178, 224, "D"
Q92728, 1, 53, "D"
I know how to use the terminal (e.g. downloaded software, awk) fairly well and I also use R, so solutions using those tools would be greatly appreciated, but if you know of another way that's fine too. I have been playing around with jq and jt but haven't managed to solve this problem with them yet.
Any help would be appreciated! Thanks!
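If Python counts as "another way", here is a minimal sketch of the flattening. It assumes the file holds one JSON object per line, as in the two example records, and that the wanted regions are those of the predictor whose method is "simple" (that is what the desired table shows). The names `regions_to_rows` and `write_table` and the trimmed `SAMPLE` records are only illustrative; they are not part of MobiDB.

```python
import csv
import json
import sys

# Trimmed copies of the two example records (sequence and scores removed).
SAMPLE = [
    '{"acc": "O43760", "mobidb_consensus": {"disorder": {"predictors": ['
    '{"regions": [[1, 19, "D"], [46, 54, "D"]], "method": "simple"}, '
    '{"regions": [], "dc": 0, "method": "mobidb-lite"}]}}}',
    '{"acc": "Q92728", "mobidb_consensus": {"disorder": {"predictors": ['
    '{"regions": [[1, 53, "D"]], "method": "simple"}, '
    '{"regions": [[1, 53, "D_WC"]], "dc": 1, "method": "mobidb-lite"}]}}}',
]


def regions_to_rows(lines, method="simple"):
    """Yield (acc, start, end, disorder_type) for each region of one predictor."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        predictors = (
            record.get("mobidb_consensus", {})
            .get("disorder", {})
            .get("predictors", [])
        )
        for predictor in predictors:
            if predictor.get("method") != method:
                continue  # assumption: only the chosen predictor is wanted
            for start, end, disorder_type in predictor.get("regions", []):
                yield record["acc"], start, end, disorder_type


def write_table(lines, out, method="simple"):
    """Write the header plus one CSV row per region to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(["ID", "region_start", "region_end", "disorder_type"])
    writer.writerows(regions_to_rows(lines, method))


if __name__ == "__main__":
    write_table(SAMPLE, sys.stdout)
```

For the real download, pass the opened file instead of `SAMPLE`. Because records are read one line at a time, the whole file never has to fit in memory. Passing `method="mobidb-lite"` would give the regions of the other predictor instead.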
Can I ask how you got your file in json format? Every time I have tried to download it, I can only get the mjson format, which does not load in R.
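A possible workaround, sketched under the assumption that an mjson file simply stores one JSON object per line (as the two records quoted in the question do): rewrite it as a single JSON array, which ordinary JSON loaders generally accept. The function names and file paths here are hypothetical.

```python
import json


def mjson_to_records(lines):
    """Parse one JSON object per non-empty line into a list of dicts."""
    return [json.loads(line) for line in lines if line.strip()]


def convert(src_path, dst_path):
    """Rewrite a line-delimited JSON file as one JSON array; return the record count."""
    with open(src_path, encoding="utf-8") as src:
        records = mjson_to_records(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst)
    return len(records)
```

Note that this loads every record into memory before writing, so for a very large download a line-by-line approach may be the better fit.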
I doubt they ever had the files in json format. At least when I wrote the script shown below, the files were in mjson format (see the Usage portion below; the filename has mjson in it).