Question

How can I parse a json file from MobiDB to retrieve proteome data?

5.9 years ago

Jason2 • 0

I've downloaded the MobiDB human data in JSON format. It contains protein disorder information for regions of each protein. The issue I'm having is converting the nested JSON structure into a table.

The file can be downloaded here (click on "json" next to human).

Here's an example of two proteins:

{  "acc" : "O43760",  "sequence" : "MESGAYGAAKAGGSFDLRRFLTQPQVVARAVCLVFALIVFSCIYGEGYSNAHESKQMYCVFNRNEDACRYGSAIGVLAFLASAFFLVVDAYFPQISNATDRKYLVIGDLLFSALWTFLWFVGFCFLTNQWAVTNPKDVLVGADSVRAAITFSFFSIFSWGVLASLAYQRYKAGVDDFIQNYVDPTPDPNTAYASYPGASVDNYQQPPFTQNAETTEGYQPPPVY",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 19, "D" ], [ 46, 54, "D" ], [ 96, 100, "D" ], [ 178, 224, "D" ] ], "method" : "simple" }, { "regions" : [ ], "dc" : 0, "method" : "mobidb-lite", "scores" : [ 0.625, 0.75, 0.75, 0.5, 0.625, 0.625, 0.5, 0.375, 0.5, 0.5, 0.375, 0.375, 0.375, 0.25, 0.25, 0.25, 0.25, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.25, 0.375, 0.375, 0.375, 0.375, 0.5, 0.5, 0.5, 0.375, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.625, 0.5, 0.5, 0.625, 0.75, 0.75, 0.75, 0.75, 0.875, 0.75, 0.875, 0.625, 0.625, 0.875, 0.875, 0.875, 0.75, 0.75, 0.75 ] } ] } } }
{  "acc" : "Q92728",  "sequence" : "MPPKTPRKTAATAAAAAAEPPAPPPPPPPEEDPEQDSGPEDLPLVRGSITNGR",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 53, "D" ] ], "method" : "simple" }, { "regions" : [ [ 1, 53, "D_WC" ] ], "dc" : 1, "method" : "mobidb-lite", "scores" : [ 1, 1, 0.875, 1, 1, 0.875, 0.75, 0.875, 0.75, 0.75, 0.75, 0.75, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.75, 0.75, 0.875, 1, 0.875, 0.875, 1, 1, 0.875, 0.875 ] } ] } } }

This type of structure is repeated thousands of times to capture information across many proteins.

I would like to format the data as a simple table with columns for the accession ID and the disorder predictor regions:

ID, region_start, region_end, disorder_type

O43760, 1, 19, "D"

O43760, 46, 54, "D"

O43760, 96, 100, "D"

O43760, 178, 224, "D"

Q92728, 1, 53, "D"

I know how to use the terminal (e.g. downloaded software, awk) fairly well and I also use R, so solutions using those tools would be greatly appreciated, but if you know of another way that's fine too. I have been playing around with jq and jt but haven't managed to solve this problem with them yet.

Any help would be appreciated! Thanks!

database annotation parsing proteins format • 1.9k views

ADD COMMENT • link updated 5.8 years ago by ieuangw • 0 • written 5.9 years ago by Jason2 • 0

Can I ask how you got your file in the json format? Every time I have tried to download it, I can only get the mjson format, which does not load in R.

ADD REPLY • link 5.8 years ago by ieuangw • 0

I doubt they ever had the files in json format. At least when I wrote the script shown below, the files were in mjson format (see the Usage portion below; the filename has mjson in it).
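If the goal is only to get an mjson file into a form that a single-document JSON parser accepts, one option is to rewrite it as a JSON array. The sketch below assumes that mjson means one JSON object per line, as the two-protein example in the question suggests. The function name and the shortened sample records are made up for illustration.

```python
import json


def mjson_to_records(lines):
    """Parse an iterable of lines holding one JSON object each, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]


# Shortened stand-ins for two lines of an mjson file (hypothetical content).
sample = [
    '{"acc": "O43760", "ncbi_taxon_id": 9606}\n',
    '{"acc": "Q92728", "ncbi_taxon_id": 9606}\n',
]

records = mjson_to_records(sample)

# Serialising the list gives one valid JSON document (an array of objects),
# which could be written to a file and loaded by a regular JSON reader.
as_json_array = json.dumps(records)
print([record["acc"] for record in records])
```

For a real file, the same function can be given an open file handle instead of the list, since iterating over a text file yields its lines.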

ADD REPLY • link 5.8 years ago by vkkodali_ncbi ★ 3.8k

score 2 · Answer 1 · 2019-01-09

5.9 years ago

vkkodali_ncbi ★ 3.8k

If you can use Python, the standard-library json module can deal with this. Check out https://docs.python.org/3/library/json.html and specifically the 'Decoding JSON' part.

You can use the quick-and-dirty script shown below:

Usage:

./disorder_to_tbl.py disorder_UP000005640.mjson.gz > output_table.tsv

At least for the human file you have pointed to, I did not encounter any errors.
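A minimal sketch of what a script with this usage could look like (not necessarily the script referred to above). It assumes the file holds one JSON object per line, optionally gzip-compressed, and that the regions wanted are those of the "simple" predictor, as in the example table in the question. The function names are illustrative.

```python
#!/usr/bin/env python3
"""Sketch: flatten MobiDB disorder regions into a tab-separated table."""
import gzip
import json
import sys


def region_rows(record, method="simple"):
    """Yield (acc, start, end, disorder_type) tuples for one protein record."""
    predictors = (
        record.get("mobidb_consensus", {})
        .get("disorder", {})
        .get("predictors", [])
    )
    for predictor in predictors:
        # Keep only the predictor of interest (assumed: "simple").
        if predictor.get("method") != method:
            continue
        for start, end, disorder_type in predictor.get("regions", []):
            yield record["acc"], start, end, disorder_type


def main(path, out=sys.stdout):
    """Read the file line by line and print one table row per region."""
    opener = gzip.open if path.endswith(".gz") else open
    print("ID", "region_start", "region_end", "disorder_type", sep="\t", file=out)
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # Each non-empty line is assumed to be one complete protein record.
            for row in region_rows(json.loads(line)):
                print(*row, sep="\t", file=out)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: disorder_to_tbl.py FILE.mjson[.gz]", file=sys.stderr)
```

Because the file is read one line at a time, every protein in the proteome is processed in a single pass without hard-coding accessions and without loading the whole file into memory. Passing `method="mobidb-lite"` to `region_rows` would select the other predictor's regions instead.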

ADD COMMENT • link 5.9 years ago by vkkodali_ncbi ★ 3.8k

Thanks! However, I really need to do this for the whole proteome, so hard-coding each protein would be difficult. Is there a way to loop over each protein in a high-throughput manner?