How can I parse a json file from MobiDB to retrieve proteome data?
5.9 years ago
Jason2 • 0

I've downloaded the MobiDB human data in JSON format. It contains disorder information for regions of each protein. The issue I'm having is converting the nested JSON structure into a table.

The file can be downloaded here (click on "json" next to human).

Here's an example of two proteins:

{  "acc" : "O43760",  "sequence" : "MESGAYGAAKAGGSFDLRRFLTQPQVVARAVCLVFALIVFSCIYGEGYSNAHESKQMYCVFNRNEDACRYGSAIGVLAFLASAFFLVVDAYFPQISNATDRKYLVIGDLLFSALWTFLWFVGFCFLTNQWAVTNPKDVLVGADSVRAAITFSFFSIFSWGVLASLAYQRYKAGVDDFIQNYVDPTPDPNTAYASYPGASVDNYQQPPFTQNAETTEGYQPPPVY",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 19, "D" ], [ 46, 54, "D" ], [ 96, 100, "D" ], [ 178, 224, "D" ] ], "method" : "simple" }, { "regions" : [ ], "dc" : 0, "method" : "mobidb-lite", "scores" : [ 0.625, 0.75, 0.75, 0.5, 0.625, 0.625, 0.5, 0.375, 0.5, 0.5, 0.375, 0.375, 0.375, 0.25, 0.25, 0.25, 0.25, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.25, 0.375, 0.375, 0.375, 0.375, 0.5, 0.5, 0.5, 0.375, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.625, 0.5, 0.5, 0.625, 0.75, 0.75, 0.75, 0.75, 0.875, 0.75, 0.875, 0.625, 0.625, 0.875, 0.875, 0.875, 0.75, 0.75, 0.75 ] } ] } } }
{  "acc" : "Q92728",  "sequence" : "MPPKTPRKTAATAAAAAAEPPAPPPPPPPEEDPEQDSGPEDLPLVRGSITNGR",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 53, "D" ] ], "method" : "simple" }, { "regions" : [ [ 1, 53, "D_WC" ] ], "dc" : 1, "method" : "mobidb-lite", "scores" : [ 1, 1, 0.875, 1, 1, 0.875, 0.75, 0.875, 0.75, 0.75, 0.75, 0.75, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.75, 0.75, 0.875, 1, 0.875, 0.875, 1, 1, 0.875, 0.875 ] } ] } } }

This type of structure is repeated thousands of times to capture information across many proteins.

I would like to format the data as a simple table with columns for the accession ID and the disorder predictor regions:

ID, region_start, region_end, disorder_type
O43760, 1, 19, "D"
O43760, 46, 54, "D"
O43760, 96, 100, "D"
O43760, 178, 224, "D"
Q92728, 1, 53, "D"

I know my way around the terminal (e.g. command-line tools, awk) fairly well and I also use R, so solutions using those tools would be greatly appreciated, but another approach is fine too. I have been experimenting with jq and jt but haven't managed to solve this problem with them yet.

Any help would be appreciated! Thanks!

database annotation parsing proteins format • 1.9k views

Can I ask how you got your file in JSON format? Every time I have tried to download it, I can only get the mjson format, which does not load in R.


I doubt they ever had the files in JSON format. At least when I wrote the script in my answer below, the files were in mjson format (see the Usage portion of the answer; the filename there has mjson in it).

5.9 years ago
vkkodali_ncbi ★ 3.8k

If you can use Python, the standard-library json module can deal with this. Check out https://docs.python.org/3/library/json.html, specifically the 'Decoding JSON' part.
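As a rough illustration of what decoding looks like, here is a minimal sketch for a single record. The record text is a shortened copy of the Q92728 example from the question (scores omitted), and keeping only the "simple" predictor is an assumption based on the table the question asks for.

```python
import json

# One shortened record in the shape shown in the question (scores omitted).
record_text = """
{"acc": "Q92728",
 "mobidb_consensus": {"disorder": {"predictors": [
   {"regions": [[1, 53, "D"]], "method": "simple"},
   {"regions": [[1, 53, "D_WC"]], "dc": 1, "method": "mobidb-lite"}
 ]}}}
"""

# json.loads turns the JSON text into nested dicts and lists.
record = json.loads(record_text)
predictors = record["mobidb_consensus"]["disorder"]["predictors"]

rows = []
for predictor in predictors:
    # Assumption: only the "simple" consensus regions are wanted.
    if predictor["method"] == "simple":
        for start, end, kind in predictor["regions"]:
            rows.append((record["acc"], start, end, kind))

for row in rows:
    print(*row, sep=", ")
```

Each region is a three-element list, so it unpacks directly into start, end and disorder type.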

You can use the quick-and-dirty script shown below:

Usage:

./disorder_to_tbl.py disorder_UP000005640.mjson.gz > output_table.tsv

At least for the human file you have pointed to, I did not encounter any errors.
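For readers who want a starting point, a script along these lines might look like the sketch below. It is not necessarily identical to the script referred to above. It assumes that the mjson file holds one JSON record per line (as in the two example records in the question), that the file may or may not be gzip-compressed, and that only regions from the "simple" predictor are wanted. The function name `iter_region_rows` is made up for this illustration.

```python
#!/usr/bin/env python3
"""Sketch: flatten MobiDB disorder regions into a tab-separated table."""
import gzip
import json
import sys


def iter_region_rows(lines, method="simple"):
    """Yield (acc, start, end, disorder_type) for each region of one predictor.

    Assumes every non-empty line is a complete JSON record.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Records without disorder annotation simply produce no rows.
        predictors = (
            record.get("mobidb_consensus", {})
            .get("disorder", {})
            .get("predictors", [])
        )
        for predictor in predictors:
            if predictor.get("method") != method:
                continue
            for start, end, kind in predictor.get("regions", []):
                yield record["acc"], start, end, kind


def main(argv):
    if len(argv) != 2:
        print(f"usage: {argv[0]} FILE.mjson[.gz] > output_table.tsv", file=sys.stderr)
        return
    path = argv[1]
    # Read gzip-compressed and plain-text files the same way.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        print("ID", "region_start", "region_end", "disorder_type", sep="\t")
        for row in iter_region_rows(handle):
            print(*row, sep="\t")


if __name__ == "__main__":
    main(sys.argv)
```

Because the file is read line by line, the whole proteome never has to be held in memory. Passing `method="mobidb-lite"` to `iter_region_rows` would report that predictor's regions instead.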


Thanks! However, I really need to do this for the whole proteome, so hard-coding each protein would be difficult. Is there a way to loop over every protein in a high-throughput manner?


I updated my answer to change it to a script that you can use.
