To Create Protovis Sunburst Charts : Python Script To Create Dataset In Json Format (Or) Parent - Child Json
11.7 years ago
ram.dsramesh ▴ 40

Help! As a biologist, I am mainly interested in visualizing and displaying my data, and I am very new to programming. I have a data set in an Excel file, which looks like this:

data set

I think a Protovis sunburst chart would be a very good way to display this data set.

But I am stuck on preparing the data, which has to be in JSON format. Note that the structure of the data is hierarchical (a parent-child hierarchy). Since I am not good at programming (I know only a little Python), it is difficult for me to go further.

I need a Python script that can read my Excel file and generate JSON as specified above.

In my data set, there is a parent-child relationship: L1 is the parent of L2, L2 is the parent of L3, and so on:

  • L1 (PARENT) - L2 (CHILD)
  • L2 (PARENT) - L3 (CHILD)
  • L3 (PARENT) - L4 (CHILD)
  • L4 (PARENT) - L5 (CHILD)
  • L5 (PARENT) - GENE_NAME (CHILD)

sunburst chart

I hope to get my data set visualized in the above format, but for that the data set has to be in the JSON format specified here.

I would like to display my data something like this:

MY_IMAGE

Any help is appreciated.

python • 21k views
11.7 years ago

Open your data in Excel and save it as a CSV file.

L1,L2,L3,L4,L5,GENE_NAME
Enzyme,Kinase,Protein Kinase,Ser_Thr,Cmgc,MAPK11
Enzyme,Kinase,Protein Kinase,Tyr,Tk,ABL1
Enzyme,Kinase,Protein Kinase,Tyr,Tk,PDGFRB
Enzyme,Kinase,Protein Kinase,Tyr,Tk,PDGFRA
Enzyme,Kinase,Protein Kinase,Ser_Thr,Tkl,ALK
Enzyme,Isomerase,Isomerase Other,,,gyrB
Enzyme,Oxidoreductase,Oxidoreductase Other,,,ALOX5
Enzyme,Oxidoreductase,Oxidoreductase Other,,,IMPDH1
Enzyme,Transferase,Transferase Other,,,COMT
Enzyme,Oxidoreductase,Oxidoreductase Other,,,RRM1
Enzyme,Oxidoreductase,Oxidoreductase Other,,,PTGS2
Enzyme,Lyase,Lyase Other,,,POLB
Enzyme,Lyase,Lyase Other,,,CA5B
Enzyme,Hydrolase,Hydrolase Other,,,GAA
Enzyme,Protease,Metallo,MAM,M10A,MMP8
Enzyme,Lyase,Lyase Other,,,CA5A
Enzyme,Lyase,Lyase Other,,,CA7

You can do the rest in Python:

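import csv
import json
import sys

tree = {}

# Open the CSV file named on the command line (Python 3).
with open(sys.argv[1], newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row (L1,L2,...,GENE_NAME)
    for row in reader:
        subtree = tree  # start every row at the root of the tree
        for i, cell in enumerate(row):
            if cell:  # skip blank cells (optional levels such as L4 and L5)
                if cell not in subtree:
                    # Inner levels are dicts; the last column (the gene) is a leaf of size 1.
                    subtree[cell] = {} if i < len(row) - 1 else 1
                subtree = subtree[cell]

print(json.dumps(tree, indent=4))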

Save the script as csv2json.py and run it:

python csv2json.py test.csv

It gives you (the order of the keys may vary):

{
    "Enzyme": {
        "Protease": {
            "Metallo": {
                "MAM": {
                    "M10A": {
                        "MMP8": 1
                    }
                }
            }
        }, 
        "Isomerase": {
            "Isomerase Other": {
                "gyrB": 1
            }
        }, 
        "Kinase": {
            "Protein Kinase": {
                "Tyr": {
                    "Tk": {
                        "ABL1": 1, 
                        "PDGFRB": 1, 
                        "PDGFRA": 1
                    }
                }, 
                "Ser_Thr": {
                    "Tkl": {
                        "ALK": 1
                    }, 
                    "Cmgc": {
                        "MAPK11": 1
                    }
                }
            }
        }, 
        "Transferase": {
            "Transferase Other": {
                "COMT": 1
            }
        }, 
        "Lyase": {
            "Lyase Other": {
                "CA5B": 1, 
                "CA5A": 1, 
                "CA7": 1, 
                "POLB": 1
            }
        }, 
        "Oxidoreductase": {
            "Oxidoreductase Other": {
                "PTGS2": 1, 
                "ALOX5": 1, 
                "IMPDH1": 1, 
                "RRM1": 1
            }
        }, 
        "Hydrolase": {
            "Hydrolase Other": {
                "GAA": 1
            }
        }
    }
}

Thanks for the script, it is very useful. I still have one more issue to resolve, though. I generated the JSON file from my data set, but now there is a problem with how blank cells are handled.

L1,L2,L3,L4,L5,GENE_NAME
Enzyme,Kinase,Protein Kinase,Ser_Thr,Cmgc,MAPK11
Enzyme,Kinase,Protein Kinase,Tyr,Tk,ABL1
Enzyme,Kinase,Protein Kinase,Tyr,Tk,PDGFRB
Enzyme,Kinase,Protein Kinase,Tyr,Tk,PDGFRA
Enzyme,Kinase,Protein Kinase,Ser_Thr,Tkl,ALK
Enzyme,Isomerase,Isomerase Other,,,gyrB
Enzyme,Oxidoreductase,Oxidoreductase Other,,,ALOX5
Enzyme,Oxidoreductase,Oxidoreductase Other,,,IMPDH1
Enzyme,Transferase,Transferase Other,,,COMT
Enzyme,Oxidoreductase,Oxidoreductase Other,,,RRM1
Enzyme,Oxidoreductase,Oxidoreductase Other,,,PTGS2
Enzyme,Lyase,Lyase Other,,,POLB
Enzyme,Lyase,Lyase Other,,,CA5B
Enzyme,Hydrolase,Hydrolase Other,,,GAA
Enzyme,Protease,Metallo,MAM,M10A,MMP8
Enzyme,Lyase,Lyase Other,,,CA5A
Enzyme,Lyase,Lyase Other,,,CA7

As you can see, there are a few blank cells in the data set. L1, L2, L3, and GENE_NAME are mandatory fields, while L4 and L5 are optional (they may or may not be present). When a row has no data in L4 and L5, the JSON I get contains two nested levels of empty-string keys. These should be detected and removed.

"Cytosolic other": {
            "Cytosolic other": {
                "": {
                    "": {
                        "MCL1": 1, 
                        "TNNC1": 1
                    }
                }
            }
        }, 
        "Structural": {
            "Structural Other": {
                "": {
                    "": {
                        "TUBA3C": 1, 
                        "TUBB8": 1, 
                        "TUBB4B": 1, 
                        "TUBB4A": 1, 
                        "TUBB3": 1, 
                        "TUBB": 1, 
                        "TUBA4A": 1, 
                        "TUBB1": 2
                    }
                }
            }
        }
}

Could you help me handle these cases? Any help is much appreciated. Cheers!


Sure! I edited my answer. This should do the trick.
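
As a minimal sketch of how blank cells can be skipped, the function below ignores empty strings, so optional levels such as L4 and L5 never turn into "" keys. The name `build_tree` and the two sample rows are only illustrative and are not part of the original script:

```python
def build_tree(rows):
    """Nest each row's non-empty cells; the last column becomes a leaf of size 1."""
    tree = {}
    for row in rows:
        subtree = tree
        last = len(row) - 1
        for i, cell in enumerate(row):
            if not cell:
                continue  # blank cell: this optional level is absent for this row
            if cell not in subtree:
                subtree[cell] = {} if i < last else 1
            subtree = subtree[cell]
    return tree


# Two sample rows: one with blank L4/L5 cells, one with all levels filled in.
rows = [
    ["Enzyme", "Lyase", "Lyase Other", "", "", "CA7"],
    ["Enzyme", "Protease", "Metallo", "MAM", "M10A", "MMP8"],
]
tree = build_tree(rows)
print(tree["Enzyme"]["Lyase"])  # {'Lyase Other': {'CA7': 1}}
```

The gene ends up directly under the deepest level that is actually filled in, which is L3 for the first row and L5 for the second.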

11.7 years ago

Here is a Python script I wrote a while back to produce JSON data for a D3.js sunburst diagram, which is similar to Protovis (same author). What's nice about Python is that printing a dict as a string gives you something very close to JSON. You need to get your data into tab-delimited format, and you might have to modify the script a little to get it to work with Protovis.

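import sys

dataStructure = {}
with open(sys.argv[1]) as handle:
    for line in handle:
        # Split on tabs only, so names that contain spaces stay intact.
        data = line.rstrip('\r\n').split('\t')
        if len(data) < 2:
            continue  # skip blank or malformed lines

        # Walk down (creating as needed) every level except the last two columns.
        current = dataStructure
        for item in data[:-2]:
            if item not in current:
                current[item] = {}
            current = current[item]

        # The second-to-last column is the leaf; count the rows (genes) under it.
        # The last column (the gene name itself) is not stored.
        current[data[-2]] = current.get(data[-2], 0) + 1

print('var data = ' + str(dataStructure))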

Save it as script.py and run it with:

python script.py myData.tabdelimited > myData.json

For example, here are some sample data:

A    1    F    gene1
A    1    F    gene2
A    2    G    gene3
A    2    G    gene4
A    2    H    gene5
B    3    I    gene6
C    4    J    gene7
C    5    K    gene8
D    6    L    gene9
D    6    M    gene10
D    6    L    gene11

Here is the output of the script:

var data = {'A': {'1': {'F': 2}, '2': {'H': 1, 'G': 2}}, 'C': {'5': {'K': 1}, '4': {'J': 1}}, 'B': {'3': {'I': 1}}, 'D': {'6': {'M': 1, 'L': 2}}}
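
Note that the printed dict uses single quotes, so it is a JavaScript object literal rather than strict JSON. If you need strict JSON, or the name/children layout that many D3 hierarchy examples use, a sketch along the following lines may help. The function name `to_children` and the `size` key are assumptions made for illustration, so check which keys your chart code actually expects:

```python
import json


def to_children(name, node):
    """Convert a nested dict of counts into a name/children hierarchy."""
    if isinstance(node, dict):
        return {
            "name": name,
            "children": [to_children(k, v) for k, v in sorted(node.items())],
        }
    return {"name": name, "size": node}  # leaf: node is a count


# Part of the sample structure shown above.
nested = {"A": {"1": {"F": 2}, "2": {"G": 2, "H": 1}}, "B": {"3": {"I": 1}}}
hierarchy = to_children("root", nested)
print(json.dumps(hierarchy, indent=2))  # json.dumps emits strict, double-quoted JSON
```

Sorting the keys is only there to make the output order reproducible.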
10.1 years ago

Working on a Macintosh, I found it helpful to replace the:

'reader = ...'

line with:

reader = csv.reader(open("filename.csv", 'rU'), quotechar='"', delimiter = ',')

This got me past the:

new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Note that this hard-codes the name of the CSV file rather than taking it as a command-line argument.
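
On Python 3 the 'rU' mode is no longer available (it was removed in Python 3.11). The csv documentation recommends opening the file with newline='' instead, and the reader then copes with old Mac-style '\r' line endings as well. A minimal sketch, with a temporary file and made-up rows standing in for the real CSV:

```python
import csv
import os
import tempfile

# Write a small CSV with old Mac-style '\r' line endings (sample data only).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"L1,L2,GENE_NAME\rEnzyme,Lyase,CA7\r")
    path = tmp.name

# newline='' passes the line endings through untranslated for the csv module to handle.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

os.remove(path)
print(rows)  # [['L1', 'L2', 'GENE_NAME'], ['Enzyme', 'Lyase', 'CA7']]
```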
