Reading gene expression data from an Excel spreadsheet into a python dictionary
5
0
Entering edit mode
10.7 years ago
xgeneral57 ▴ 10

I have an Excel spreadsheet (v14.4.1 for Mac) containing gene expression data in different conditions:

     cond1  cond2   cond3
gene1  1.57   2.52   12.05
gene2  0.01   8.90   31.20   
gene3  57.05  12.14  50.76
...

(i.e., the first row and the first column of the spreadsheet contain the condition names and gene names, respectively)

I would like to read this Excel sheet into a python dictionary whose keys are tuples containing the gene and condition names:

myDict[('gene1','cond1')] = 1.57
myDict[('gene1','cond2')] = 2.52
myDict[('gene1','cond3')] = 12.05
myDict[('gene2','cond1')] = 0.01
...

It think I need to use cvs.DictReader (after saving the spreadsheet as a cvs file), but, I am not sure how to do this.

gene-expression import python • 7.6k views
ADD COMMENT
1
Entering edit mode

What do you plan to do with the data once it has been read and stored? Just asking to ascertain that you're using Python for the right reasons.

ADD REPLY
0
Entering edit mode

I am doing some custom analysis using optimization solvers.

ADD REPLY
2
Entering edit mode
10.7 years ago
fbrundu ▴ 350

Hi,

I suggest you to take a look at Pandas library. It offers you some useful features like dataframes where you can store gene matrices and do a lot of things with them.

In your case it will be (supposing you saved your datasheet as csv):

matrix = pandas.read_table('data.csv', index_col=0)
ADD COMMENT
2
Entering edit mode
10.7 years ago

Save your Excel spreadsheet as a tab-delimited text file (File > Save As...) and then use the following script to convert it to a tuple-keyed Python dictionary:

#!/usr/bin/env python

import sys

f = sys.stdin
l = f.readline().rstrip("\r\n")
conds = l.split("\t")
conds.pop(0)

d = dict()
for l in f:
    l = l.rstrip("\r\n")
    vals = l.split("\t")
    gene = vals[0]
    vals.pop(0)
    valIdx = 0
    for val in vals:
        cond = conds[valIdx]
        d[(gene, cond)] = float(val)
        valIdx += 1

for tupleKey in d.keys():
    print '\t'.join([str(tupleKey), str(d[tupleKey])])

You could run it as follows (here, I'm using example data from another tab-delimited matrix file):

$ tdf2tupleDict.py < myData.tdf
('NCI_H460-DS23114', 'fIntestine_Sm')   0.174936
('PANC1-DS9955', 'fPlacenta')   0.446492
('LNCap-DS14684', 'fAdrenal')   0.699236
...

This assumes that gene and condition names are unique, else there is key collision and values are overwritten.

Further, as you probably know, dictionary keys are usually not stored in any particularly useful order; I include the print loop only to demonstrate lookups. You'd need to do extra work to get keys into a specific order, but hopefully this answer addresses the parameters of your specific request (namely, a Python dictionary using tuples as keys).

ADD COMMENT
1
Entering edit mode
10.7 years ago

Hi Xgeneral57,

If you can save your data as a two-column TSV or CSV file without header, then you can use the following two python functions to load them up or save to a TSV file:

import csv

def write_dict_file_tsv(mydict,filename):
    writer = csv.writer(open(filename, 'wb'),delimiter='\t')
    for key, value in mydict.items():
        writer.writerow([key, value])   

def read_dict_file_tsv(filename):
    reader = csv.reader(open(filename, 'rb'),delimiter='\t')
    mydict = dict(x for x in reader)
    return mydict

change delimiter to ',' for a CSV file:

Use them as follows:

mydict={}
mydict=read_dict_file_tsv("input.tsv")
write_dict_file_tsv(mydict,"output.tsv")

input.tsv file is:

gene1[SPACE]cond1[TAB]1.57
gene1[SPACE]cond2[TAB]2.52
gene1[SPACE]cond3[TAB]12.05
gene2[SPACE]cond1[TAB]0.01

Once you load this file up, split the keys by using .split(" ") to get gene name and condition out.

To generate input.tsv in the above format, save your worksheet as a TSV file and then use the following one-liner by redirecting the output of the one-liner > input.tsv:

$ cat test.tsv
Genes    cond1    cond2    cond3
gene1    1.57    2.52    12.05
gene2    0.01    8.90    31.20   
gene3    57.05    12.14    50.76

$ perl -ane '$.==1?@n=@F:map{$r{$F[0]." ".$n[$_]}=$F[$_]}(1..$#F)}{print "$_\t$r{$_}\n" foreach (keys%r);' test.tsv
gene3 cond1    57.05
gene1 cond2    2.52
gene3 cond2    12.14
gene2 cond2    8.90
gene3 cond3    50.76
gene1 cond1    1.57
gene1 cond3    12.05
gene2 cond1    0.01
gene2 cond3    31.20

Best Wishes,
Umer

ADD COMMENT
1
Entering edit mode
10.7 years ago
pld 5.1k

This is pretty trivial, you don't even need the dictonary reader, should be something like this:

import csv
import sys
infile = sys.argv[1]
results = dict()
with open(infile, "rb") as fi:
    data = list(csv.reader(fi, delimiter = ","))[1:]
for line in data:
    for x in xrange(1,len(line)):
        results[(line[0], "cond".format(x))] = line[x]

This assumes that the first column (line[0]) will be the gene ID, and the rest of the columns are conditions (line[1:]).

ADD COMMENT
0
Entering edit mode
10.7 years ago
xgeneral57 ▴ 10

Thanks a lot everybody for the replies. All suggestions work and they were extremely useful. I finally ended with the following short piece of code by modifying what joe.cornish826 suggested (instead of a csv file I had to save the Excel sheet as Windows Formatted Text though as importing into CVS using Excel 2011 for mac results in control+M chracters instead of end of line):

import sys
import csv
with open('data.txt','rb') as inputfile:
    rawData=list(csv.reader(inputfile, delimiter = "\t"))

rawData
[['', 'col1', 'col2', 'col3'], ['row1', '1.57', '2.52', '12.05', ''], ['row2', '0.01', '8.90', '31.20'], ['row3', '57.05', '12.14', '50.76']]

dictData = dict([((row[0],col),float(row[rawData[0].index(col)])) for row in rawData[1:] for col in rawData[0][1:]])

dictData
{('row3', 'col1'): 57.05, ('row3', 'col2'): 12.14, ('row1', 'col2'): 2.52, ('row3', 'col3'): 50.76, ('row1', 'col3'): 12.05, ('row1', 'col1'): 1.57, ('row2', 'col3'): 31.2, ('row2', 'col2'): 8.9, ('row2', 'col1'): 0.01}
ADD COMMENT
0
Entering edit mode

To deal with Microsoft's version of line breaks in OS X or Linux, run a program like dos2unix on your matrix file. Then you can manipulate the file with standard Unix tools, Python scripts, etc.

ADD REPLY

Login before adding your answer.

Traffic: 1856 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6