Hi community, I have a final assignment for an introductory bioinformatics course. My overall idea is to take an existing gene expression dataset from GEO and use it to construct gene co-expression networks. Here is the dataset I'd like to work with: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS5232 In this study, gene expression was measured in "young" as well as "old" patients diagnosed with colorectal cancer. Therefore, the aim of my project is to identify co-expressed genes in the young population and compare these interactions with those in the old population, and vice versa. I hope to demonstrate visually, through the networks, how the connectivity measures of specific genes (closeness, betweenness, degree) change.
I am considering three general steps:
1. Process the table: remove null values and average the values of the same gene measured by different oligos. Then normalize the values (mean 0, std 1).
2. Produce the respective Pearson correlation matrix for each of the two populations (have a look at the demo table I uploaded). From each matrix, by setting a cut-off on the absolute correlation (e.g. abs(r) >= 0.75), I'll extract only the gene pairs with the highest correlation.
3. Produce another table/file that is manageable in Cytoscape to show the interactions I referred to earlier.
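To make steps 1 and 2 concrete, here is a minimal sketch in Python (not my MATLAB code). The gene names and values are made up for illustration, the function names are hypothetical, and it assumes the population standard deviation for the normalization, so that Pearson r is simply the mean of the products of two z-scored vectors. It would be run once per population.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def average_probes(rows):
    """Step 1a: drop rows with missing values, then average probes per gene."""
    grouped = defaultdict(list)
    for gene, values in rows:
        if any(v is None for v in values):
            continue  # "remove null values"
        grouped[gene].append(values)
    # zip(*probes) walks the samples column by column
    return {gene: [sum(col) / len(col) for col in zip(*probes)]
            for gene, probes in grouped.items()}


def zscore(values):
    """Step 1b: scale one gene to mean 0 and (population) std 1."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        return None  # a constant gene has no defined correlation
    return [(v - mean) / sd for v in values]


def correlated_pairs(expr, cutoff=0.75):
    """Step 2: Pearson r for every gene pair, keeping abs(r) >= cutoff."""
    z = {g: zscore(v) for g, v in expr.items()}
    z = {g: v for g, v in z.items() if v is not None}
    pairs = []
    for a, b in combinations(sorted(z), 2):
        # For z-scored vectors, Pearson r is the mean of the products.
        r = sum(x * y for x, y in zip(z[a], z[b])) / len(z[a])
        if abs(r) >= cutoff:
            pairs.append((a, b, r))
    return pairs


# Toy input for ONE population: (gene, expression values across samples).
# All names and numbers here are invented for illustration.
rows = [
    ("GENE_A", [1.0, 2.0, 3.0, 4.0]),   # two probes for the same gene
    ("GENE_A", [3.0, 4.0, 5.0, 6.0]),
    ("GENE_B", [2.0, 4.0, 6.0, 8.5]),
    ("GENE_C", [4.0, 3.0, 2.0, 1.0]),
    ("GENE_D", [1.0, None, 2.0, 3.0]),  # dropped: contains a null value
    ("GENE_E", [1.0, 3.0, 2.0, 2.5]),
]
expr = average_probes(rows)
pairs = correlated_pairs(expr, cutoff=0.75)
for a, b, r in pairs:
    print(f"{a}\t{b}\t{r:+.3f}")
```

Comparing all gene pairs this way scales quadratically with the number of genes, so for the full dataset a matrix-based correlation routine would be the more practical choice.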
I have already completed the first and second steps (I used MATLAB, which is all I know). I'd be happy to share the code, though I'm not graded on its efficiency etc.
What do you think of the workflow? Will it work?
I really need help moving the data from the second step into Cytoscape. If my suggestion for how to use the data is not realistic, please suggest an alternative approach.
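For reference, one possible shape for the file in step 3 is a plain edge table with one row per gene pair. The sketch below is in Python with made-up pairs, and the column names (`source`, `target`, `pearson_r`, `abs_r`, `sign`) are my own assumption rather than anything Cytoscape requires. As far as I understand, Cytoscape can read such a delimited file through File > Import > Network from File, where you choose which columns are the source node, the target node, and edge attributes.

```python
import csv
import io


def write_edge_table(pairs, handle):
    """Write (gene_a, gene_b, r) tuples as a delimited edge table."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["source", "target", "pearson_r", "abs_r", "sign"])
    for gene_a, gene_b, r in pairs:
        sign = "positive" if r > 0 else "negative"
        writer.writerow([gene_a, gene_b, f"{r:.4f}", f"{abs(r):.4f}", sign])


# Hypothetical thresholded pairs for the "young" population.
young_pairs = [("GENE_A", "GENE_B", 0.91), ("GENE_A", "GENE_C", -0.82)]

# An in-memory buffer keeps the example self-contained; for a real file,
# pass open("young_edges.csv", "w", newline="") instead (file name is arbitrary).
buffer = io.StringIO()
write_edge_table(young_pairs, buffer)
edge_csv = buffer.getvalue()
print(edge_csv)
```

Writing one file per population (young, old) keeps the two networks separate, and the `abs_r` and `sign` columns can then be mapped to edge width and colour in the visual style.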
Thanks a bunch.
Thanks for the detailed response!
I'd like to clarify that I don't have any aspirations to develop this to a paper. Just a final assignment. My suggestion had already been approved.
I'll have to reread your final paragraph carefully, as it may point me in the right direction (perhaps I'm off course now).
Also, what I described in my post is not the whole scope of the work. After producing the network, my colleagues and I will move on to focus on selected genes using various tools (TargetScanHuman, MEME motif finding, etc.). Here's a link to the "Abstract" of our plan, including the supervisor's comments:
https://www.dropbox.com/s/y4yn23c3ktk5sbb/Gene%20Expression%20Project%20-%20036867190%20015938046%20034649491%20%28feedback%29.docx?dl=0
Sorry for not having answered sooner. If your plan is to look for shared transcription factor motifs etc., co-expression makes good sense. However, for that purpose I would not convert it to a binary network; in my opinion you would be much better off not applying a cutoff and thus keeping a weighted network. Then apply clustering to that, e.g. MCL, to identify co-expressed modules, and look for shared motifs within each of those.
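To illustrate what I mean, here is a toy sketch of the MCL idea in Python: alternate expansion (squaring the flow matrix) and inflation (raising entries to a power and renormalizing the columns) until the matrix stops changing. It is a simplified dense version for a handful of genes, not the real `mcl` program. The weights are invented abs(r) values with no cutoff applied, the self-loop weight of 1.0 and the 1e-6 read-out threshold are my assumptions, and reading clusters as connected components of the final matrix merges any overlapping clusters.

```python
def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic), in place."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m


def mcl(weights, inflation=2.0, max_iter=50, tol=1e-9):
    """Run expansion/inflation rounds on a symmetric weight matrix."""
    n = len(weights)
    m = [[float(x) for x in row] for row in weights]
    for i in range(n):
        m[i][i] = 1.0  # assumed self-loop weight
    normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: matrix product of m with itself.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: elementwise power, then renormalize the columns.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    return m


def clusters_from_matrix(m, threshold=1e-6):
    """Group nodes linked by any remaining flow above the threshold."""
    n = len(m)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Invented abs(r) weights: A, B, C co-vary strongly, as do D and E.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
weights = [
    [0.00, 0.90, 0.85, 0.10, 0.05],
    [0.90, 0.00, 0.80, 0.05, 0.10],
    [0.85, 0.80, 0.00, 0.10, 0.10],
    [0.10, 0.05, 0.10, 0.00, 0.90],
    [0.05, 0.10, 0.10, 0.90, 0.00],
]
flow = mcl(weights, inflation=2.0)
modules = [[genes[i] for i in group] for group in clusters_from_matrix(flow)]
print(modules)
```

The inflation value controls granularity (higher values tend to give more, smaller modules), so it is worth trying a few settings. For thousands of genes this dense version would be far too slow, and a dedicated MCL implementation is the realistic route.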