Question

Co-expression network using a dataset from GEO

8.5 years ago

omer.k ▴ 110

Hi community, I have a final assignment for an introductory bioinformatics course. My overall idea is to take an existing gene expression dataset from GEO and use it to construct gene co-expression networks. Here is the dataset I'd like to work with: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS5232 In this study, gene expression was measured in "young" as well as "old" patients diagnosed with colorectal cancer. Therefore, the aim of my project would be to identify co-expressed genes in the young population and compare these interactions to those in the old population, and vice versa. I hope to demonstrate visually, through the network, how the network parameters of specific genes change (closeness, betweenness, degree).

I considered moving in three general steps:

1. Process the table: remove null values and average the values of the same gene measured by different oligos. Then normalize the values (mean 0, std 1).
2. Produce the respective Pearson correlation matrix for each of the two populations (have a look at the demo table I uploaded). From this matrix, by setting a cut-off (e.g. abs(r) >= 0.75), I'll extract just the gene pairs with the highest correlation.
3. Produce another table/file that is manageable in Cytoscape to show the interactions I referred to earlier.
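Step 1 might be sketched as follows in Python (the original work was done in MATLAB, which is not shown here). The gene names, the values, and the helper names `average_probes` and `zscore` are made up for illustration, and rows with any missing value are simply dropped, which is only one of several possible choices.

```python
from collections import defaultdict
from statistics import mean, pstdev


def average_probes(rows):
    """Average the probes (oligos) that measure the same gene.

    rows: list of (gene_symbol, [one value per sample]) tuples.
    Rows with a missing gene symbol or any missing value are dropped.
    """
    by_gene = defaultdict(list)
    for gene, values in rows:
        if gene and all(v is not None for v in values):
            by_gene[gene].append(values)
    # zip(*probes) walks the samples column by column
    return {g: [mean(col) for col in zip(*probes)] for g, probes in by_gene.items()}


def zscore(values):
    """Scale one gene's profile to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)  # flat profile: nothing to scale
    return [(v - mu) / sd for v in values]


# Hypothetical toy table: two probes for GENE_A, one incomplete row for GENE_B
rows = [
    ("GENE_A", [1.0, 2.0, 3.0]),
    ("GENE_A", [3.0, 4.0, 5.0]),
    ("GENE_B", [2.0, None, 1.0]),
    ("GENE_C", [5.0, 5.0, 8.0]),
]
averaged = average_probes(rows)
normalized = {g: zscore(v) for g, v in averaged.items()}
```

Note that the Pearson correlation is unchanged by per-gene scaling like this, so the normalization does not affect step 2; it mainly helps when profiles are plotted or compared side by side.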

I already have the first and second steps done (I used MATLAB, which is all I know; I'd be happy to share the code, though I'm not graded on its efficiency etc.).

What do you think of the workflow, will it work?

I really need help moving the data from the second step into Cytoscape. If my suggestion for how to use the data is not realistic, please suggest an alternative approach.

Thanks a bunch.

RNA-Seq Gene Cytoscape • 3.0k views

updated 8.5 years ago by Lars Juhl Jensen 11k • written 8.5 years ago by omer.k ▴ 110

score 2 · Answer 1 · 2017-02-27

Whether it will work really depends on what you mean by "work".

1) Can it be done? Yes. You can calculate Pearson correlation coefficients between genes for each of the two subsets of samples. You can apply an arbitrary cutoff to them and obtain two networks. You can calculate various network parameters for the nodes in the two networks and compare them. You can load everything into Cytoscape as nodes and edges, or even create an animation that morphs the young network into the old one if you want. There is no doubt that what you propose can be done.
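To make the mechanics concrete, here is a minimal sketch in plain Python of going from expression profiles to a thresholded edge table. The gene names and values are invented, the function names are hypothetical, and the column headers `source`, `target` and `pearson_r` are just a choice; Cytoscape's table-based network import lets you pick which columns are the source and target nodes.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # flat profile: correlation undefined, treated as "no edge"
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(expr, cutoff=0.75):
    """Keep gene pairs whose absolute correlation reaches the cutoff."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= cutoff:
            edges.append((g1, g2, r))
    return edges


def degree(edges):
    """Number of edges per gene; genes without edges are absent."""
    deg = {}
    for g1, g2, _ in edges:
        deg[g1] = deg.get(g1, 0) + 1
        deg[g2] = deg.get(g2, 0) + 1
    return deg


def to_tsv(edges):
    """Tab-separated edge table with a header row."""
    lines = ["source\ttarget\tpearson_r"]
    lines += [f"{a}\t{b}\t{r:.3f}" for a, b, r in edges]
    return "\n".join(lines) + "\n"


# Hypothetical expression profiles for the "young" samples (one list per gene)
young = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.0, 4.1, 5.9, 8.2, 9.8],
    "GENE_C": [5.0, 4.0, 3.1, 2.0, 0.9],
    "GENE_D": [1.0, 3.0, 2.0, 3.0, 1.0],
}
young_edges = coexpression_edges(young)
young_degree = degree(young_edges)
young_tsv = to_tsv(young_edges)  # write this string to a .tsv file for import
```

Running the same functions on the "old" samples gives a second edge table, and the two degree dictionaries can then be compared gene by gene. Closeness and betweenness need shortest-path computations, which Cytoscape's built-in network analysis or a graph library can do for you.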

2) Would it work as a final assignment? Probably. It obviously depends on the course and what the professor wants to evaluate. It shows that you are able to actually perform some hands-on analyses. It gives you an opportunity to show that you understand what a co-expression network is, what the network parameters mean, and how to interpret them. It will also allow you to show that you can critically appraise your results. As a final assignment, it would thus work, assuming that it is within the scope of the course.

3) Would it work from a scientific standpoint? No. I honestly do not think this will be able to give useful insights into development of colorectal cancer in young and old. There are numerous reasons for this: the number of samples in each category is quite low for making a co-expression network, these networks are quite messy even if based on many samples, applying an arbitrary cutoff to make co-expression binary is problematic, using the same cutoff for two networks based on different numbers of samples is problematic, differences between the two networks will likely be dominated by "noise", and many of the network metrics may not be meaningful for this type of network. I would thus be highly sceptical of any results coming out of it.
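To illustrate why the same cutoff means different things at different sample sizes, the sketch below uses the standard Fisher z-transformation to put an approximate 95% confidence interval around an observed correlation of 0.75. The sample sizes 10 and 50 are hypothetical and are not the group sizes of the dataset in question.

```python
import math


def pearson_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r from n samples.

    Uses the Fisher z-transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3). z_crit = 1.96 gives about 95%.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


small = pearson_ci(0.75, n=10)  # hypothetical small group
large = pearson_ci(0.75, n=50)  # hypothetical larger group
print(f"n=10: {small[0]:.2f} to {small[1]:.2f}")
print(f"n=50: {large[0]:.2f} to {large[1]:.2f}")
```

The interval for the smaller group is much wider, so an edge that passes the cutoff there is far weaker evidence of real co-expression than the same edge in the larger group. With thousands of genes and millions of gene pairs, many such edges will appear by chance alone.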

Point #3 is obviously deeply problematic if the goal was to make a scientific paper. However, it is also very much what allows you to show that you can think critically, so from an assignment perspective it is not all bad.

If I were to make a biological network analysis of the dataset in question, my approach would be very different. Briefly, I would analyze the expression data to identify differences at the level of individual genes between classes of samples. I would map this onto an external network (i.e. the edges would not be derived from the expression data) and identify clusters or modules within the network where interacting genes show similar behavior. Precisely because the edge information is external, it can serve as independent evidence and thus allow me to find groups of genes that are likely to be more relevant in the disease context than what was found in the initial gene-level analysis. I would then use text mining to help identify literature relevant to each of the identified modules and, based on that, put each module in the context of what is known. That, however, is probably way too much work for an assignment and almost certainly involves methodologies outside the scope of the course.
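As a rough toy illustration of the mapping step only, the sketch below keeps the genes that pass a gene-level threshold and reports the connected groups they form in an external network. Everything in it is hypothetical: the gene names, the log fold changes, the edges, and the threshold of 1.0. Connected components are a crude stand-in for proper module detection, and a real analysis would use a statistical test rather than a fold-change cutoff.

```python
from collections import defaultdict


def modules(external_edges, hits):
    """Connected components of the external network restricted to hit genes."""
    adj = defaultdict(set)
    for a, b in external_edges:
        if a in hits and b in hits:  # keep edges whose two ends are both hits
            adj[a].add(b)
            adj[b].add(a)
    seen, result = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        result.append(sorted(comp))
    return result


# Hypothetical gene-level result (e.g. old vs. young) and external interactions
log_fold_change = {
    "GENE_A": 1.8, "GENE_B": 1.5, "GENE_C": 0.1,
    "GENE_D": -1.6, "GENE_E": -2.0, "GENE_F": 1.2,
}
external_edges = [
    ("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_D"), ("GENE_D", "GENE_E"),
]
hits = {g for g, lfc in log_fold_change.items() if abs(lfc) >= 1.0}
found = modules(external_edges, hits)
```

In this toy input the two up-regulated neighbors form one group and the two down-regulated neighbors form another, while GENE_F passes the threshold but has no interacting hit and so joins no module.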