Proper construction of data matrix for WGCNA (weighted gene coexpression network analysis)
8.4 years ago
themantalope ▴ 40

Hi All,

I would just like some clarification of terminology regarding a detail of gene coexpression network construction. Let's say I have two RNA-seq datasets, each dataset containing n replicates, and each dataset representing sequencing data from the same biological system in two different experimental conditions. How should I construct the data matrix for input to something like WGCNA if I want to analyze gene coexpression networks across experimental conditions/interventions?

What I imagine is that each row of the matrix represents data from one gene, and each column represents data collected from one of the replicates in an experimental condition. So for example, one particular row of the matrix would look like this:

          c1R1 ... c1Rn c2R1 ... c2Rn
  gene x [val, ... val, val, ... val]

Here the first column, c1R1, holds the data from the first experimental replicate in the first condition, and the last column, c2Rn, holds the data from the nth experimental replicate in the second condition. For coexpression analysis, each row is then correlated with every other row in a pairwise fashion. An adjacency matrix is constructed from those correlations, and further analyses such as module detection can be run on the resulting adjacency matrix.
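To make the layout concrete, here is a minimal sketch in Python with NumPy. WGCNA itself is an R package, so this only illustrates the idea. The matrix sizes, the simulated values, and the soft-threshold power `beta = 6` are assumptions for illustration, not values from this thread. One practical detail: the WGCNA R functions expect the transposed layout, with samples in rows and genes in columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 5 genes, 4 replicates per condition.
n_genes, n_reps = 5, 4

# Simulated stand-ins for normalized expression values.
cond1 = rng.normal(size=(n_genes, n_reps))  # columns c1R1 .. c1Rn
cond2 = rng.normal(size=(n_genes, n_reps))  # columns c2R1 .. c2Rn

# Genes in rows, samples in columns, as laid out in the question.
expr = np.hstack([cond1, cond2])

# np.corrcoef treats each row as one variable, so this is the
# pairwise gene-by-gene Pearson correlation matrix.
cor = np.corrcoef(expr)

# Unsigned soft-threshold adjacency: |cor| raised to a power beta.
beta = 6  # assumed value; in practice it is chosen from the data
adjacency = np.abs(cor) ** beta

# Transposed layout (samples in rows, genes in columns), which is
# what the WGCNA R functions take as input.
samples_by_genes = expr.T
```

The adjacency matrix is symmetric with ones on the diagonal, and module detection would then operate on it or on a matrix derived from it.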

I just want to verify that this is an appropriate method for organizing data if one wishes to construct coexpression networks for genes "across an intervention".

RNA-Seq coexpression WGCNA • 2.5k views
8.4 years ago
keith.hughitt ▴ 280

Hi mantale1,

That's exactly correct. By including replicates from both conditions, the network will reflect both the specific pathways that are co-regulated during your condition of interest and whatever genes are constitutively expressed in the organism.

If you were to then start adding samples from other, unrelated conditions, you would improve the accuracy of the global co-expression network thanks to the increased information, but you would also reduce the signal resulting from the intervention you are interested in.

A couple of things you might consider:

1) Depending on the number of replicates you have for each condition, you may end up with a very noisy co-expression network. Most of the methods were developed for microarray data, where you are likely to have many more samples. With fewer than 10 replicates across both conditions, you are likely to have a large number of spurious correlations.

2) You might consider filtering out genes which are not differentially expressed across your intervention. This will help both with eliminating spurious correlations, and also help to bring out the signal specifically due to the intervention.
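Both points can be illustrated with simulated data. The sketch below is in Python with NumPy and is only an assumption-laden toy: the sample sizes, the |r| cutoff of 0.8, the Welch t-statistic as a stand-in for a differential expression test, and the |t| cutoff of 4 are all hypothetical choices. A real analysis would use a dedicated differential expression tool instead of a raw t-statistic.

```python
import numpy as np

def fraction_strong(n_samples, n_genes=200, cutoff=0.8, seed=0):
    """Fraction of gene pairs with |r| >= cutoff among independent noise genes."""
    rng = np.random.default_rng(seed)
    expr = rng.normal(size=(n_genes, n_samples))  # no real co-expression
    cor = np.corrcoef(expr)
    upper = cor[np.triu_indices(n_genes, k=1)]    # each pair counted once
    return float(np.mean(np.abs(upper) >= cutoff))

def welch_t(expr, n_cond1):
    """Per-gene Welch t-statistic: first n_cond1 columns vs. the rest."""
    a, b = expr[:, :n_cond1], expr[:, n_cond1:]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

# Point 1: the same pure-noise genes, with few vs. many samples.
few = fraction_strong(n_samples=6)
many = fraction_strong(n_samples=60)

# Point 2: keep only genes that differ between the two conditions.
rng = np.random.default_rng(1)
expr = rng.normal(size=(100, 10))   # 100 genes, 5 + 5 replicates
expr[:10, 5:] += 8.0                # first 10 genes shifted in condition 2
t_stat = welch_t(expr, n_cond1=5)
keep = np.abs(t_stat) >= 4.0        # hypothetical cutoff
filtered = expr[keep]               # rows passed on to network construction
```

With only 6 samples, a noticeable share of unrelated gene pairs clears the cutoff by chance, while with 60 samples essentially none do. The filter keeps the rows of `expr` whose condition means differ strongly, so the correlations computed afterwards are dominated by genes that respond to the intervention.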

Keith


Hi Keith! Thanks for the comment, I really appreciate it. This term was thrown around a lot in the literature and I just wanted to make sure that I was interpreting it correctly. In general, I have kept the concerns you raised in points 1 and 2 in mind. In addition, I would also add (for other readers who are perhaps new to the technique) that interpreting coexpression networks within some other biological context is crucial, and that the intended use of the coexpression analysis should be understood a priori. For example, does one intend to use the coexpression network to discover regulatory hubs in a poorly understood disease state, or does one wish to understand the co-regulatory structure of a well-understood set of genes in a specific experimental condition? (These are just two examples; there are many other possibilities.)
