Hi, I'm attempting to construct a consensus WGCNA network for 4 related expression datasets. For a signed network, two of the datasets achieve a high scale-free topology very early, but for the other two they don't reach 0.8 until a very high power.
Here, the last of four datasets achieves a scale-free topology at 28, although there appears to be a slight peak in this metric at 17.
A power of 28 seems high to me (although I may be wrong), especially when the WGCNA FAQ recommends choosing a maximum power of 18 for signed datasets when presented with a lack of scale free topology. Is there any argument to be made for 17 since the graph shows that slight peak? https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html
If it makes more sense to consider our multi-expression data to lack scale-free topology, suggesting we choose a power from the table below, how should we determine our sample size from column 1?
Should we consider our total number of samples across all 4 datasets (>70), or should we choose the average, minimum, or maximum number of samples per dataset (<20)?
In a related question, does the scale-free topology look better if we use an unsigned network (please ignore the graphical numbering glitch)?
Based on these results, it seems we should choose a power of 11 since it is the lowest power where all 4 datasets have a scale-free topology >0.8.
If we were to choose an unsigned network instead, is it valid to run downstream statistics on eigengene values from an unsigned network even though modules would contain sets of genes with both positive and negative correlations? Since the eigengene values represent principal component 1 for a given module, I assume they represent the strongest expression signal within a module and would not combine signals from both positively and negatively correlated gene sets. Am I wrong in this assumption? Are the genes that contribute to PC1 in each module held consistent when producing eigengene values across all samples and datasets?
Thank you for all your help!
Hi,
Can you post the PCA plot?
Sure, here it is. By the way, this is based on a VST normalized dataset from DESeq2. Is it possible a r-log transformation could yield a better scale-free topology? The PCA for an r-log transformed dataset looks nearly identical.
Here, PC1 explains differences in tissue and PC2 explains genetic lineage, so the results capture the expected drivers in variation between datasets. There is more variation within intestines than brains, but perhaps that is also expected as well since intestine would be more impacted by external stimuli via diet.
First though, If the original plan was to build a network for each group of samples I am afraid you can't do that because after removing some of the samples that look like outliers in the LS.INT and WS.INT you will not have enough samples for network analysis. Second thought, there seems to be a lot of within-group variability in the LS.INT and WS.INT groups which could negatively affect the dispersion estimates. The only solution to these problems is to
From the consensusTOM, you can still get modules of co-expressed genes between the 4 groups of samples
Thank you for your thoughts. We are perhaps more interested in the differences between LS and WS, while differences between intestine and brain are interesting but more serve as a sort of reciprocal control to put differences between LS and WS in context. For example, if we see the same module variance between WS and LS in both tissues we assume these could be system-wide expression patterns or if we see an interaction between both variables we assume expression differences between LS and WS could be tissue-specific.
Fortunately, we weren't intending on making 4 separate networks. We planned to make a single consensus network using all four individual datasets, then compare module preservation between each of the four pairs of datasets and run DE-like statistics on the resulting module eigengenes. Will building a single consensus network from intestine and brain datasets differ from a consensus network from all 4 datasets? If we build a consensus network from just 2 datasets (intestine and brain) instead of 4, would our above experimental question and proposed analyses still be valid/possible?
I should also mention this is a filtered dataset. The VST was first run using all genes for all samples, but we then removed genes that didn't have raw counts of at least 10 copies in every single sample. This was done to ensure that any significant results and observed interactions were not primarily driven by lack of expression in one of the two tissue types. Is this a reasonable way to proceed?
*We also already remove some outlier samples based on the sample trees. The remaining "outliers" did not seem so different from others within their dataset based on these trees.
You could build a consensus network from the 2 datasets (intestine and brain) (be sure that the intestine and brain dataset shares the same genes) and then a tissue-specific network only with the brain samples. Then, for the preservation analysis, you can use the brain-specific network as a reference to check if the brain.WS or brain.LS modules are also preserved in the consensus network.
Sounds good to me. You should first remove the low count genes and then proceed with the VST normalization.
Thank you for your response, and sorry for my delayed reply, but may I ask for some clarifications on this point? Do you mean to say that we could only do an WS vs. LS preservation rank comparison for the brain in this way, but not for intestine, and that the intestine samples must be grouped for the creation of a valid network? WGCNA is still a little confusing for me, but how might a consensus network between two datasets differ from a consensus network using four datasets if the exact same samples are used each time?
Could you share the normalized expression matrix of all the genes shared between the datasets so I can show you all the possible comparison you can do in the analysis? You can replace the gene names with a fake gene id.
I've uploaded the subsetted datasets for each of the conditions, split up by tissue only and alternatively by both tissue and stock (LS vs. WS). This folder has both the VST expression matrices and colData for each subset, but if you prefer all samples together you can do an rbind/cbind with the "vst.BRN" and "vst.INT" datasets. Hope this format works for you, and thanks again for you help.
https://drive.google.com/drive/folders/1kcEnYrKS8cZrddPIzD70ioFTJy6gtfEG?usp=share_link
I went ahead and subsetted the dataset by tissue type and to see if there's as much within group variability and overlap without the overwhelming signal of PC1 (from the full dataset). It maybe looks a little better for intestine samples this way, but looking at the unfiltered dataset gives us a much better separation of LS and WS, except for possibly one remaining outlier.
The resulting scale-free analysis from the unfiltered dataset also looks slightly better, with all 4 samples crossing the 0.8 threshold at a power of 12 for a signed network (again, ignore the weird numbering glitch).
Unsigned scale free analysis is a little messier, with both intestine datasets approaching 0.8 at a power of 9 and finally surpassing it at 15 or 16.
If we ran this analysis on all genes we expect it would give us false positive expression differences for eigengenes due to the fact many genes are likely tissue-specific, since we are mainly targeting interactions. Therefore, analysis on the filtered dataset is preferred, but if it's statistically dubious we can run your proposed pipeline or deal with false positives in a different way. Thoughts?
As far as I can tell you, all the dataset included in the WGCNA analysis must share the same genes.