Hello everyone,
I am working with an RNA-seq data set from the environment, so it's a mixed community with over 200,000 contigs assembled. I don't have many people to discuss my methods with and it would be very helpful to get your feedback on how I did this analysis.
For the WGCNA analysis, instead of working on the individual contig/gene level, I aggregate to the KEGG ID level and now have a list of about 1,500 IDs. This creates a composite of all the different genera/species/strains and allows me to work on a functional level. Broadly, I am curious which processes are more likely associated with certain environmental conditions, and the analysis seems to have worked out for this purpose.
I followed the manual pretty closely, and did not adjust p values. I merged modules and only have 4 - I am ok with this because otherwise, the modules were showing very similar correlations with traits. I wanted to group these together and minimize redundancy.
My first question is if the module membership p value (unadjusted) is not significant for some of the genes of interest (although many are). Can/should you still discuss these genes? I interpret this to mean that the correlation wasn't robust enough to be statistically significant, although the gene still clustered in this module and not in the "gray" module. Some of these non-significant genes, clustering along with others that were significant in that module, are showing an interesting pattern that is consistent with the literature.
Second question - is it routine to adjust all of the wgcna p values? Why is the default to not do this then? I'm having trouble figuring out when you absolutely should do this (to follow wgcna best practices), and when it doesn't make sense to based on the complications of your less than ideal and messy environmental data set.
It's difficult making these calls and so I would be grateful for your feedback.
Hello and thank you for the reply!
Yes, one could absolutely take a look at taxa-specific gene expression using WGCNA. I am considering this for future analyses. For the present one, I was simply curious which metabolic processes were overtly different between regions, and therefore was sticking to all taxa within a certain phylum. The extra level of detail could be interesting regarding which taxa contributed to individual signals but I decided to keep it broad for the purposes of this analysis. Does this make sense? Also, 200,000 genes is nontrivial to handle and it seems that is not the norm for WGCNA. I wasn't able to build the network locally on my computer, although this could be done on a remote computer with more RAM.
The kME Pearson correlation is associated with a p value and both are given by WGCNA. For instance, I have a gene with a weak correlation is 0.1 to this module eigengene and a non-significant p value, but it was assigned to the same module as many other biologically related genes. Can you say anything about this gene? Or toss it out?
My second questions is for all of wgcna correlations - module/trait, and kME. Is it routine to adjust those p values?
You do not need to build a network for the 200,000 transcripts. That is just too much. Since you already have the network, just look at the taxa that contributes to the 'expression' of the KOs with the highest kME (the hub of the modules).
You shoudl toss it out. I would focus my analysis on the KO with the highest kME. However, I would consider the KO with low kME only if they are involved in the same pathway of the KO included in the hub.
It is not part of the WGCNA pipleine, so I do not think is necessary.
One question. signedKME is the function used to calculate the kME. Just out of curiosity, how did you get the p-value?
Ok thank you so much! So helpful to get another person's take on this.
I actually did not use signedKME to get module membership. I followed the manual:
That sounds!
Have fun