Hi all,
I'm trying to create a WGCNA network from Illumina microarray data for about 300 samples (multiple diseases from mouse, controls and cases). The data has been normalised using 75th Percentile Shift and only transcripts greater than 1.5 fold relative to the median across all samples have been retained. As a result, the matrix has negative values with a range of -12 to 12.
I've tried different matrices (the top 5,000 or top 10,000 transcripts by variance), but when I run soft thresholding on a signed network my R.sq. values are really low and never go above 0.4, even with powers set all the way up to 70. Please see attached image. I've also tried an unsigned network, and I've tried taking absolute values of the matrix (so that the range is 0 to 12), but nothing makes any real difference.
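For readers who want to see what the R.sq. value being discussed measures, here is a minimal sketch in plain Python. It assumes a signed adjacency of ((1 + cor) / 2)^power, sums it into a per-gene connectivity k, bins k into equal-width bins, and fits log10 p(k) against log10 k. The function names (`pearson`, `signed_connectivity`, `scale_free_fit`) are made up for illustration and are not WGCNA functions; the binning and the sign convention only loosely mirror what WGCNA's soft-threshold table reports, so the numbers will not match R output exactly.

```python
import math
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation, clamped to [-1, 1]; 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    if denom == 0:
        return 0.0
    return max(-1.0, min(1.0, sxy / denom))


def signed_connectivity(genes, power):
    """genes: list of per-gene expression vectors (one value per sample).

    Returns each gene's connectivity: the sum of its signed adjacencies
    ((1 + cor) / 2) ** power to every other gene.
    """
    n = len(genes)
    k = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            a = ((1.0 + pearson(genes[i], genes[j])) / 2.0) ** power
            k[i] += a
            k[j] += a
    return k


def scale_free_fit(k, n_bins=10):
    """R^2 of log10 p(k) versus log10 k, signed so a negative slope is positive."""
    lo, hi = min(k), max(k)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all k are equal
    bins = [[] for _ in range(n_bins)]
    for v in k:
        bins[min(int((v - lo) / width), n_bins - 1)].append(v)
    # One point per non-empty bin: (log mean connectivity, log frequency).
    pts = [(math.log10(mean(b)), math.log10(len(b) / len(k)))
           for b in bins if b and mean(b) > 0]
    xs, ys = zip(*pts)
    r = pearson(xs, ys)  # the fitted slope has the same sign as r
    return -math.copysign(r * r, r)


# Toy data (assumption): 40 random "genes" measured in 20 samples.
random.seed(0)
genes = [[random.gauss(0, 1) for _ in range(20)] for _ in range(40)]
for power in (6, 12, 20):
    k = signed_connectivity(genes, power)
    print(power, round(scale_free_fit(k), 3), round(mean(k), 4))
```

Printing the mean connectivity next to the fit value is deliberate: a fit that stays low while connectivity stays high is the pattern described later in this thread.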
I'm a bit unsure what the issue might be. Is it simply that there are too many samples and diseases, and not enough strong correlations in the matrix?
Thank you for any help.
It's mentioned in the WGCNA FAQ:
6. I can't get a good scale-free topology index no matter how high I set the soft-thresholding power.
Did you try clustering your data? Does it separate the samples into cases, controls, diseases, etc.? Why are you filtering based on variance? Highly variable genes may also indicate noise in the data. The FAQ suggests not filtering the data at all, except to remove very low-expressed genes.
Hi Goutham
Thanks for your quick reply. I'm filtering by variance because my Illumina microarray matrix has both negative and positive values, and I don't want to filter on the average of absolute values. The WGCNA FAQ says that both mean and variance are fine for filtering:
2. Should I filter probesets or genes?
Probesets or genes may be filtered by mean expression or variance (or their robust analogs such as median and median absolute deviation, MAD) since low-expressed or non-varying genes usually represent noise. Whether it is better to filter by mean expression or variance is a matter of debate; both have advantages and disadvantages, but more importantly, they tend to filter out similar sets of genes since mean and variance are usually related.
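As a small illustration of the variance versus MAD point in the quoted FAQ text, the sketch below ranks probes by a spread statistic and keeps the top few. The helper names (`mad`, `top_by_spread`) and the three toy probes are invented for this example and are not part of WGCNA. It only shows that a probe dominated by a single outlier can rank first by variance but last by MAD.

```python
from statistics import median, variance


def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    m = median(values)
    return median([abs(v - m) for v in values])


def top_by_spread(expr, n_keep, spread=mad):
    """expr: dict mapping probe ID to a list of expression values.

    Returns the n_keep probe IDs with the largest spread statistic.
    """
    ranked = sorted(expr, key=lambda probe: spread(expr[probe]), reverse=True)
    return ranked[:n_keep]


# Toy matrix (assumption): three probes across eight samples.
expr = {
    "outlier_probe": [0, 0, 0, 0, 0, 0, 0, 12],   # flat except one sample
    "varying_probe": [-2, -1, 0, 1, 2, -2, 1, 2],  # varies across samples
    "quiet_probe": [0.1, -0.1, 0.1, -0.1, 0, 0, 0.1, -0.1],
}

print("by variance:", top_by_spread(expr, 2, spread=variance))
print("by MAD:     ", top_by_spread(expr, 2, spread=mad))
```

On real data the two rankings usually overlap heavily, as the FAQ notes, so this toy case is the exception rather than the rule.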
I have clustered my samples; they don't perfectly separate cases and controls, but there is some meaningful structure. Since then I've also extracted a single disease from the set and run WGCNA on it, but unfortunately the soft-thresholding R.sq. values are still very low and connectivity is very high.