Hello, I am a long-time lurker but a first-time poster on Biostars, so apologies in advance for any breach of forum etiquette! Thank you in advance for your patience. I have had some formal bioinformatics training through a research fellowship but am largely self-taught.
I have a question about the downstream analysis of publicly available data from GEO. I hope to use GSE124814 as a large dataset to confirm trends seen in a smaller RNA-seq dataset of tumor samples obtained by my group.
GSE124814 is a compiled expression set of 23 medulloblastoma datasets, with 1350 tumor samples and 291 control samples. I am working with a subset of GSE124814 of 233 tumor samples and 291 control samples. The removal of unwanted variation (RUV) method was applied by the authors to account for batch effects, and the entire matrix was quantile normalized.
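For reference, here is a minimal sketch of what quantile normalization does to a genes x samples matrix. It is not the authors' actual pipeline: the toy matrix and the function name `quantile_normalize` are made up for illustration, and ties are broken by order of appearance rather than averaged.

```python
import numpy as np

def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    Simplification: tied values are ranked by order of appearance.
    """
    matrix = np.asarray(matrix, dtype=float)
    # Rank of each value within its own column (0 = smallest).
    order = np.argsort(matrix, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean of the sorted columns, rank by rank.
    reference = np.sort(matrix, axis=0).mean(axis=1)
    # Replace every value by the reference value at its rank.
    return reference[ranks]

# Hypothetical toy data: 4 genes (rows) x 3 samples (columns).
toy = np.array([[5.0, 4.0, 3.0],
                [2.0, 1.0, 4.0],
                [3.0, 4.5, 6.0],
                [4.0, 2.0, 8.0]])
normalized = quantile_normalize(toy)
print(normalized)
```

After this step every sample has exactly the same set of values, only assigned to different genes, which is why the per-sample distributions of a quantile-normalized matrix overlap completely.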
I am interested in performing enrichment analysis (through GSEA) and weighted gene coexpression network analysis (WGCNA) on these data, but after consulting the literature it is not clear to me whether the normalization methods used lend themselves to the analyses I wish to perform. Obviously, I could avoid the confusion by obtaining the raw .CEL files and processing them myself, but I lack the expertise (and the hardware) to confidently reproduce an analysis of that size.
The resulting expression matrix contains values roughly centered at 0, ranging from -7.7 to 9.4. Honestly, I am unsure how to interpret negative expression values like these. The data almost seem to be log-scaled, though I could not find anything on this in the literature. I have been considering an exponential transformation of the data so that I can work from a positive distribution; would this be reasonable? Please find the density plot below to observe the distribution.
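To make the proposed transformation concrete, here is a minimal sketch of the back-transformation. It rests on an assumption that is not confirmed anywhere in the source: that the values are on a log2 scale. The matrix `log_expr` is a simulated stand-in, not the real data.

```python
import numpy as np

# Hypothetical stand-in for the real matrix: genes x samples, roughly centered at 0.
rng = np.random.default_rng(0)
log_expr = rng.normal(loc=0.0, scale=1.5, size=(1000, 6))

# ASSUMPTION: values are log2-scaled, so 2**x maps them back to a positive scale.
linear_expr = np.exp2(log_expr)

print("log-scale range:   ", log_expr.min(), log_expr.max())
print("linear-scale range:", linear_expr.min(), linear_expr.max())
```

Under that assumption, a value of 0 maps to 1, and the observed extremes of -7.7 and 9.4 map to roughly 0.005 and 676. The transformation is monotone, so it changes the shape of the distribution but not the ordering of the values.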
Am I able to use the values as they are for analysis? I performed a very basic differential expression analysis using a Mann-Whitney U test for some preliminary filtering, as I was unsure whether these data are suited to limma/edgeR. I have not yet attempted large-scale GSEA of the data. I have performed WGCNA, and the resulting network did not show robust connectivity unless I lowered the soft-thresholding power to approximately 4, though the scale-free topology analysis (see below) suggests that a power of 10-12 would be more appropriate for constructing the adjacency matrix. Further, the WGCNA literature suggests a power of 12 for a signed network using data of this size.
tl;dr: How do I interpret data processed through RUV and quantile normalization? Can I use these data for WGCNA/GSEA? Which differential expression method would be best suited to data of this type?
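For context, the kind of per-gene Mann-Whitney U filtering described above can be sketched as follows. This is an illustration on simulated data, not the exact analysis that was run: the matrices `tumor` and `control`, the helper `benjamini_hochberg`, and the Benjamini-Hochberg correction step itself are all assumptions added for the example.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    pvalues = np.asarray(pvalues, dtype=float)
    n = pvalues.size
    order = np.argsort(pvalues)
    scaled = pvalues[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, then cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted
    return adjusted

# Hypothetical toy data: 200 genes, 20 tumor and 25 control samples.
rng = np.random.default_rng(1)
tumor = rng.normal(0.0, 1.0, size=(200, 20))
control = rng.normal(0.0, 1.0, size=(200, 25))
tumor[:10] += 3.0  # shift the first 10 genes so there is something to detect

# Two-sided Mann-Whitney U test, one gene (row) at a time.
pvals = np.array([
    mannwhitneyu(t, c, alternative="two-sided").pvalue
    for t, c in zip(tumor, control)
])
padj = benjamini_hochberg(pvals)
print("genes with adjusted p < 0.05:", int((padj < 0.05).sum()))
```

Because the test is rank-based, its p-values are unchanged by any monotone transformation of the data, so it gives the same result whether or not the values are exponentiated first.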
Thanks in advance.
Hi Dave! I am a med student working on a similar project. Could you please let me know how you went about this in the end? Thanks a lot!
I've moved your post to a comment - don't add answers unless you're answering the top level question.