Hi all,
Firstly, I am a novice in bioinformatics and much of this is very new to me, so I apologize in advance for potentially asking very obvious questions or leaving out helpful details.
I am attempting to perform WGCNA on a publicly available scRNA-seq dataset (GSE155578). I got as far as the blockwiseModules step, when I discovered my genes were segregating into only two modules. Although PCA did not clearly indicate any batch effects, previous posts suggested that this data could have batch effects, in part influenced by a global expression driver used in the collection of the cells. The WGCNA literature suggests using ComBat-seq for batch correction (a run of ComBat-seq I tried failed, but it identified two batches in the data). This is where my problem begins. ComBat-seq uses a user-defined batch vector to identify and correct effects. As I understand it, this should be a vector indicating which samples are derived from which batches, and I am unsure how to provide a batch vector for a dataset that I did not create (the metadata doesn't appear to provide any batch information either). Some have suggested using svaseq to determine the batch(es) and then passing that to ComBat-seq, but I have found limited resources on how to do this. I sincerely appreciate any help on how to overcome this issue.
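To make the batch vector idea concrete: it is one label per sample, in the same order as the columns of the count matrix. Below is a minimal Python sketch of that layout only (the tools in this thread are R packages, and this does not call ComBat-seq). The sample names, batch labels, and the helper `build_batch_vector` are hypothetical and are not taken from GSE155578.

```python
from collections import Counter

def build_batch_vector(sample_names, sample_to_batch):
    """Return one batch label per sample, in count-matrix column order."""
    missing = [s for s in sample_names if s not in sample_to_batch]
    if missing:
        raise ValueError(f"no batch label for samples: {missing}")
    return [sample_to_batch[s] for s in sample_names]

# Hypothetical column names of a count matrix and a hypothetical annotation.
sample_names = ["cell_1", "cell_2", "cell_3", "cell_4"]
sample_to_batch = {"cell_1": "A", "cell_2": "A", "cell_3": "B", "cell_4": "B"}

batch = build_batch_vector(sample_names, sample_to_batch)
batch_sizes = Counter(batch)  # how many samples fall in each batch
print(batch)
print(dict(batch_sizes))
```

The point of the check for missing labels is that every column needs exactly one batch assignment; without real batch annotation there is nothing to put in this vector.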
svaseq estimates surrogate variables that capture sources of variation, not batches in the sense of 'this is batch A, this is B, this is C...'. If you don't see evidence of clear batch effects, then I am not convinced you should blindly correct for them. Is WGCNA suitable for sparse single-cell data, i.e. are there literature references that benchmark and recommend it? You could also try NMF to find genes that are somewhat correlated in expression space. RcppML has a fast implementation for that.
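As a rough illustration of the NMF idea, here is a minimal pure-Python sketch using the classic multiplicative updates on a toy genes-by-cells matrix. This is not RcppML's implementation (RcppML is an R package), and the toy matrix, the function names, and the choice of `k = 2` are all assumptions made for the example. Genes are then grouped by the factor on which they load most strongly.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative genes x cells matrix V into W (genes x k) and H (k x cells)."""
    rng = random.Random(seed)
    n_genes, n_cells = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_genes)]
    h = [[rng.random() + 0.1 for _ in range(n_cells)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_cells)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n_genes)]
    return w, h

# Toy expression matrix (4 hypothetical genes x 4 hypothetical cells).
v = [[4, 2, 0, 0],
     [2, 1, 0, 0],
     [0, 0, 3, 6],
     [0, 0, 1, 2]]

w0, h0 = nmf(v, k=2, n_iter=0)   # random starting point (same seed)
w, h = nmf(v, k=2)               # after 200 update rounds
err_init = frobenius_error(v, w0, h0)
err_final = frobenius_error(v, w, h)

# Assign each gene to the factor with its largest loading in W.
gene_factor = [max(range(2), key=lambda a: row[a]) for row in w]
print(err_init, err_final, gene_factor)
```

On real single-cell data a sparse, compiled implementation such as the one mentioned above would be far faster; the sketch only shows what the factorization produces and how gene groupings can be read off `W`.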
Thank you for your comment! I will give RcppML a try!
The two indications I had for batch effects came from the clustering of genes into only two modules, and the fact that a preliminary run of ComBat-seq on this dataset (which failed because I did not supply an appropriate batch vector) identified the presence of two batches.
I had not previously considered the sparsity of the single-cell data to be an issue. With your help, I found a Seurat function for single-cell WGCNA that takes into account the sparsity of single-cell data. In my experience with Seurat in the past, users must input 10x-formatted data (separate files for genes, barcodes, and matrices) to create Seurat objects. As this data is not formatted in such a way, what would you suggest? Is there an efficient way to use non-10x-formatted data with Seurat? Or is there a way to reformat the data to fit these constraints?
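On the reformatting question, one option is to write the three files by hand. The sketch below, in Python with only the standard library, writes a genes-by-cells count table as `matrix.mtx` (Matrix Market coordinate format), `genes.tsv`, and `barcodes.tsv`. It assumes the older uncompressed 10x layout with a two-column `genes.tsv`; the function name `write_10x_style` and the toy data are hypothetical. As far as I know, Seurat's `CreateSeuratObject()` also accepts a plain genes-by-cells count matrix directly, so this conversion may not be needed at all.

```python
import os
import tempfile

def write_10x_style(counts, genes, barcodes, out_dir):
    """Write a genes x cells integer count table as three 10x-style files.

    Returns the number of non-zero entries written to matrix.mtx.
    """
    if len(counts) != len(genes) or any(len(r) != len(barcodes) for r in counts):
        raise ValueError("counts must have one row per gene and one column per barcode")
    os.makedirs(out_dir, exist_ok=True)

    # Matrix Market uses 1-based (row, column, value) triplets for non-zeros only.
    entries = [(i + 1, j + 1, val)
               for i, row in enumerate(counts)
               for j, val in enumerate(row) if val != 0]

    with open(os.path.join(out_dir, "matrix.mtx"), "w") as fh:
        fh.write("%%MatrixMarket matrix coordinate integer general\n")
        fh.write(f"{len(genes)} {len(barcodes)} {len(entries)}\n")
        for i, j, val in entries:
            fh.write(f"{i} {j} {val}\n")

    # Two columns (gene ID, gene name); the same label is reused for both here.
    with open(os.path.join(out_dir, "genes.tsv"), "w") as fh:
        for g in genes:
            fh.write(f"{g}\t{g}\n")

    with open(os.path.join(out_dir, "barcodes.tsv"), "w") as fh:
        for b in barcodes:
            fh.write(f"{b}\n")

    return len(entries)

# Toy data: 3 hypothetical genes x 3 hypothetical cells.
counts = [[0, 3, 0],
          [1, 0, 0],
          [0, 0, 2]]
genes = ["GeneA", "GeneB", "GeneC"]
barcodes = ["cell_1", "cell_2", "cell_3"]

out_dir = tempfile.mkdtemp()
n_nonzero = write_10x_style(counts, genes, barcodes, out_dir)
print(out_dir, n_nonzero)
```

Whether a given reader expects these files gzipped, or expects a three-column `features.tsv` instead of `genes.tsv`, depends on the version, so it is worth checking the reader's documentation before relying on this layout.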