Question

WGCNA input genes - greatest variance?

0

Entering edit mode

14 months ago

fluentin44 ▴ 20

Hi all,

I'm doing WGCNA with RNASeq expression data - ive seen examples of supplying the (vst normalised) count matrix to WGCNA in its totality and some instance of those genes with the greatest variance, i.e. the top 25% or top 5000 genes. When I do the former the scale free model plot looks normal with a clear soft power of 8. When I do the latter the graphs look bizarre as below. Is this as a result of picking 5000 genes in the dataset with greatest variance?

# Removal of missing genes / outlier samples ------------------------------

gsg <- goodSamplesGenes(t(datExpr)) # filters samples and genes with too many missing entries
summary(gsg)
gsg$allOK # Indicates whether there are genes or samples which are outliers in boolean

table(gsg$goodGenes)
table(gsg$goodSamples)

# remove genes that are detected as outliers
datExpr <- datExpr[gsg$goodGenes == TRUE,]

# if (!gsg$allOK)
# {
#   # Optionally, print the gene and sample names that were removed:
#   if (sum(!gsg$goodGenes)>0) 
#     printFlush(paste("Removing genes:", paste(names(datExpr)[!gsg$goodGenes], collapse = ", ")));
#   if (sum(!gsg$goodSamples)>0) 
#     printFlush(paste("Removing samples:", paste(rownames(datExpr)[!gsg$goodSamples], collapse = ", ")));
#   # Remove the offending genes and samples from the data:
#   datExpr = datExpr[gsg$goodSamples, gsg$goodGenes]
# }



# Explicitly clustering and removing samples ------------------------------

sampleTree = hclust(dist(t(datExpr)), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, 
     cex.axis = 1.5, cex.main = 2)


samples.to.be.excluded <- c('BA5_02')
datExpr <- datExpr[,!(colnames(datExpr) %in% samples.to.be.excluded)]

sampleTree = hclust(dist(t(datExpr)), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, 
     main = "Sample clustering to detect outliers sample removed", 
     sub="", xlab="", 
     cex.lab = 1.5, 
     cex.axis = 1.5, 
     cex.main = 2)



# Make deseq2 object and exclude low count --------------------------------

# # Plot a line to show the cut
# abline(h = 200000, col = "red");
# # Determine cluster under the line
# clust = cutreeStatic(sampleTree, cutHeight = 200000, minSize = 10)
# table(clust)
# # clust 1 contains the samples we want to keep.
# keepSamples = (clust==1)
# datExpr = data[keepSamples, ]
# nGenes = ncol(datExpr)
# nSamples = nrow(datExpr)


# Making deseq2 matrix and removing lower count genes ---------------------

datTraits <- 
  hamster_pheno %>% 
  filter(!row.names(.) %in% samples.to.be.excluded)

names(datTraits)

all(rownames(datTraits) %in% colnames(datExpr))
all(rownames(datTraits) == colnames(datExpr)) # Same order

# data <- datExpr 

dds <-
  DESeqDataSetFromMatrix(countData = datExpr,
                         colData = datTraits,
                         design = ~ 1) # not specifying model


## remove all genes with counts < 15 in more than 75% of samples (31*0.75=23.25)
## suggested by WGCNA on RNAseq FAQ

sample_no <- ncol(datExpr)
sample_threshold <- sample_no * 0.75
dds75 <- dds[rowSums(counts(dds) >= 15) >= sample_threshold,]
nrow(dds75) # 13284 genes

# perform variance stabilization
dds_norm <- vst(dds75)

sampleTree = hclust(dist(t(assay(dds_norm))), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, 
     main = "Sample clustering of normalised data to detect outliers sample removed", 
     sub="", xlab="", 
     cex.lab = 1.5, 
     cex.axis = 1.5, 
     cex.main = 2)


# get normalized counts - Rows samples, columns gene IDs
datExpr <-
  assay(dds_norm) %>%
  t()

# Calculate variance for each gene
gene_variance <- apply(assay(dds_norm), 1, var)

# Option 1: Select top 25% of genes by variance
top_variance_genes <- names(sort(gene_variance, decreasing = TRUE))[1:(length(gene_variance) * 0.25)]

# # Option 2: Select top 5000 genes by variance (if less than total number of genes)
top_variance_genes_5000 <- names(sort(gene_variance, decreasing = TRUE))[1:min(5000, length(gene_variance))]
#
# # Choose one of the options based on your criteria
# # For example, if using top 25%:
selected_genes <- top_variance_genes_5000

# Filter datExpr to include only selected genes
datExpr_filtered <- datExpr[selected_genes, ]

datExpr <- datExpr_filtered %>% t()

collectGarbage()


# Binarise catagorical data -----------------------------------------------

temp <- datTraits
temp_2 <- 
  temp %>% 
  binarizeCategoricalColumns(c("group", "sex"))

# datTraits_bin <- 
#   bind_cols(temp, temp_2) %>% 
#   dplyr::select(-ab)
datTraits_bin <- temp_2

sampleTree2 <- hclust(dist(datExpr), method = "average")
traitColors <- numbers2colors(datTraits_bin)

# Check samples are clustering together
plotDendroAndColors(sampleTree2,
                    traitColors,
                    groupLabels = colnames(datTraits_bin),
                    main = "Sample dendrogram and trait heatmap normalised")



# Pick soft power threshold -----------------------------------------------

# Choose a set of soft-thresholding powers
power <- c(c(1:10), seq(from = 12, to = 50, by = 2))

# Call the network topology analysis function
sft <- pickSoftThreshold(datExpr,
                         powerVector = power,
                         networkType = "signed",
                         verbose = 5)


sft.data <- sft$fitIndices

# visualization to pick power

a1 <- 
  ggplot(sft.data, aes(Power, SFT.R.sq, label = Power)) +
  geom_point() +
  geom_text(nudge_y = 0.1) +
  geom_hline(yintercept = 0.8, color = 'red') +
  labs(x = 'Power', y = 'Scale free topology model fit, signed R^2') +
  theme_classic()


a2 <- 
  ggplot(sft.data, aes(Power, mean.k., label = Power)) +
  geom_point() +
  geom_text(nudge_y = 0.1) +
  labs(x = 'Power', y = 'Mean Connectivity') +
  theme_classic()

grid.arrange(a1, a2, nrow = 2) # Selected 18 as its above threshold and low mean connectivity

enter image description here

WGCNA RNA-Seq • 1.1k views

ADD COMMENT • link updated 14 months ago by andres.firrincieli 3.9k • written 14 months ago by fluentin44 ▴ 20

0

Entering edit mode

When I do the latter the graphs look bizarre as below. Is this as a result of picking 5000 genes in the dataset with greatest variance?

by subsetting the top 25% genes with the greatest variance you are likely selecting subset of DEG whose expression profile is affected by a strong driver of variation. If you run a PCA on the vst transformed expression data of the 5000 genes you will likely see 2 cluster of samples

ADD REPLY • link 14 months ago by andres.firrincieli 3.9k

0

Entering edit mode

No looked the same as previous as the DESeq2::plotpca function takes the rows with the greatest variance anyway.

I just dont really understand why I have this odd bounce from 1-3, 3-10 and back up again without meeting the 0.8 threshold.