Entering edit mode
10 months ago
fluentin44
▴
20
Hi all,
I'm doing WGCNA with RNASeq expression data - ive seen examples of supplying the (vst normalised) count matrix to WGCNA in its totality and some instance of those genes with the greatest variance, i.e. the top 25% or top 5000 genes. When I do the former the scale free model plot looks normal with a clear soft power of 8. When I do the latter the graphs look bizarre as below. Is this as a result of picking 5000 genes in the dataset with greatest variance?
# Removal of missing genes / outlier samples ------------------------------
gsg <- goodSamplesGenes(t(datExpr)) # filters samples and genes with too many missing entries
summary(gsg)
gsg$allOK # Indicates whether there are genes or samples which are outliers in boolean
table(gsg$goodGenes)
table(gsg$goodSamples)
# remove genes that are detected as outliers
datExpr <- datExpr[gsg$goodGenes == TRUE,]
# if (!gsg$allOK)
# {
# # Optionally, print the gene and sample names that were removed:
# if (sum(!gsg$goodGenes)>0)
# printFlush(paste("Removing genes:", paste(names(datExpr)[!gsg$goodGenes], collapse = ", ")));
# if (sum(!gsg$goodSamples)>0)
# printFlush(paste("Removing samples:", paste(rownames(datExpr)[!gsg$goodSamples], collapse = ", ")));
# # Remove the offending genes and samples from the data:
# datExpr = datExpr[gsg$goodSamples, gsg$goodGenes]
# }
# Explicitly clustering and removing samples ------------------------------
sampleTree = hclust(dist(t(datExpr)), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)
samples.to.be.excluded <- c('BA5_02')
datExpr <- datExpr[,!(colnames(datExpr) %in% samples.to.be.excluded)]
sampleTree = hclust(dist(t(datExpr)), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree,
main = "Sample clustering to detect outliers sample removed",
sub="", xlab="",
cex.lab = 1.5,
cex.axis = 1.5,
cex.main = 2)
# Make deseq2 object and exclude low count --------------------------------
# # Plot a line to show the cut
# abline(h = 200000, col = "red");
# # Determine cluster under the line
# clust = cutreeStatic(sampleTree, cutHeight = 200000, minSize = 10)
# table(clust)
# # clust 1 contains the samples we want to keep.
# keepSamples = (clust==1)
# datExpr = data[keepSamples, ]
# nGenes = ncol(datExpr)
# nSamples = nrow(datExpr)
# Making deseq2 matrix and removing lower count genes ---------------------
datTraits <-
hamster_pheno %>%
filter(!row.names(.) %in% samples.to.be.excluded)
names(datTraits)
all(rownames(datTraits) %in% colnames(datExpr))
all(rownames(datTraits) == colnames(datExpr)) # Same order
# data <- datExpr
dds <-
DESeqDataSetFromMatrix(countData = datExpr,
colData = datTraits,
design = ~ 1) # not specifying model
## remove all genes with counts < 15 in more than 75% of samples (31*0.75=23.25)
## suggested by WGCNA on RNAseq FAQ
sample_no <- ncol(datExpr)
sample_threshold <- sample_no * 0.75
dds75 <- dds[rowSums(counts(dds) >= 15) >= sample_threshold,]
nrow(dds75) # 13284 genes
# perform variance stabilization
dds_norm <- vst(dds75)
sampleTree = hclust(dist(t(assay(dds_norm))), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree,
main = "Sample clustering of normalised data to detect outliers sample removed",
sub="", xlab="",
cex.lab = 1.5,
cex.axis = 1.5,
cex.main = 2)
# get normalized counts - Rows samples, columns gene IDs
datExpr <-
assay(dds_norm) %>%
t()
# Calculate variance for each gene
gene_variance <- apply(assay(dds_norm), 1, var)
# Option 1: Select top 25% of genes by variance
top_variance_genes <- names(sort(gene_variance, decreasing = TRUE))[1:(length(gene_variance) * 0.25)]
# # Option 2: Select top 5000 genes by variance (if less than total number of genes)
top_variance_genes_5000 <- names(sort(gene_variance, decreasing = TRUE))[1:min(5000, length(gene_variance))]
#
# # Choose one of the options based on your criteria
# # For example, if using top 25%:
selected_genes <- top_variance_genes_5000
# Filter datExpr to include only selected genes
datExpr_filtered <- datExpr[selected_genes, ]
datExpr <- datExpr_filtered %>% t()
collectGarbage()
# Binarise catagorical data -----------------------------------------------
temp <- datTraits
temp_2 <-
temp %>%
binarizeCategoricalColumns(c("group", "sex"))
# datTraits_bin <-
# bind_cols(temp, temp_2) %>%
# dplyr::select(-ab)
datTraits_bin <- temp_2
sampleTree2 <- hclust(dist(datExpr), method = "average")
traitColors <- numbers2colors(datTraits_bin)
# Check samples are clustering together
plotDendroAndColors(sampleTree2,
traitColors,
groupLabels = colnames(datTraits_bin),
main = "Sample dendrogram and trait heatmap normalised")
# Pick soft power threshold -----------------------------------------------
# Choose a set of soft-thresholding powers
power <- c(c(1:10), seq(from = 12, to = 50, by = 2))
# Call the network topology analysis function
sft <- pickSoftThreshold(datExpr,
powerVector = power,
networkType = "signed",
verbose = 5)
sft.data <- sft$fitIndices
# visualization to pick power
a1 <-
ggplot(sft.data, aes(Power, SFT.R.sq, label = Power)) +
geom_point() +
geom_text(nudge_y = 0.1) +
geom_hline(yintercept = 0.8, color = 'red') +
labs(x = 'Power', y = 'Scale free topology model fit, signed R^2') +
theme_classic()
a2 <-
ggplot(sft.data, aes(Power, mean.k., label = Power)) +
geom_point() +
geom_text(nudge_y = 0.1) +
labs(x = 'Power', y = 'Mean Connectivity') +
theme_classic()
grid.arrange(a1, a2, nrow = 2) # Selected 18 as its above threshold and low mean connectivity
by subsetting the top 25% genes with the greatest variance you are likely selecting subset of DEG whose expression profile is affected by a strong driver of variation. If you run a PCA on the vst transformed expression data of the 5000 genes you will likely see 2 cluster of samples
No looked the same as previous as the DESeq2::plotpca function takes the rows with the greatest variance anyway.
I just dont really understand why I have this odd bounce from 1-3, 3-10 and back up again without meeting the 0.8 threshold.
Can you post the PCA plot?