Hello,
I'm trying to obtain a list of the most highly variable genes of a set of data. I have genes in my columns and the samples per row. The RNA-Seq count data have already been normalized with cpm and applied with a log2 algorithm.
- I compute the mean and variance for each gene from an RNA-Seq count matrix.
- I calculate the standardized count by calculating the coefficient of variance.
- Rank the genes based on this coefficient of variance when plotting.
I'm not sure though if these are adequate steps to take. Please see below my code snippet.
#Apply log2 transformation
logCounts <- log2((NormalizedCounts + 1))
#Compute the mean and the variance for each gene
geneVars <- apply(logCounts, 2, var)
geneMeans <- colMeans(logCounts)
#Calculate the coefficient of variance based on previous values
geneCV2 <- geneVars / geneMeans^2
#Plot Data
smoothScatter(log2(means),log2(cv2))
Alternatively, I thought I could just use directly the index and subset on my original dataset.
sortedVariance <- sort(geneVars, decreasing=TRUE, index.return=TRUE)$ix[1:2500]
HGVs <- logCounts[, sorted]
Any suggestions if I'm doing it alright, as, in the end, I want to take the top 2000 most variable genes. I am also trying to avoid the use of packages like limma and Deseq2.
I should correct myself, this is on the log2 scale. I have upated my post
Your code is doing column-wise operations but genes are in rows.
The genes are actually in columns :)