Hello,
I'm quite new to RNA-seq data analysis and I was wondering if someone could help me out with some normalization. I have 4 biological replicates per sample and would like to normalize the data using housekeeping genes, among other reasons because it is known that the tissues where the RNA comes from contain different amounts of cells hence I would like to correct for this. The question with housekeeping genes always is of course if they are actually stable, but in our case there is data publicly available that can help us select stably expressed genes.
I'm using edgeR for my differential gene expression analysis. I know it has a function that is designed for normalization using HKGs but I'm not sure if this also works in the case where one does have replicates. As far as I understand it, in this way edgeR take 1 experimental group, then calculates the normalization factor it needs to use based on the housekeeping gene expression, and finally applies this to all groups. Is this correct? And should this method be used if one does have biological replicates?
Besides the above, I haven't been able to get it to work quite yet. If I follow the edgeR manual, I end up with all my p-values and FDRs being 0, which is a bit too good to be true. Also I can't seem to use multiple HKGs, even though it is of course sensible to use many of them. Below is the error message I get when I try to use two or more HKGs in the command, although using either one seperately does let me continue (although the resulting p-values/FDRs are not realistic).
> y0 <- estimateCommonDisp(y1["Hkg1","Hkg2"])
Error in subsetListOfArrays(object, i, j, IJ = IJ, IX = IX, I = I, JX = JX) :
Subscript not found in colnames
Can anyone tell me what I'm doing wrong? And, perhaps more importantly, is this the right method to go about the normalization, or are there better alternatives that I could try? Many thanks in advance for your input.
===edit=== The code after assigning samples to groups: After assigning samples to groups:
> keep <- rowSums(cpm(y)>1) >= 4
> y <- y[keep, , keep.lib.sizes=FALSE]
> y <- calcNormFactors(y)
> y1 <- y
> y1$samples$group <- 1
> y0 <- estimateCommonDisp(y1["Hkg1",])
> y$common.dispersion <- y0$common.dispersion
> de <- exactTest(y, pair = c("Group1", "Group2"))
Actually I found that if I use other HKGs there actually are more believable p and FDR-values, so it seems to depend on which gene I use for normalization.
What's your experiment design ? Also please post entire code from normalisation to DE calling. You must have something wrong in your code to have all genes FDR = 0.
We're looking at different brain regions that either did or did not receive a certain treatment, and we're comparing between them. The different brain regions all come from the same animals (so region A and B can be collected from the same brain). I've edited the original post to contain the code you requested.