Hi,
I have some RNA-seq samples that I want to normalize and then output RPKM expression, but I am unsure how to do this.
This is my pipeline so far:
Normalise raw read counts with TMM in edgeR
expr <- DGEList(counts=data, group=conditions)
expr <- calcNormFactors(expr)
output:
$samples
group lib.size norm.factors
Sample1 F 19770521 1.0462660
Sample2 F 17970679 0.8794805
Sample3 F 19184265 1.0573665
QUESTION: How do I get normalized raw read counts from this? Do I multiply the read counts by the norm.factors?
QUESTION: Ultimately, I want to end up with RPKM values for each gene in each sample. I know I can use the rpkm() function below in edgeR
expr_norm <- rpkm(expr, log=FALSE,gene.length=vector)
but is expr the output from calcNormFactors or something else?
Thanks for your help!
A
great thanks very much!!
so just to clarify if I do rpkm(expr) then the output will be rpkm files that are both TMM normalised and also length normalised? Is this because it is reading in the norm.factors and the unnormalised read counts from DGEList?
thanks!
Oops, I forgot the "gene.length" part. I've updated my answer accordingly. Regardless, yes, rpkm can get the normalization factors from the DGEList, so you needn't much with that yourself. If you input counts as a matrix then it'll just do library size normalization, which is less robust.
Thanks! And when you say as a matrix, you mean if I divided the counts by the normalisation factor manually and then gave this as an input to rpkm()?
Or just fed it in as is. Feeding in the normalized counts would probably lead to weird results, since the
M
inRPKM
is there for normalization (so you would have normalized a normalized result...which will likely muck it up slightly).Cool, thanks! So I will use the
DGEList
output fromcalcNormFactors
as the input for rpkm along with gene length information!I also have two other questions, that Id be very grateful if you could answer.
Q1. When creating the
expr <- DGEList(counts=data, group=conditions)
, what effect does specifying groups have one the TMM normalisation? How does TMM use this information and how would the results differ if you did specify groups versus not?Q2. The expression data I am using was obtained from mapping reads onto denovo contigs assembled with Trinity. I then chose the most highly expressed contig from each cluster as the "best isoform" and then summed expression across all the contigs in the cluster as the expression value for that cluster. Therefore I do not have one obvious gene length to use. Should I use the longest contig from the cluster?
Thanks!