Entering edit mode
10.5 years ago
AW
▴
350
Hi,
I want to normalise raw read counts obtained from RSEM, using TMM in edgeR.
expr <- DGEList(counts=data)
expr <- calcNormFactors(expr)
I will then obtain rpkm values from this expr object using
expr_norm <- rpkm(expr, log=TRUE, gene.length=gene_length$Length)
For the gene lengths I need to specify, should I use length or effective length from the RSEM output?
Thanks very much
Alison
Yes I agree, but I just realised because expression is different between my samples, the effective lengths differ for a given gene! And I can only give expr_norm one length per gene.
I have also noticed that gene length actually differs across samples mapped to the same reference. This seems unexpected to me, as the length calculate is independent of expression?
A
I should have mentioned the per-sample effective length issue. The rpkm() function is pretty simple, so you can just implement your own that can use per-sample lengths. It's not surprising that you get different effective lengths between samples, since the fragment distribution can vary. I don't know off-hand how RSEM calculates length, but I suspect that it's the expected length given the transcript composition, which can also vary by sample.
Great, thanks, this is very helpful!
So you would recommend using the per-sample lengths instead of one length for all samples eg the length of the longest isoform? Hypothetically, when using rpkm() do you not need to control for the length over which reads can map ie length of longest isoform?
And finally, if I were to use rpkm() for each per-sample length what command would I use?
I currently use
expr_norm <- rpkm(expr, log=TRUE, gene.length=gene_length$Length)
How do I modify this to have different per-sample lengths?
The benefit of using a per-sample length is that if you have a difference in isoform distribution between samples (e.g., Group A expresses primarily isoform 1 and Group B isoform 2, where the isoforms have different lengths) then the resulting RPKMs better represent this. This is, in fact, how cufflinks works.
As I mentioned, there's no existing command in edgeR to produce rpkm with per-sample lengths. The rpkm() command itself, however, is very simple:
So, just take the matrix of cpm values and divide that by the matrix of per-sample lengths (divided by 1000 if the lengths are in bp rather than kb).
Thanks very much, that is very clear.
Just as note, Im don't think you divide gene lengths by 1000. The manual rpkm says to use bases.
You do need to divide by 1000 if you do it yourself. The code I posted is the rpkm() function from edgeR, so it's expecting bases since it does the division by 1000 internally.
Great, thanks for being so clear!
I have one last question. I have used
rpkm()
with the commands below and then manually calculated rpkm, but I get different values, why?rpkm()
manual rpkm
Thanks!
You should be dividing by norm.factor, not multiplying by it. You could make your life easier and just use the
cpm()
function, since presumably you already have the expected counts as aDGElist
Perfect, thanks very much!
However, when I sum the rpkm columns across each sample, they are very different. I was not expecting this, as I assume that the normalisation is meant to control for this. For example the minimum total rpkm for sample 1 is 790000 but for sample 4 is 1000000. Why are they so different?
Summing the columns will only give the same results if you don't use a normalized library size (a bad idea) and use a single length for each gene (kind of defeats the purpose).
Sorry, I dont think I was particularly clear. I have used TMM and rpkm() to normalise across 4 samples. I wanted to check that the normalisation has worked as previously I have had problems with this. In order to check, I summed the rpkm columns for each sample, with the assumption that the totals should be reasonably similar. But they are not (up to 25% different), and I was wondering what your thoughts on this were?
Thanks so much!!
And, even before
rpkm()
I'm not sure the TMM has worked as thelib.size*norm.factors
are quite different.I wouldn't conclude from that that things went amiss. Rather, it looks like the last library has a skew to it, likely due to rRNA or something like that.
That makes sense. I did remove any rRNA that blasted to the nearest reference. Have you any other suggestions as how to deal with this, as I feel uncomfortable comparing across samples due to this difference in library size?
Any highly expressed gene can cause an effect like this, rRNAs are just the most common of those. Just do a scatter plot of the normalized counts between a couple of the samples and ensure that points cluster around the diagonal. If they do, then things went well. That'll be vastly more informative than looking at the numbers you're looking at.
Great thanks! I have one last question though that I am concerned about. I am interested in identifying differentially expressed genes, and if the library sizes differ then won't this impact my assessment of sex-biased genes? eg if the library size of one sample is bigger, then won't I find that genes are expressed higher in the sample than another with a lower library size?
If the normalized counts cluster around the diagonal when plotted then the TMM normalization worked and the difference in library sizes is exactly what you want to avoid the situation that you're thinking of.