TMM normalisation from RSEM raw counts
1
0
Entering edit mode
10.5 years ago
AW ▴ 350

Hi,

I want to normalise raw read counts obtained from RSEM, using TMM in edgeR.

expr <- DGEList(counts=data)
expr <- calcNormFactors(expr)

I will then obtain rpkm values from this expr object using

expr_norm <- rpkm(expr, log=TRUE, gene.length=gene_length$Length)

For the gene lengths I need to specify, should I use length or effective length from the RSEM output?

Thanks very much

Alison

Rsem RNA-seq rpkm TMM EdgeR • 8.7k views
ADD COMMENT
0
Entering edit mode

Yes I agree, but I just realised because expression is different between my samples, the effective lengths differ for a given gene! And I can only give expr_norm one length per gene.

I have also noticed that gene length actually differs across samples mapped to the same reference. This seems unexpected to me, as the length calculate is independent of expression?

A

ADD REPLY
0
Entering edit mode

I should have mentioned the per-sample effective length issue. The rpkm() function is pretty simple, so you can just implement your own that can use per-sample lengths. It's not surprising that you get different effective lengths between samples, since the fragment distribution can vary. I don't know off-hand how RSEM calculates length, but I suspect that it's the expected length given the transcript composition, which can also vary by sample.

ADD REPLY
0
Entering edit mode

Great, thanks, this is very helpful!

So you would recommend using the per-sample lengths instead of one length for all samples eg the length of the longest isoform? Hypothetically, when using rpkm() do you not need to control for the length over which reads can map ie length of longest isoform?

And finally, if I were to use rpkm() for each per-sample length what command would I use?

I currently use

expr_norm <- rpkm(expr, log=TRUE, gene.length=gene_length$Length)

How do I modify this to have different per-sample lengths?

ADD REPLY
1
Entering edit mode

The benefit of using a per-sample length is that if you have a difference in isoform distribution between samples (e.g., Group A expresses primarily isoform 1 and Group B isoform 2, where the isoforms have different lengths) then the resulting RPKMs better represent this. This is, in fact, how cufflinks works.

As I mentioned, there's no existing command in edgeR to produce rpkm with per-sample lengths. The rpkm() command itself, however, is very simple:

rpkm <- function(x, gene.length, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25)
{
        y <- cpm(x=x,normalized.lib.sizes=normalized.lib.sizes,log=log,prior.count=prior.count)
        gene.length.kb <- gene.length/1000
        if(log)
                y-log2(gene.length.kb)
        else
                y/gene.length.kb
}

So, just take the matrix of cpm values and divide that by the matrix of per-sample lengths (divided by 1000 if the lengths are in bp rather than kb).

ADD REPLY
0
Entering edit mode

Thanks very much, that is very clear.

Just as note, Im don't think you divide gene lengths by 1000. The manual rpkm says to use bases.

gene.length: vector of length nrow(x) giving gene length in bases, or the name of the column x$genes containing the gene lengths.

ADD REPLY
0
Entering edit mode

You do need to divide by 1000 if you do it yourself. The code I posted is the rpkm() function from edgeR, so it's expecting bases since it does the division by 1000 internally.

ADD REPLY
0
Entering edit mode

Great, thanks for being so clear!

I have one last question. I have used rpkm() with the commands below and then manually calculated rpkm, but I get different values, why?

rpkm()

expr_norm <- rpkm(expr, log=FALSE, gene.length=gene_length$Length)

manual rpkm

(((expected_counts_from_RSEM*norm.factor)*10^6)/(gene_length_kb*lib.size))

Thanks!

ADD REPLY
1
Entering edit mode

You should be dividing by norm.factor, not multiplying by it. You could make your life easier and just use the cpm() function, since presumably you already have the expected counts as a DGElist

ADD REPLY
0
Entering edit mode

Perfect, thanks very much!

However, when I sum the rpkm columns across each sample, they are very different. I was not expecting this, as I assume that the normalisation is meant to control for this. For example the minimum total rpkm for sample 1 is 790000 but for sample 4 is 1000000. Why are they so different?

ADD REPLY
0
Entering edit mode

Summing the columns will only give the same results if you don't use a normalized library size (a bad idea) and use a single length for each gene (kind of defeats the purpose).

ADD REPLY
0
Entering edit mode

Sorry, I dont think I was particularly clear. I have used TMM and rpkm() to normalise across 4 samples. I wanted to check that the normalisation has worked as previously I have had problems with this. In order to check, I summed the rpkm columns for each sample, with the assumption that the totals should be reasonably similar. But they are not (up to 25% different), and I was wondering what your thoughts on this were?

Thanks so much!!

ADD REPLY
0
Entering edit mode

And, even before rpkm() I'm not sure the TMM has worked as the lib.size*norm.factors are quite different.

lib.size    norm.factors    lib.size*norm.factors
37230243    1.0990646       40918442.13
37682151    1.0782278       40629942.77
42397446    0.974394        41311817
33079364    0.8660274       28647635.6
ADD REPLY
0
Entering edit mode

I wouldn't conclude from that that things went amiss. Rather, it looks like the last library has a skew to it, likely due to rRNA or something like that.

ADD REPLY
0
Entering edit mode

That makes sense. I did remove any rRNA that blasted to the nearest reference. Have you any other suggestions as how to deal with this, as I feel uncomfortable comparing across samples due to this difference in library size?

ADD REPLY
0
Entering edit mode

Any highly expressed gene can cause an effect like this, rRNAs are just the most common of those. Just do a scatter plot of the normalized counts between a couple of the samples and ensure that points cluster around the diagonal. If they do, then things went well. That'll be vastly more informative than looking at the numbers you're looking at.

ADD REPLY
0
Entering edit mode

Great thanks! I have one last question though that I am concerned about. I am interested in identifying differentially expressed genes, and if the library sizes differ then won't this impact my assessment of sex-biased genes? eg if the library size of one sample is bigger, then won't I find that genes are expressed higher in the sample than another with a lower library size?

ADD REPLY
0
Entering edit mode

If the normalized counts cluster around the diagonal when plotted then the TMM normalization worked and the difference in library sizes is exactly what you want to avoid the situation that you're thinking of.

ADD REPLY
0
Entering edit mode
10.5 years ago

Effective length would make more sense, since that's the observable length.

ADD COMMENT

Login before adding your answer.

Traffic: 2673 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6