RNA-seq, why normalize for library size?
6.1 years ago
joselu ▴ 110

Hello. In an RNA-seq experiment, I do not understand why the number of reads for each sample should be normalized. Aren't the differences in read numbers between samples due to differential expression of the genes? Why should we normalize this data? Thank you.


Please do not write in ALL CAPS, there is no need to yell.

6.1 years ago

There are multiple biases involved in an RNA-seq experiment: library size, gene length, RNA population composition for each condition, and gene GC composition.

Two of these biases can be disregarded if you only compare the same gene across conditions, because they are inherent to the gene: gene length and gene GC composition.

  • gene length: The raw counts of two genes cannot be compared directly if gene A is twice as long as gene B. Because of its length, the longer gene has a greater chance of being sequenced than the shorter one. In the end, for the same expression level, the longer gene will get more reads than the shorter one (pub1)
  • gene GC composition: I did not get the full explanation of this bias. For two genes with different GC content, the one whose GC content is closest to 40% will be sequenced more than the other one. (pub2)
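To make the length effect concrete, here is a minimal Python sketch. It assumes, as a simplification, that the expected read count is proportional to expression level times transcript length. The function name, the scaling constant and the two genes are made up for illustration.

```python
def expected_reads(expression, length_bp, reads_per_unit=0.01):
    """Expected read count under the simplifying assumption that
    reads scale with expression level x transcript length."""
    return expression * length_bp * reads_per_unit


# Two hypothetical genes with the same expression level; gene A is twice as long.
LENGTH_A, LENGTH_B = 2000, 1000
reads_a = expected_reads(expression=10, length_bp=LENGTH_A)
reads_b = expected_reads(expression=10, length_bp=LENGTH_B)

# Dividing by the length in kilobases (reads per kilobase) removes the length effect.
rpk_a = reads_a / (LENGTH_A / 1000)
rpk_b = reads_b / (LENGTH_B / 1000)

print(reads_a, reads_b)  # raw counts differ by a factor of two
print(rpk_a, rpk_b)      # length-corrected values agree
```

This only matters when gene A is compared with gene B. When the same gene is compared across conditions, its length is the same on both sides and cancels out.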

The other biases are "technical biases", due to your sample and sequencing method.

  • library size: the best-known bias. You create two libraries for two conditions with the same RNA composition. The second library works much better than the first one: you get 12 000 000 reads for condition A and 36 000 000 reads for condition B. You will have three times (36 000 000/12 000 000 = 3) more reads for each RNA in condition B than in condition A. (pub3)

[Figure: condition A vs. condition B]
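A quick sketch of the simplest correction for library size, scaling each library to counts per million (CPM). The gene names and counts below are invented, chosen so that condition B is exactly condition A sequenced three times deeper.

```python
def counts_per_million(counts):
    """Scale raw counts so that every library sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


# Hypothetical libraries with identical RNA composition but different depth.
condition_a = {"gene1": 1_200_000, "gene2": 10_800_000}          # 12 000 000 reads
condition_b = {g: 3 * c for g, c in condition_a.items()}         # 36 000 000 reads

cpm_a = counts_per_million(condition_a)
cpm_b = counts_per_million(condition_b)

print(condition_b["gene1"] / condition_a["gene1"])  # 3.0 on raw counts
print(cpm_b["gene1"] / cpm_a["gene1"])              # 1.0 after scaling
```

Total-count scaling is enough here only because the two libraries have the same composition.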

  • RNA population composition for each condition: This one is trickier. Let's say you again have two conditions, A and B. For each condition, you want to study 4 genes and you want 90 reads (per condition).

[Figure: RNA population]

Biologically, in condition A, you have 3 genes expressed at the same level (Gene 1, Gene 2 and Gene 3), at an arbitrary unit of 2, and you also have a gene (Gene 4) at 24, which is 12 times more expressed than the three others. In condition B, these 3 genes are also expressed at the same level of 2, but Gene 4 is not expressed at all.

In your design, you want 90 reads for each condition (A and B). Reads will be spread out according to expression level. So, in condition A, each of the 3 other genes gets 6 reads and Gene 4 gets 72, which is 12 times more (72/6 = 12). The funny thing is that in condition B you also have 90 reads to spread, but this time Gene 4 is not expressed. The reads will be spread out over the three remaining genes (Gene 1, Gene 2 and Gene 3), 30 reads each.

You knew that the expression level of Gene 1 was the same in condition A and condition B. Yet the measured level for Gene 1 in condition A (6 reads) is 5 times smaller than in condition B (30 reads), a bias caused by the absence of Gene 4.
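The toy example above can be reproduced in a few lines of Python. This is only a sketch: `allocate_reads` is a hypothetical helper that splits a fixed number of reads in proportion to expression, which is an idealised view of sequencing.

```python
def allocate_reads(expression, total_reads):
    """Split total_reads across genes in proportion to their expression.
    Integer division is exact for the numbers used in this example."""
    total_expr = sum(expression.values())
    return {g: total_reads * e // total_expr for g, e in expression.items()}


expr_a = {"Gene1": 2, "Gene2": 2, "Gene3": 2, "Gene4": 24}
expr_b = {"Gene1": 2, "Gene2": 2, "Gene3": 2, "Gene4": 0}

reads_a = allocate_reads(expr_a, total_reads=90)
reads_b = allocate_reads(expr_b, total_reads=90)

print(reads_a)  # {'Gene1': 6, 'Gene2': 6, 'Gene3': 6, 'Gene4': 72}
print(reads_b)  # {'Gene1': 30, 'Gene2': 30, 'Gene3': 30, 'Gene4': 0}

# Both libraries already have 90 reads, so scaling by total count changes nothing:
# Gene1 still looks 5 times higher in B although its true expression is identical.
apparent_ratio = reads_b["Gene1"] / reads_a["Gene1"]
print(apparent_ratio)  # 5.0
```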


To reduce these biases, there are many methods to normalize RNA-seq data.

Those I call naive:

  • Total count
  • Upper Quartile
  • RPKM (Reads Per Kilobase per Million mapped reads, which is not robust enough for cross-condition experiments, pub4 & pub5)
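For reference, a minimal sketch of the RPKM calculation as it is commonly defined: the count is scaled by gene length in kilobases and by library size in millions of mapped reads. The function name and the numbers are illustrative assumptions, not taken from the cited publications.

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical library of 1 000 000 mapped reads.
# Gene A is twice as long as gene B and collects twice as many reads.
rpkm_a = rpkm(count=200, gene_length_bp=2000, total_mapped_reads=1_000_000)
rpkm_b = rpkm(count=100, gene_length_bp=1000, total_mapped_reads=1_000_000)

print(rpkm_a, rpkm_b)  # both 100.0: the length difference is corrected
```

Note that the library-size part of RPKM is still a total-count scaling, so it inherits the composition problem described above.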

Those with a statistical basis:

For library size and RNA composition

  • RLE method (Relative Log Expression), as in DESeq2
  • TMM method (Trimmed Mean of M-values), as in edgeR
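To give an idea of how a median-of-ratios size factor (the idea behind RLE) deals with the composition problem, here is a simplified pure-Python sketch applied to the 4-gene toy counts from the RNA composition example. It is not DESeq2's actual code, only the general procedure: build a per-gene geometric mean across samples, take each sample's ratio to it, and use the median ratio as that sample's size factor. The function name and data layout are assumptions.

```python
import math
from statistics import median


def size_factors(counts_by_sample):
    """Median-of-ratios size factors.
    counts_by_sample maps sample name -> list of counts (same gene order)."""
    samples = list(counts_by_sample)
    n_genes = len(counts_by_sample[samples[0]])

    # Log geometric mean per gene; genes with a zero in any sample are skipped.
    log_geo_means = []
    for i in range(n_genes):
        values = [counts_by_sample[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts_by_sample[s][i]) - lgm
            for i, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy counts: Gene 1 to Gene 4, 90 reads per condition.
counts = {"A": [6, 6, 6, 72], "B": [30, 30, 30, 0]}
sf = size_factors(counts)
normalized = {s: [c / sf[s] for c in counts[s]] for s in counts}

print(sf)
print(normalized["A"][0], normalized["B"][0])  # Gene 1 now agrees between A and B
```

Because Gene 4 has a zero count in B, it is left out of the reference, and the three unchanged genes drive the size factors. A real data set has thousands of genes, and the method relies on most of them not being differentially expressed.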

Plus, the most widely used model for gene counts:

  • negative binomial distribution (edgeR, DESeq2)

Add to that a multiple-testing correction, to report significantly differentially expressed genes (DESeq2).
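As an illustration of the multiple-testing step, here is a sketch of the Benjamini-Hochberg adjustment, a common way to control the false discovery rate. I am assuming this is the kind of correction meant here; the function name and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a differential expression test.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted)
print(significant)
```

With thousands of genes tested at once, filtering on the adjusted values rather than the raw p-values keeps the expected share of false positives among the reported genes under the chosen threshold.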

I would say that, for the same amount of money, you are better off adding replicates than increasing coverage per gene.

This is my naive understanding of the subject; feel free to correct what I said here.

Useful links (one is in French, sorry): pub6, pub7

Other links : pub8, pub9, pub10, pub11, pub12


Hi Bastien, I am new to RNAseq. I was wondering do you have to specify the conditions (e.g. library size) when using RLE or TMM to normalize the data set? Do RLE and TMM take all the factors (e.g. library size, gene length, RNA population composition) into consideration automatically? Thanks!


Hello,

First, if you're new to RNA-seq, I strongly suggest you read the whole DESeq2 vignette. Then ask yourself: do I have this bias in my data?

  • library size: you have an equivalent number of mapped reads across samples -> not needed
  • gene length: you do not compare gene A expression against gene B expression -> not needed
  • RNA population composition: you know that you don't have genes that are very highly expressed in A and very lowly expressed in B -> not needed

RNA population composition is hard to detect; I'll leave you some info here and there.

Library size is the major bias and can be handled in DESeq2 using the sizeFactor.


Hi Bastien, Thanks a lot!! I'll definitely take a look at the DESeq2 vignette you recommended. It looks like gene length and RNA population composition won't be a problem for me.

6.1 years ago
h.mon 35k

The differences between the number of readings of the samples is not due to the differential expression of the genes?

No, the differences in the number of reads are due to accidental variation in how much of each library is loaded onto the flowcell and sequenced.

When loading a multiplexed RNA-seq experiment onto the flowcell, one quantifies the amount of DNA (the initial RNA is reverse transcribed to DNA) in each library, "normalizes" the libraries (i.e., dilutes all libraries to the same DNA concentration), and loads the same amount of each library onto the flowcell. In an ideal world, all libraries would have the same number of reads, and no library size normalization would be necessary for the analysis. This is rarely the case, though, and there is substantial variation in read numbers between libraries.


OK. But there may also be differences in expression. How do we know whether the differences are due to accidental variation or to true differential expression? Thank you.


Could you edit your title please?


Read the edgeR User Guide; it provides both a good introduction to differential expression analysis and several references if you want to study the methods in detail.

6.1 years ago

Apart from the differences in library depth explained by h.mon, an additional problem is that RNA-seq samples frequently contain different amounts of different RNA types.

A simple example: you have more rRNA in one sample than in another (let's say 1% vs 20%). If you do not take this into account, it would look like the majority of protein-coding genes were downregulated, simply because they would get a smaller fraction of the reads.
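The rRNA example can be put into numbers with a short Python sketch. It assumes equal library sizes and a protein-coding gene whose true expression is identical in both samples. The function name is hypothetical.

```python
def apparent_fold_change(rrna_fraction_1, rrna_fraction_2):
    """Apparent fold change (sample 2 / sample 1) of an unchanged
    protein-coding gene when only the rRNA share of the reads differs."""
    return (1 - rrna_fraction_2) / (1 - rrna_fraction_1)


# 1% rRNA in sample 1 versus 20% rRNA in sample 2.
fold_change = apparent_fold_change(0.01, 0.20)
print(round(fold_change, 3))  # about 0.808, i.e. an apparent ~19% drop
```

Every protein-coding gene is shifted by the same factor, which is why the effect looks like a global downregulation rather than a change in a few genes.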

Such effects are handled by inter-library normalization, an analysis built into all the major DE tool workflows. You can read more about this problem here.
