Question

What are SizeFactors

5.9 years ago

2405592M ▴ 150

Hi guys, I'm new to RNA-seq and R programming, so forgive my ignorance in advance! I'm currently using a programme/script to help me map tRNAs (the supplied notes don't explain in much detail), and after tRNA counts are generated in all my conditions, they use SizeFactors to normalize the dataset (in DESeq2). I've tried to read up on what exactly SizeFactors are and I don't understand it. Could anyone give me an easy-to-understand definition of what size factors are and why they're used to normalize the data?

RNA-Seq R normalization • 8.0k views

score 23 · Accepted Answer · 2019-01-17

A size factor relates to how many reads there are in each library. Imagine you had two samples where 10% of the reads in each sample were from gene A, but 1M reads had been sequenced in one sample and 2M in the other. There would then be a two-fold increase in the counts from gene A in sample 2 compared to sample 1, even though the actual expression levels were the same.
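To make the library-size effect concrete, here is a minimal Python sketch using the numbers from the example above. The sample names and the "relative to the smallest library" scaling are assumptions chosen for illustration; this is not how DESeq2 computes its size factors.

```python
# Numbers from the example: gene A makes up 10% of the reads in both samples.
library_sizes = {"sample1": 1_000_000, "sample2": 2_000_000}
fraction_gene_a = 0.10

# Raw counts for gene A scale with sequencing depth, not with expression.
raw_counts = {s: round(n * fraction_gene_a) for s, n in library_sizes.items()}

# Naive size factor (illustrative assumption): library size relative to the
# smallest library, so the smallest library gets a factor of 1.
smallest = min(library_sizes.values())
size_factors = {s: n / smallest for s, n in library_sizes.items()}

# Dividing by the size factor puts both samples on a common scale.
scaled = {s: raw_counts[s] / size_factors[s] for s in raw_counts}

print("raw counts:  ", raw_counts)
print("size factors:", size_factors)
print("scaled:      ", scaled)
```

The raw counts differ two-fold, while the scaled values agree, which matches the fact that gene A's expression is the same in both samples.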

Early RNAseq analysis divided counts by the total number of reads in each library, but this is poor practice for two reasons:

1. Using division means that you lose the discrete nature of the gene counts, and thus negative binomial statistics no longer apply. Thus normalising factors are used as offsets in the linear model, rather than divisors.
2. In most RNAseq samples the most highly expressed genes take up the majority of the reads. Thus in a 1M read sample, if 300k reads came from gene A (leaving 700k for all the others), and that gene doubled in expression to 600k reads (leaving 400k for all the others), the expression of the other genes would appear to drop by almost half, even though they have stayed the same.
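The second effect can be checked with a few lines of Python. This sketch uses the read numbers from the example; `geneB` and `geneC` are hypothetical genes that split the remaining reads equally, and the `cpm` helper is only an illustration of total-count scaling.

```python
# Read counts from the example: both libraries contain 1M reads in total.
sample1 = {"geneA": 300_000, "geneB": 350_000, "geneC": 350_000}
sample2 = {"geneA": 600_000, "geneB": 200_000, "geneC": 200_000}


def cpm(counts):
    """Total-count scaling: counts per million reads in the library."""
    total = sum(counts.values())
    return {gene: c / total * 1_000_000 for gene, c in counts.items()}


cpm1, cpm2 = cpm(sample1), cpm(sample2)

# Apparent fold change of each gene after total-count scaling.
fold_change = {gene: cpm2[gene] / cpm1[gene] for gene in sample1}

for gene, fc in fold_change.items():
    print(f"{gene}: apparent fold change {fc:.2f}")
```

Because both libraries have the same total, total-count scaling changes nothing here: gene A shows its real doubling, but genes B and C also appear to fall sharply purely because gene A took a larger share of the reads.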

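Thus sizeFactors are related to the library size (total number of reads in the library), but are calculated in such a way as to compensate for effect 2 above. One common method is upper-quartile normalisation: find the 75th centile of the distribution of read counts for each sample, and then calculate a normalisation factor such that the 75th centile is the same across all samples. DESeq2 by default uses a different but related approach, the median-of-ratios method: for each gene it takes the geometric mean of the counts across all samples as a reference, divides each sample's counts by that reference, and uses the median of those ratios within a sample as that sample's size factor. Both methods assume that most genes do not change, so a handful of very highly expressed genes cannot skew the factor.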
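As a rough illustration of how such a factor can be computed, the sketch below implements the median-of-ratios idea, which is the default scheme behind DESeq2's `estimateSizeFactors`. It is a plain-Python approximation for intuition only, not DESeq2's code; the function name `median_of_ratios` and the toy counts are made up for this example. In R you would simply call `estimateSizeFactors(dds)` and read the result with `sizeFactors(dds)`.

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Sketch of median-of-ratios size factors.

    counts: dict mapping sample name -> list of counts (same gene order).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples would be zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Per-gene reference: log of the geometric mean across samples.
    log_ref = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_ref.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_ref.append(None)

    # Size factor per sample: median ratio of its counts to the reference.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_ref[g]
            for g in range(n_genes)
            if log_ref[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy data (hypothetical): sample2 is sequenced twice as deeply, and the
# first gene has also truly doubled in expression.
toy_counts = {
    "sample1": [300, 100, 200, 150, 250],
    "sample2": [1200, 200, 400, 300, 500],
}
factors = median_of_ratios(toy_counts)
normalised = {
    s: [c / factors[s] for c in toy_counts[s]] for s in toy_counts
}
print(factors)
print(normalised)
```

In this toy case the size factors differ two-fold, reflecting only the sequencing depth. After dividing by them, the four unchanged genes line up between samples and only the first gene shows a two-fold difference, because the median ignores the one gene that really changed.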