Question

Normalization Method In Comparing Differential Gene Expression

0

Entering edit mode

12.7 years ago

camelbbs ▴ 710

HI,

I see cuffdiff compare the RPKM from two samples but DESeq and edgeR compare the raw read counts. Now what is the normalization method in DESeq, as I know it use median of scaled counts, what is that? How to get it from the raw read counts.

Thanks, Che

rna-seq • 24k views

ADD COMMENT • link updated 7.0 years ago by ra2967 ▴ 30 • written 12.7 years ago by camelbbs ▴ 710

0

Entering edit mode

Hello! Thanks for the comments. Must you eliminate those probes that are always zero in all your samples before doing the normalization or it is not necessary?

ADD REPLY • link 7.0 years ago by ra2967 ▴ 30

0

Entering edit mode

Please do not add an answer unless you're answering the top level question. I'm moving this to a comment, but please be more careful in the future.

ADD REPLY • link 7.0 years ago by Ram 45k

score 10 · Answer 1 · 2012-11-14

The R code that computes size factors in DESeq can be viewed by:

library(DESeq)
estimateSizeFactorsForMatrix

function (counts, locfunc = median) 
{
    loggeomeans <- rowMeans(log(counts))
    apply(counts, 2, function(cnts) exp(locfunc((log(cnts) - 
        loggeomeans)[is.finite(loggeomeans)])))
}
<environment: namespace:DESeq>

You can see how it works in the following example, using crudely simulated RNA-Seq data:

library(DESeq)
set.seed(31415926)

# Create some simulated count data.
# All samples have the same underlying count, but with varying numbers
# total reads per sample. Poisson distribution used for simplicity,
# but note that real RNA-Seq data has higher variance than poisson.

true_size_factors = c(1, 1.2, 1.5, 2.0)
true_mean_counts = c(10, 20, 100, 500, 1000, 5000)

e = expand.grid(true_size_factors, true_mean_counts)
lambdas = e$Var1 * e$Var2

raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")

raw_data
#        sample_1 sample_2 sample_3 sample_4
# gene_1       15        9       11       18
# gene_2       19       21       21       40
# gene_3      106      114      153      207
# gene_4      569      565      756      992
# gene_5     1029     1260     1559     1968
# gene_6     5049     5897     7537    10029

sizes = estimateSizeFactorsForMatrix(raw_data)
sizes
#  sample_1  sample_2  sample_3  sample_4 
# 0.7605013 0.8659266 1.0771231 1.4414108 

raw_data / do.call(rbind, rep(list(sizes), 6))
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022

Compare this to the more automated method described in the DESeq vignette:

cds = newCountDataSet(raw_data, conditions=c("trt", "trt", "cont", "cont"))
cds = estimateSizeFactors(cds)
cts = counts(cds, normalized=TRUE)

cts
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022

I would also add that the variance stabilizing transformation provided by DESeq is very useful. The Authors recommend using it for clustering, heatmaps, or other visualization.

# Use the variance stabilized data for clustering, heatmaps, etc.
cds = estimateDispersions(cds, method="blind")
vsd = getVarianceStabilizedData(cds)
vsd
#         sample_1  sample_2  sample_3  sample_4
# gene_1  7.414359  7.435870  7.422947  7.372632
# gene_2  7.544171  7.469999  7.607719  7.489916
# gene_3  8.379632  8.334064  8.315754  8.338468
# gene_4  9.814620  9.784286  9.816016  9.850681
# gene_5 10.494539 10.634479 10.620615 10.677170
# gene_6 12.715437 12.842110 12.778146 12.808379

score 5 · Answer 2 · 2012-11-14

5

Entering edit mode

12.7 years ago

Nicolas Rosewick 11k

read the deseq paper it's well explained in it: http://genomebiology.com/2010/11/10/R106

and check also the R package with a very easy-reading vignette : http://bioconductor.org/packages/release/bioc/html/DESeq.html

ADD COMMENT • link 12.7 years ago by Nicolas Rosewick 11k