Normalization Method In Comparing Differential Gene Expression
2
0
Entering edit mode
12.0 years ago
camelbbs ▴ 710

HI,

I see cuffdiff compare the RPKM from two samples but DESeq and edgeR compare the raw read counts. Now what is the normalization method in DESeq, as I know it use median of scaled counts, what is that? How to get it from the raw read counts.

Thanks, Che

rna-seq • 23k views
ADD COMMENT
0
Entering edit mode

Hello! Thanks for the comments. Must you eliminate those probes that are always zero in all your samples before doing the normalization or it is not necessary?

ADD REPLY
0
Entering edit mode

Please do not add an answer unless you're answering the top level question. I'm moving this to a comment, but please be more careful in the future.

ADD REPLY
10
Entering edit mode
12.0 years ago
bdemarest ▴ 460

The R code that computes size factors in DESeq can be viewed by:

library(DESeq)
estimateSizeFactorsForMatrix

function (counts, locfunc = median) 
{
    loggeomeans <- rowMeans(log(counts))
    apply(counts, 2, function(cnts) exp(locfunc((log(cnts) - 
        loggeomeans)[is.finite(loggeomeans)])))
}
<environment: namespace:DESeq>

You can see how it works in the following example, using crudely simulated RNA-Seq data:

library(DESeq)
set.seed(31415926)

# Create some simulated count data.
# All samples have the same underlying count, but with varying numbers
# total reads per sample. Poisson distribution used for simplicity,
# but note that real RNA-Seq data has higher variance than poisson.

true_size_factors = c(1, 1.2, 1.5, 2.0)
true_mean_counts = c(10, 20, 100, 500, 1000, 5000)

e = expand.grid(true_size_factors, true_mean_counts)
lambdas = e$Var1 * e$Var2

raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")

raw_data
#        sample_1 sample_2 sample_3 sample_4
# gene_1       15        9       11       18
# gene_2       19       21       21       40
# gene_3      106      114      153      207
# gene_4      569      565      756      992
# gene_5     1029     1260     1559     1968
# gene_6     5049     5897     7537    10029

sizes = estimateSizeFactorsForMatrix(raw_data)
sizes
#  sample_1  sample_2  sample_3  sample_4 
# 0.7605013 0.8659266 1.0771231 1.4414108 

raw_data / do.call(rbind, rep(list(sizes), 6))
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022

Compare this to the more automated method described in the DESeq vignette:

cds = newCountDataSet(raw_data, conditions=c("trt", "trt", "cont", "cont"))
cds = estimateSizeFactors(cds)
cts = counts(cds, normalized=TRUE)

cts
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022

I would also add that the variance stabilizing transformation provided by DESeq is very useful. The Authors recommend using it for clustering, heatmaps, or other visualization.

# Use the variance stabilized data for clustering, heatmaps, etc.
cds = estimateDispersions(cds, method="blind")
vsd = getVarianceStabilizedData(cds)
vsd
#         sample_1  sample_2  sample_3  sample_4
# gene_1  7.414359  7.435870  7.422947  7.372632
# gene_2  7.544171  7.469999  7.607719  7.489916
# gene_3  8.379632  8.334064  8.315754  8.338468
# gene_4  9.814620  9.784286  9.816016  9.850681
# gene_5 10.494539 10.634479 10.620615 10.677170
# gene_6 12.715437 12.842110 12.778146 12.808379
ADD COMMENT
0
Entering edit mode

Thanks, very nicely explained

ADD REPLY
0
Entering edit mode

Hi, The code and explanation is definitely helpful! However, I found that I get different numbers for sizes when I paste the code in R. The values that I get match with what I get in Excel. Was wondering what could be the reason for the different sizes you get. P.S. I also want to note that the final numbers match with DESEQ normalization, so it is more of a curiosity and not a major concern. Thanks.

> set.seed(31415926)
>
> # Create some simulated count data.
> # All samples have the same underlying count, but with varying numbers
> # total reads per sample. Poisson distribution used for simplicity,
> # but note that real RNA-Seq data has higher variance than poisson.
>
> true_size_factors = c(1, 1.2, 1.5, 2.0)
> true_mean_counts = c(10, 20, 100, 500, 1000, 5000)
>
> e = expand.grid(true_size_factors, true_mean_counts)
> lambdas = e$Var1 * e$Var2
>
> raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
> rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
> colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")
>
> raw_data
       sample_1 sample_2 sample_3 sample_4
gene_1       15        9       11       18
gene_2       19       21       21       40
gene_3      106      114      153      207
gene_4      569      565      756      992
gene_5     1029     1260     1559     1968
gene_6     5049     5897     7537    10029
> estimateSizeFactorsForMatrix = function (counts, locfunc = median)
+ {
+     loggeomeans <- rowMeans(log(counts))
+     apply(counts, 2, function(cnts) exp(locfunc((log(cnts) -
+         loggeomeans)[is.finite(loggeomeans)])))
+ }
> sizes = estimateSizeFactorsForMatrix(raw_data)
> sizes
 sample_1  sample_2  sample_3  sample_4
0.7735959 0.8353574 1.0863588 1.4357334

> raw_data / do.call(rbind, rep(list(sizes), 6))
         sample_1   sample_2   sample_3   sample_4
gene_1   19.38997   10.77383   10.12557   12.53715
gene_2   24.56063   25.13894   19.33063   27.86033
gene_3  137.02245  136.46853  140.83745  144.17719
gene_4  735.52615  676.35717  695.90269  690.93608
gene_5 1330.15186 1508.33635 1435.06918 1370.72804
gene_6 6526.66350 7059.25354 6937.85528 6985.28022
ADD REPLY
5
Entering edit mode
12.0 years ago

read the deseq paper it's well explained in it: http://genomebiology.com/2010/11/10/R106

and check also the R package with a very easy-reading vignette : http://bioconductor.org/packages/release/bioc/html/DESeq.html

ADD COMMENT

Login before adding your answer.

Traffic: 808 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6