Question

Differential expressed gene analysis ---- to normalize the count data and fit into a model

0

Entering edit mode

10.2 years ago

juncheng ▴ 220

Hi,

I have two corresponded count table, they are RNA-seq count data.

Example, actual matrix much larger:

Matrix A:

    3    3    6    2     0     0
   15   10    0    8     0     0
    0    0    0    0     0     0
    0    0    0    0     0     0
    3   14    2    2     2     7

Matrix B:

   29.0        31.5          27        29.5        0.0           0
   33.0        37.5          0         34.0        0.0           0
   0.0         0.0           0         0.0         0.0           0
   0.0         0.0           0         0.0         0.0           0
   24.5        22.5          26        23.5        23.5          24

The second matrix is a average of two integer count table, so it has *.5 numbers.

The two matrix are corresponding with each other, that means what in the first matrix is 0, it is 0 at the second.

The problem is, I want normalize the first matrix with the second one, i.e, by element-wise division of first matrix with the second.

the data are extremely skewed with a lot of '0's.
the number in second matrix is general much larger than the first. After division, I end up with a matrix containing very tinny numbers and '0's.

I want to find differentially expression genes across conditions. (each 3 columns are replicates). I'm now stacked with how to properly normalize and what kind of model to fit into a statistic test.

Any help would be appreciated.

Best

R RNA-Seq • 3.0k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by juncheng ▴ 220

1

Entering edit mode

You can perform gene-wise normalization in limma, edgeR and DESeq2 (this is how conditional quantile normalization works), but the bigger questions are:

What does matrix B actually represent? That is, where did the original count tables come from?
What's the actual goal of the gene-wise normalization?

Answering #2 will also have answer the appropriate type of model to use.

ADD REPLY • link 10.2 years ago by Devon Ryan 104k

0

Entering edit mode

It actually hard to explain, my goal is quite specific. Short answer is the first count data is dependent on the second, as I'm interested in the part of variance from matrix A, which are independent of the matrix B, I would like to normalize the matrix A against B.

ADD REPLY • link 10.2 years ago by juncheng ▴ 220

0

Entering edit mode

OK, well then doing something like the following in DESeq2 may be the easiest:

normalizationFactors(dds) <- matrix_B/mean(matrix_B)

Note that dds is a DESeqDataSet object. You can do something similar in edgeR and limma.

ADD REPLY • link 10.2 years ago by Devon Ryan 104k

0

Entering edit mode

I should mention that using matrix B directly as I showed may be the opposite of what you want. This way will essentially bias estimates toward values with high matrix B values. If you instead want the exact opposite of this then you'll need to modify matrix B accordingly before using it (still divide it by its mean).

ADD REPLY • link 10.2 years ago by Devon Ryan 104k

Ram · Answer 1 · 2014-09-18

Your matrix A and B - are they environmental replicates? Integrate these two matrices separately into your analysis if so - try filtering instead. Below is a formula I use in edgeR to filter low-expressed genes across RNA-Seq count tables - then use the exact test to compare these groups. First, try using some code I from a very good Nature Protocols paper:

Anders, S. et al. (2013) Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat. Protoc. 8, 1765–86

library(edgeR)

group <- factor(c(rep("your_group_1", 3) , rep("your_group_2", 3)))

d <- DGEList(counts = counts, group = group, genes = genes)
k <- rowSums(cpm(d)>1)>=3
d <- d[k,]

d<-estimateCommonDisp(d,verbose=TRUE)
d<-calcNormFactors(d,method="TMM")
d<-estimateTagwiseDisp(d,verbose=TRUE)
d.est<-estimateSmoothing(d)

et=exactTest(d,pair=c(1:2))

tt=topTags(et,n=nrow(d))
rn=rownames(tt$table)
deg=rn[tt$table$FDR<0.05]

plotMeanVar(d,show.tagwise.vars=TRUE,NBline=TRUE)
plotBCV(d)
plotSmear(d,de.tags=deg)

Cheers!