TMM normalized TPM
1
0
Entering edit mode
3.9 years ago
firestar ★ 1.6k

I have raw counts and gene lengths. Any tool that can compute TMM normalised gene length corrected data? I want to compare expression of one gene to another gene. Standard TPM doesn't seem to take compositional bias into account?

Would something like this in DESeq2 work?

x <- raw_counts/gene_length

library(DESeq2)
d <- DESeqDataSetFromMatrix(countData=x,colData=m,design=~Sample)
d <- DESeq2::estimateSizeFactors(d,type="ratio")
y <- counts(d,normalized=TRUE)

And y would be GeTMM.

RNA-Seq • 3.0k views
ADD COMMENT
0
Entering edit mode

I think so, although I would do the opposite since DEseq2 expects raw counts as input (first do the TMM normalization, then scale to gene length). Not sure whether it will change the results much, but at least it should prevent DESeq2 from throwing warnings.

ADD REPLY
0
Entering edit mode

I tried to do a reproducible example to go from raw counts to TMM normalized gene length corrected TPM counts.

Download sample data file GSE60450_Lactation-GenewiseCounts.txt from https://figshare.com/s/1d788fd384d33e913a2a.

raw_data <- read.delim("GSE60450_Lactation-GenewiseCounts.txt",header=T)
colnames(raw_data) <- c("gene","length",LETTERS[1:12])
head(raw_data)

##        gene length   A   B   C   D   E   F   G   H   I   J
## 1    497097   3634 438 300  65 237 354 287   0   0   0   0
## 2 100503874   3259   1   0   1   1   0   4   0   0   0   0
## 3 100038431   1634   0   0   0   0   0   0   0   0   0   0
## 4     19888   9747   1   1   0   0   0   0  10   3  10   2
## 5     20671   3130 106 182  82 105  43  82  16  25  18   8
## 6     27395   4203 309 234 337 300 290 270 560 464 489 328
##     K   L
## 1   0   0
## 2   0   0
## 3   0   0
## 4   0   0
## 5   3  10
## 6 307 342

Take count data to a new object. Add genes as rownames.

cr <- raw_data[,3:14]
rownames(cr) <- raw_data$gene
cr[1:5,1:5]

##             A   B  C   D   E
## 497097    438 300 65 237 354
## 100503874   1   0  1   1   0
## 100038431   0   0  0   0   0
## 19888       1   1  0   0   0
## 20671     106 182 82 105  43

Create DESeq2 object.

d <- DESeqDataSetFromMatrix(countData=cr,colData=data.frame(Sample=factor(colnames(cr))),design=~Sample)
d <- DESeq2::estimateSizeFactors(d,type="ratio")

Then the gene length file is prepared. I am really not sure about this part. DESeq2 needs an assay of gene lengths I think. So I prepared a dataframe equal in dimensions to the input count dataframe.

# prepare gene length data
gl <- matrix(rep(raw_data$length,length(cr)),byrow=F,nrow=nrow(cr))
colnames(gl) <- colnames(cr)
rownames(gl) <- rownames(cr)
head(gl)

             A    B    C    D    E    F    G    H    I    J    K    L
497097    3634 3634 3634 3634 3634 3634 3634 3634 3634 3634 3634 3634
100503874 3259 3259 3259 3259 3259 3259 3259 3259 3259 3259 3259 3259
100038431 1634 1634 1634 1634 1634 1634 1634 1634 1634 1634 1634 1634
19888     9747 9747 9747 9747 9747 9747 9747 9747 9747 9747 9747 9747
20671     3130 3130 3130 3130 3130 3130 3130 3130 3130 3130 3130 3130
27395     4203 4203 4203 4203 4203 4203 4203 4203 4203 4203 4203 4203

Add gene lengths to DESeq2 object.

assays(d)$avgTxLength <- gl

Now, FPM and FPKM can be computed.

fpm(d,robust=TRUE)[1:5,1:5]

##                     A           B          C           D
## 497097    18.85684388 13.77543859 2.69700983 10.45648006
## 100503874  0.04305215  0.00000000 0.04149246  0.04412017
## 100038431  0.00000000  0.00000000 0.00000000  0.00000000
## 19888      0.04305215  0.04591813 0.00000000  0.00000000
## 20671      4.56352843  8.35709941 3.40238163  4.63261775
##                   E
## 497097    16.442685
## 100503874  0.000000
## 100038431  0.000000
## 19888      0.000000
## 20671      1.997275

fpkm(d,robust=TRUE)[1:5,1:5]

##                     A           B          C          D
## 497097    4.165175871 3.092204540 0.66005325 2.70713930
## 100503874 0.010603758 0.000000000 0.01132312 0.01273687
## 100038431 0.000000000 0.000000000 0.00000000 0.00000000
## 19888     0.003545465 0.003842916 0.00000000 0.00000000
## 20671     1.170322849 2.178005299 0.96676308 1.39249019
##                   E
## 497097    4.4166109
## 100503874 0.0000000
## 100038431 0.0000000
## 19888     0.0000000
## 20671     0.6228664

For TPM, calculate FPKM, then divide by sum of column and multiply by 10⁶.

counts_fpkm <- fpkm(d,robust=TRUE)
counts_tpm <- (counts_fpkm/colSums(counts_fpkm))*10^6
counts_tpm[1:5,1:5]

##                      A           B          C           D
## 497097    11.753134666 2.084051870 0.42981212 4.059786074
## 100503874  0.029089363 0.000000000 0.00763144 0.008293971
## 100038431  0.000000000 0.000000000 0.00000000 0.000000000
## 19888      0.008168239 0.009407778 0.00000000 0.000000000
## 20671      2.718478619 5.017809428 2.36671617 3.820028160
##                  E
## 497097    7.522179
## 100503874 0.000000
## 100038431 0.000000
## 19888     0.000000
## 20671     1.757581

And now we compute sum of columns for TPM.

colSums(counts_tpm)

##         A         B         C         D         E         F 
##  740336.2  754730.7  851205.2  901837.7  913654.5  831364.9 
##         G         H         I         J         K         L 
##  754482.1  765813.5 1275904.8 1401345.7 2865426.6 2598466.1

Doesn't add up to 1M.

ADD REPLY
1
Entering edit mode

use t to transpose the matrix before dividing by the sum of columns and transpose it back:

counts_tpm <- t(t(counts_fpkm)/colSums(counts_fpkm))*10^6
colSums(counts_tpm)
##     A     B     C     D     E     F     G     H     I     J     K     L 
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
ADD REPLY
0
Entering edit mode

Yes! That works! Thanks!

ADD REPLY
0
Entering edit mode
3.9 years ago
ATpoint 86k

Simply use the DESeq2::fpkm function. It does fpkm but robustified with its sizefactor-based normalization approach. That should be all you need I guess. Alternatively use edgeR, so make a DGEList, then run calcNormFactors and then edgeR::rpkm, that is from the concept the same as DESeq2::fpkm.

ADD COMMENT
0
Entering edit mode

Do all columns sum up to 1 million in this case?

ADD REPLY
1
Entering edit mode

No. But you can convert FPKMs to TPMs with a simple formula. See: fpkm to tpm conversion

ADD REPLY
0
Entering edit mode

That's brilliant! Thanks!

ADD REPLY
0
Entering edit mode

I do not know, probably not.

ADD REPLY
0
Entering edit mode

Experts, correct me if I am wrong. I read that, one of the advantages of TPM is that columns sum to 1M so that the proportion of a gene is comparable sample to sample. And FPKM doesn't do this. So is it advisable to use FPKM?

ADD REPLY
1
Entering edit mode

If they are "robust" TPMs, they will not add up to 1M, but will be 1M on average.

ADD REPLY
1
Entering edit mode

Yup, and the FPKM to TPM conversion won't get you robust TPMs -- to get robust TPMs, it's advisable to first calculate your TPMs (which sum up to 1E6), then apply the size factor normalization on the TPMs. This is what's done in a few differential gene expression packages (e.g. sleuth).

ADD REPLY

Login before adding your answer.

Traffic: 3784 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6