I have counts data. I need to run software in R that accepts only normalized data. I normalized to TPM with this code:
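library(magrittr)  # provides %>%
# RPKM per sample (assumes geneLengths is in bp and ordered like the rows of raw1)
rpkm <- apply(X = raw1,
              MARGIN = 2,
              FUN = function(x) {
                10^9 * x / geneLengths / sum(as.numeric(x))
              })
# TPM: rescale each sample's RPKM values so the column sums to 10^6
TPM1 <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()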
However, check out the results. This is the expression of CD274 in both the counts data (raw1) and the normalized TPM data:
> raw1['CD274',]
Pt1 Pt10 Pt101 Pt103 Pt106 Pt11 Pt17 Pt18 Pt2 Pt24 Pt26 Pt27 Pt28 Pt29 Pt3 Pt30 Pt31 Pt34 Pt36 Pt37 Pt38 Pt39 Pt4 Pt44 Pt46 Pt47
CD274 1484 290 1421 251 203 888 608 1203 1340 1021 182 170 291 401 140 117 582 1177 191 152 111 24 187 705 1122 694
Pt48 Pt49 Pt5 Pt52 Pt59 Pt62 Pt65 Pt66 Pt67 Pt72 Pt77 Pt78 Pt79 Pt8 Pt82 Pt84 Pt85 Pt89 Pt9 Pt90 Pt92 Pt94 Pt98
CD274 224 1122 501 268 1277 270 705 276 88 157 2564 25 251 255 484 96 37 180 169 949 1477 128 321
> TPM1['CD274',]
Pt1 Pt10 Pt101 Pt103 Pt106 Pt11 Pt17 Pt18 Pt2 Pt24 Pt26 Pt27 Pt28 Pt29 Pt3
CD274 35.0266 4.280535 28.67831 3.449004 4.33621 19.67596 13.56328 25.0671 34.08708 17.27277 4.702501 4.485883 7.244041 8.91973 3.374477
Pt30 Pt31 Pt34 Pt36 Pt37 Pt38 Pt39 Pt4 Pt44 Pt46 Pt47 Pt48 Pt49 Pt5
CD274 3.103927 13.60881 20.59666 4.317299 3.056179 2.427168 0.5931633 3.93912 15.99866 15.20747 17.17709 4.543694 21.67822 13.52313
Pt52 Pt59 Pt62 Pt65 Pt66 Pt67 Pt72 Pt77 Pt78 Pt79 Pt8 Pt82 Pt84 Pt85
CD274 6.388511 27.02948 6.314665 13.77411 7.229454 1.796893 4.571327 56.26996 0.5742323 6.085954 5.998064 12.66232 2.63342 0.785415
Pt89 Pt9 Pt90 Pt92 Pt94 Pt98
CD274 3.325707 4.490217 14.52295 40.60652 3.159773 7.993074
Something doesn't make sense. Look at Pt103 and Pt106 in both of them. Pt103 has higher expression in the raw1 data, but in the TPM data Pt106 has higher expression. How could this be? Is my normalization wrong, or could this happen because of gene length?
LChart, I see, now I get it. I thought the whole process was faulty because of this and had to make sure.
Thank you!
The way to verify this is to look at the lengths of the two transcripts. Length is the only normalization factor when comparing transcripts within the same sample, so the raw_count/length ratios ought to show the same behavior as the TPM values.
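A minimal sketch of that check, using made-up toy values rather than the poster's data (the names counts, gene_len, rate, tpm and scale_factor are all hypothetical). It shows that within a sample the raw_count/length ratios and TPM rank genes identically, while each sample gets its own scaling constant, so the same gene can have a higher count but a lower TPM in another sample:

```r
# Toy data (hypothetical values): 3 genes x 2 samples
counts <- matrix(c(100, 450, 600,
                    80, 900,  20),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("S1", "S2")))
gene_len <- c(geneA = 1000, geneB = 3000, geneC = 2000)  # lengths in bp

# Length-normalized counts: every row is divided by that gene's length
rate <- counts / gene_len

# TPM: divide each column by its own sum, then scale to 10^6
tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6

# Within a sample, TPM is count/length times a constant,
# so both rank the genes in the same order
same_rank <- all(apply(rate, 2, rank) == apply(tpm, 2, rank))

# The constant is sample-specific, which is why values for one gene
# can swap order between samples after normalization
scale_factor <- 1e6 / colSums(rate)

print(counts["geneA", ])  # geneA has the higher raw count in S1
print(tpm["geneA", ])     # but the higher TPM in S2
print(scale_factor)
```

With the objects from the question, the analogous comparison would be colSums(raw1 / geneLengths)[c("Pt103", "Pt106")], assuming geneLengths is ordered like the rows of raw1.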