Entering edit mode
2.3 years ago
JACKY
▴
160
I have a counts dataframe of RNA-seq dataset, and got the gene lengths using this code:
exons = exonsBy(EnsDb.Hsapiens.v86, by="gene")
exons = reduce(exons)
len = sum(width(exons))
INDEX = intersect(rownames(counts),names(len))
geneLengths = len[INDEX ]
counts = counts[INDEX ,]
And used the lengths to normalize the data from counts to TPM, using this code:
rpkm <- apply(X = subset(counts),
MARGIN = 2,
FUN = function(x) {
10^9 * x / geneLengths / sum(as.numeric(x))
})
TPM <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()
I don't get errors, I do get the normalized data.. but I also get a list of warnings like this:
Warning messages:
1: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
2: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
3: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
4: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
5: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
6: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
7: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
8: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
9: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
And it goes on for about 50 warnings...
Should I pay attention to these warnings? do they affect the normalization process?
I think you should be very concerned. It is hard to see from the code fragments what goes wrong, but I would double check that all objects involved have the desired length or dimensions which is likely not the case. Simple example:
So, x and genelength are not of the same size. Further, why do you use the subset function without further arguments? See ?subset
Also, I suggest using a proven Bioconductor package to do the job.
A proven Bioconductor package you mean using a package for normalization?
If so, is there any package you recommend ?
for example edgeR see https://support.bioconductor.org/p/64585/
Refer to the problem with rpkm (and tpm) , https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7373998/ and other sources for a discussion of the validity of rpkm/tpm for cross sample comparisons.
Thanks!
Do you also think my genelength calculation method is slumpy?
What do you think about this approach?
Reposted (with additions) to Bioconductor https://support.bioconductor.org/p/9146002
Hi Gordon, do you really mean to link to this post? Looks very different
Sorry, wrong link, now corrected.