I have expression values in RPKM and RPM. The features on the RPKM scale range from 0 to 10,000, so before training a model I think it is necessary to normalize the data to a smaller range, such as 0-1, to avoid features with large values masking the others.
I think people use log(intensities) for microarrays. What is a reasonable approach for RNA-seq data, which contains lots of zero values?
I think there is no accepted consensus on how best to do this. The only way to find out definitively is to evaluate and compare different methods, e.g. by n-fold cross-validation. Please share your experiences with us; this will allow us to give better advice in the future.
Possibly you should not transform at all in this case. This paper contains some interesting arguments; even though it deals with counts in ecology, it should also be applicable to RNA-seq counts. The recommended alternative to a linear model on log-transformed data (possibly after adding a pseudocount or using vsn) would be to fit a negative-binomial generalized linear model to the raw counts. Theoretically, this should be a good method, as the current consensus seems to be to model RNA-seq variance with the negative binomial distribution.
We have already collected a lot of evidence against the use of RPKM/FPKM elsewhere. RPKM combines length and library-size normalization in a single value, which introduces a library-specific bias instead of removing it. In theory, if you think gene length is important, it might be better to model it explicitly as a model parameter; you can then check whether it has a significant impact by comparing models.
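To make the point about RPKM concrete, here is a minimal sketch of how RPM and RPKM are usually computed from a raw count. The function names and the example numbers are made up for illustration; only the standard definitions (reads per million mapped reads, and that value divided by gene length in kilobases) are assumed.

```python
def rpm(count, library_size):
    """Reads per million mapped reads: library-size normalization only."""
    return count * 1e6 / library_size


def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase per million: RPM divided by gene length in kb."""
    return rpm(count, library_size) / (gene_length_bp / 1e3)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million mapped reads.
print(rpm(500, 10_000_000))
print(rpkm(500, 2000, 10_000_000))
```

Both normalizations end up folded into one number, so neither the raw count nor the library size can be recovered from RPKM alone, which is why a count-based model needs the raw counts as input.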
Yes, it's one of the more enjoyable papers I have read. Thanks, Michael ;-)
written 10.5 years ago by jack (updated 2.9 years ago by Ram)
Thanks, Michael. Could you please give the link to the first paper (the one called "This paper")? When I click on it, the link is broken.
written 10.5 years ago by jack (updated 3.1 years ago by Ram)
I have corrected the link; here is the DOI: 10.1111/j.2041-210X.2010.00021.x. It's not PubMed-indexed. I don't think it is by any means an authoritative answer, but it contains interesting arguments. (And it addresses your question in the Introduction.)
@Michael Dndrp, it was a very interesting paper, and it seems that the rank-based Blom transformation works better. But doesn't ranking lose a lot of information? Also, do you know where I can find code for the different transformations?
written 10.4 years ago by jack (updated 3.1 years ago by Ram)
Just add a small value (0.01, perhaps) and then take the log. You never get 0 values in microarrays due to background fluorescence and binding, which adding a small value somewhat mimics.
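As a rough sketch of the suggestion above (the function name, the default pseudocount of 0.01, and the choice of base 2 are only illustrative assumptions, not a recommendation):

```python
import math


def log_with_pseudocount(values, pseudocount=0.01, base=2):
    """Add a small constant to every value, then take the log.

    The pseudocount keeps zeros finite: log(0) is undefined,
    while log(0 + pseudocount) is just a large negative number.
    """
    return [math.log(v + pseudocount, base) for v in values]


# Hypothetical RPKM values for one sample, including a zero.
expression = [0.0, 0.5, 12.0, 9800.0]
print(log_with_pseudocount(expression))
```

Note that the choice of pseudocount determines how far zeros sit from the smallest non-zero values after the transform, so it is worth trying a couple of values.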
This might sound too simple, but I think ranking is probably the best way to transform the data.
Depending on how you handled multi-mapped reads in your tag counting, I don't think inter-sample comparisons of gene expression are valid. For example, I don't think you can compare the expression value of gene A vs gene B within one sample. If you only used uniquely mapped reads for your tag count, then you are introducing a "sequence redundancy" bias into your expression values. Transcripts that share common domains (thus generating multi-mapped reads) will be artificially under-counted relative to transcripts that are totally unique.
If the inter-sample differences can't really be trusted, then a simple ranking is all that really matters.
But if you used some kind of multi-mapping strategy (RSEM, eXpress), maybe standardization or variance stabilization would be more valid?
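For anyone looking for code, a ranking transform can be sketched in a few lines of plain Python. The version below averages ranks over ties (which matters with many zeros) and also shows a Blom-style rank-based inverse normal transform using the usual (r - 3/8) / (n + 1/4) offset. The function names are mine, and this is only one way to do it.

```python
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def blom_transform(values):
    """Map ranks to standard normal quantiles (Blom offset)."""
    n = len(values)
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in average_ranks(values)]


# Hypothetical expression values with tied zeros.
print(average_ranks([0, 0, 5, 100]))
print(blom_transform([0, 0, 5, 100]))
```

Whether you rank genes within a sample or samples within a gene depends on which comparison you trust; the functions work on any list of values.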
That's a nice paper. I really appreciate that someone finally looked at this question with some worthwhile simulations.