Can LASSO be applied after converting and merging different RNAseq datasets into the same type of normalization data (e.g., FPKM, RPKM, CPM)?
10 months ago
memrekus ▴ 10

Hello everyone, this is my first question on this platform. I'm not a statistician, but I'm trying to understand whether we can merge differently normalized data (e.g., FPKM, RPKM, CPM) and apply LASSO regression for a comprehensive RNAseq cancer data analysis.

Reproducible Example

install.packages(c("reprex", "glmnet"))
library(reprex)
library(glmnet)

#Set seed for reproducibility

set.seed(123)

#Create sample FPKM data

fpkm_data <- matrix(rnorm(1000, mean = 10, sd = 5), ncol = 10)
rownames(fpkm_data) <- paste0("Gene", 1:100)
colnames(fpkm_data) <- paste0("Sample_FPKM", 1:10)

#Create sample RPKM data

rpkm_data <- matrix(rnorm(1000, mean = 5, sd = 2), ncol = 10)
rownames(rpkm_data) <- paste0("Gene", 1:100)
colnames(rpkm_data) <- paste0("Sample_RPKM", 1:10)

#Merge the datasets

merged_data <- cbind(fpkm_data, rpkm_data)

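#Create outcome variable (response): one value per sample (20 samples in total)

outcome_variable <- rnorm(ncol(merged_data), mean = 15, sd = 5)

#Apply LASSO regression
#glmnet expects samples in rows and genes in columns, so transpose the matrix

lasso_model <- cv.glmnet(x = t(merged_data), y = outcome_variable, alpha = 1, nfolds = 5)

#Display the results: cross-validation summary and coefficients at lambda.min
print(lasso_model)
coef(lasso_model, s = "lambda.min")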

I would greatly appreciate it if anyone with knowledge or insights on this topic could kindly provide their input. Your responses would be highly valued. Thank you in advance!

Statistics ML Lasso • 1.1k views
10 months ago
LChart 4.6k

You should not be combining different normalizations for this model, as the scale of the data (both mean and variance) will be different for the different normalizations, and therefore the estimated coefficients must also be heterogeneous across the datasets.

If you have a large number of samples per dataset, and the datasets are all roughly sequencing the same population, you could percentile or quantile normalize the data within gene (i.e., use the CPM dataset as the "target distribution" for RPKM/FPKM).
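A minimal sketch of that within-gene quantile mapping, assuming genes are in rows and samples in columns. The helper name `quantile_map_gene` and the simulated matrices are hypothetical and only illustrate the idea; it relies on both datasets sampling roughly the same population, as stated above.

```r
# Hypothetical helper: map one gene's values in a "source" dataset onto the
# empirical distribution of the same gene in a "target" dataset.
quantile_map_gene <- function(source_values, target_values) {
  # Empirical percentile of each source sample within its own dataset
  p <- (rank(source_values, ties.method = "average") - 0.5) / length(source_values)
  # Read off the matching quantiles of the target distribution
  as.numeric(quantile(target_values, probs = p, type = 7, names = FALSE))
}

set.seed(1)
n_genes <- 5
# Simulated toy data (genes in rows, samples in columns)
cpm_mat  <- matrix(rlnorm(n_genes * 30, meanlog = 3, sdlog = 1),   nrow = n_genes)
rpkm_mat <- matrix(rlnorm(n_genes * 20, meanlog = 1, sdlog = 0.5), nrow = n_genes)

# Apply gene by gene; the result has the same shape as rpkm_mat
rpkm_on_cpm_scale <- t(sapply(seq_len(n_genes), function(g) {
  quantile_map_gene(rpkm_mat[g, ], cpm_mat[g, ])
}))
```

The mapping keeps the ordering of samples within each gene and only changes the scale, so any real difference between the cohorts is erased along with the normalization difference.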

But the best way would really be to go back to counts and ensure the same normalization is applied where possible.
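As a rough illustration of going back to counts, the sketch below merges two simulated count matrices on their shared genes and applies one CPM normalization to all samples. The function `counts_to_cpm` and the toy data are assumptions made for the example, not part of any package.

```r
# Hypothetical helper: counts per million, using the total count per sample
counts_to_cpm <- function(counts) {
  lib_size <- colSums(counts)               # library size of each sample
  sweep(counts, 2, lib_size, "/") * 1e6     # divide each column by its library size
}

set.seed(2)
genes <- paste0("Gene", 1:50)
# Simulated raw counts for two datasets with different sequencing depth
counts_a <- matrix(rnbinom(50 * 6, mu = 100, size = 2), nrow = 50,
                   dimnames = list(genes, paste0("A", 1:6)))
counts_b <- matrix(rnbinom(50 * 4, mu = 300, size = 2), nrow = 50,
                   dimnames = list(genes, paste0("B", 1:4)))

# Keep only genes present in both datasets, then normalize everything the same way
shared        <- intersect(rownames(counts_a), rownames(counts_b))
merged_counts <- cbind(counts_a[shared, ], counts_b[shared, ])
merged_cpm    <- counts_to_cpm(merged_counts)
```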


Thank you for your response, @lchart. I'm also interested in whether log normalization is necessary after RPKM or CPM for LASSO.

I intend to perform a comprehensive LASSO analysis by merging different RNAseq datasets. Which normalization method would be more reasonable for standardizing all the diverse datasets? Is it preferable to apply a log transformation to the normalized values (using either CPM or RPKM as the normalization method), or is a standard normalization alone more likely to yield better results?


Because CPM/FPKM/RPKM normalizations are based on counts, these normalized values are non-negative and (for each gene) tend to follow something like a gamma or lognormal distribution. However, looking across genes, the mean and variance vary by several orders of magnitude, and the variance typically is a super-linear function of the mean ("overdispersed") -- hence the justification for using negative binomial as a typical count model.

This means that for normalized values like CPM and RPKM, the location and scale of each gene are not commensurate, so using these values naively (e.g., without weights) in a penalized approach will penalize genes unequally: the penalty acts on the coefficients, and the coefficient needed for a given effect depends on the gene's scale. This can be mitigated by scaling to mean-0, variance-1 within each gene.
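A small sketch of the per-gene scaling, using base R's `scale()` on simulated CPM-like values (the matrix and its parameters are made up for illustration). Note that `glmnet` and `cv.glmnet` standardize the predictors internally by default (`standardize = TRUE`), so explicit scaling mostly matters when that option is switched off or when another penalized method is used.

```r
set.seed(3)
# Simulated CPM-like values: 8 genes whose means span four orders of magnitude
gene_means <- 10^seq(0, 4, length.out = 8)
cpm_mat <- t(sapply(gene_means, function(m) rgamma(12, shape = 2, rate = 2 / m)))
rownames(cpm_mat) <- paste0("Gene", seq_along(gene_means))   # genes x samples

# Modelling functions expect samples in rows, genes in columns
x <- t(cpm_mat)

# Center each gene to mean 0 and scale it to variance 1
x_scaled <- scale(x, center = TRUE, scale = TRUE)

# Spread of the per-gene standard deviations before and after scaling
range(apply(x, 2, sd))
range(apply(x_scaled, 2, sd))
```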

Log-transforming via log(1+cpm) (etc.) helps to "stabilize" the variance (as log(1+E[X]) != E[log(1+X)]) and also places mean expression values on roughly the same scale. That said, if one is going to normalize all gene expression distributions to mean-0, variance-1, there is no technical reason to prefer log-transformed CPM over CPM. However, in all practical scenarios, downstream methodologies such as differential expression, gene co-expression, and clustering perform "better" (in terms of reproducibility or capturing positive controls) following log normalization (single-cell deconvolution is an exception that proves the rule), so one might reasonably conclude that log-transformed, normalized values are "closer" to the relevant biological measure than un-transformed normalized values.
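To make the variance-stabilizing effect concrete, here is a toy comparison on simulated gamma-distributed "CPM" values (all numbers are invented for the example). It compares the per-gene standard deviations before and after `log1p()`, and also checks the inequality between log(1+E[X]) and E[log(1+X)] mentioned above.

```r
set.seed(4)
# Simulated CPM-like values: 8 genes, 200 samples, means from 1 to 10,000
gene_means <- 10^seq(0, 4, length.out = 8)
cpm_mat <- t(sapply(gene_means, function(m) rgamma(200, shape = 2, rate = 2 / m)))

log_cpm <- log1p(cpm_mat)   # log(1 + CPM), safe for zeros

sd_raw <- apply(cpm_mat, 1, sd)
sd_log <- apply(log_cpm, 1, sd)

# Ratio of largest to smallest per-gene standard deviation
spread_raw <- max(sd_raw) / min(sd_raw)
spread_log <- max(sd_log) / min(sd_log)

# The log of the mean versus the mean of the logs, per gene
log_of_mean <- log1p(rowMeans(cpm_mat))
mean_of_log <- rowMeans(log_cpm)
```

Comparing `spread_raw` with `spread_log` shows how much closer the genes sit to a common scale after the transformation, even before any mean-0, variance-1 scaling.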

When it comes to transcript length normalization, there are clear arguments to prefer TPM to RPKM/FPKM, as TPM is comparable across samples. So the real choice is between TPM and CPM. [The following is strictly my opinion based on conceptual arguments] If data quality is generally the same across your set of samples, then CPM should be preferred, as the assignment of reads to transcripts and determination of "effective" transcript length necessarily adds noise into the system for what is, effectively, a constant that you throw away by setting all genes to mean=0, var=1. Indeed, simple count normalization does appear to perform better in replicates. On the other hand, if samples differ substantially by RIN to the point that effective transcript lengths are expected to differ, then TPM normalization may remove some of this artifact above and beyond incorporating RIN as a covariate into the model. This is simply an idea of where TPM may be doing something more than adding a normalization factor that is subsequently thrown away.
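For reference, a short sketch of how RPKM and TPM differ when computed from the same counts. The counts and gene lengths are simulated, and a single fixed length per gene is assumed rather than a per-sample "effective" length.

```r
set.seed(5)
n_genes <- 40
gene_length_bp <- round(runif(n_genes, 500, 10000))   # assumed fixed gene lengths
counts <- matrix(rnbinom(n_genes * 4, mu = 200, size = 2), nrow = n_genes)

# RPKM: scale by library size (in millions) first, then by length (in kb)
rpkm <- sweep(counts, 2, colSums(counts) / 1e6, "/") / (gene_length_bp / 1e3)

# TPM: scale by length first, then rescale each sample to sum to one million
rate <- counts / (gene_length_bp / 1e3)
tpm  <- sweep(rate, 2, colSums(rate), "/") * 1e6

colSums(tpm)    # identical for every sample
colSums(rpkm)   # differs from sample to sample
```

Because every TPM column sums to the same total, values are directly comparable across samples, which is the argument given above for preferring TPM over RPKM/FPKM.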

In short: log(1+CPM) is a really good starting point, though other normalization factors besides total count (effective library size, TMM, median) may be more appropriate for TCGA specifically - there could be established benchmarks. I'd stay away from length normalization unless there's a really convincing argument that it should be there.

Finally, you have far larger fish to fry in "comprehensive" pan-cancer analysis such as dealing with copy number alterations and fusion genes, and your time may be best spent using CPM due to its simplicity, and focusing on those other issues that may be more pertinent and valuable.


I am really grateful for your answer, @lchart. It was very beneficial for our project.
