I have multiple bulk RNA-seq datasets, all of which have been normalized to TPM. However, one or two of these datasets have undergone different normalization techniques. To standardize the scale across all datasets, I performed upper quartile normalization (UQN) followed by a log2 transformation.
I am now looking to integrate these datasets to analyze them using the IMPRES algorithm, which predicts responses to immunotherapy. However, I have some reservations.
IMPRES operates by comparing the expression levels of checkpoint genes, and accordingly assigns scores to each sample. I am uncertain about the most effective method to normalize the data prior to employing the IMPRES algorithm.
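To make the pairwise idea concrete, here is a minimal sketch of an IMPRES-style score: it counts how many "gene A is higher than gene B" relations hold within a single sample. The gene pairs, their direction, and the expression values are illustrative placeholders, not the published IMPRES feature list, and `pairwise_score` is a hypothetical helper rather than part of any IMPRES implementation.

```python
# Illustrative pairs only: NOT the published IMPRES feature list.
EXAMPLE_PAIRS = [
    ("PDCD1", "TNFSF4"),
    ("CD27", "PDCD1"),
    ("CD86", "CD200"),
]

def pairwise_score(sample, pairs=EXAMPLE_PAIRS):
    """Count how many (a, b) pairs satisfy expression[a] > expression[b]."""
    return sum(1 for a, b in pairs if sample[a] > sample[b])

# Made-up expression values for a single sample.
sample_tpm = {"PDCD1": 12.0, "TNFSF4": 3.5, "CD27": 8.0, "CD86": 20.0, "CD200": 25.0}
score = pairwise_score(sample_tpm)
print(score)  # only the first pair holds, so this prints 1
```

Because each comparison is made between two genes in the same sample, the score depends on the within-sample ordering of those genes rather than on their absolute scale.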
Would the application of a log2 transformation drastically alter this process? Additionally, what impact would UQN have?
Another aspect I am contemplating is whether it is more beneficial to analyze each dataset independently within the IMPRES algorithm, or should I first combine them, correct any batch effects that may arise due to different sources, and then proceed with the IMPRES analysis? Your help is much needed. Thank you.
Did you UQ-normalize the TPM values across datasets?
Yes, I applied UQ normalization to each dataset individually. This was necessary because some of the larger datasets I collected had undergone additional processing beyond TPM, including batch effect correction, among other things. (Unfortunately, I don't have access to the count data.)
Consequently, it was important to harmonize the scale across all datasets. Furthermore, I implemented a log2 transformation; however, I still have some reservations regarding the efficacy of this step, especially in the context of the IMPRES algorithm.
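For reference, the per-sample procedure described above could be sketched roughly as follows. This is an assumption about the steps, not the exact code that was run: it divides each sample by the 75th percentile of its nonzero values and then applies log2 with a pseudocount of 1. The function name `upper_quartile_log2`, the handling of zeros, and the pseudocount are all choices made for illustration.

```python
import math
import statistics

def upper_quartile_log2(sample, pseudocount=1.0):
    """Scale one sample by its upper quartile, then log2-transform.

    `sample` maps gene name -> expression value (e.g. TPM).
    Assumption: the upper quartile is taken over nonzero values only.
    """
    nonzero = [v for v in sample.values() if v > 0]
    # statistics.quantiles with n=4 returns the three quartile cut points;
    # index 2 is the 75th percentile.
    uq = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return {gene: math.log2(v / uq + pseudocount) for gene, v in sample.items()}

# Made-up values for a single sample.
sample = {"GENE_A": 0.0, "GENE_B": 2.0, "GENE_C": 4.0, "GENE_D": 6.0, "GENE_E": 8.0}
normalized = upper_quartile_log2(sample)
```

Applied per sample, this rescales every gene in that sample by the same positive factor before the log step.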
The log2 doesn't matter here. It only transforms the range of values; because it is monotonic, it preserves the ordering of genes within a sample, so it doesn't affect compatibility, IMO.
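A quick way to check this: log2 is strictly increasing, so it cannot flip the order of two genes within the same sample, and the same holds for dividing a whole sample by a positive constant, as per-sample UQ scaling does. The values, the scaling constant, and the helper `orderings` below are made up for illustration.

```python
import math
from itertools import combinations

def orderings(sample):
    """For every gene pair, record whether the first gene is expressed higher."""
    return {(a, b): sample[a] > sample[b] for a, b in combinations(sorted(sample), 2)}

# Made-up expression values for a single sample.
raw = {"GENE_A": 0.5, "GENE_B": 12.0, "GENE_C": 3.0, "GENE_D": 150.0}
logged = {g: math.log2(v + 1) for g, v in raw.items()}  # log2 with pseudocount
scaled = {g: v / 40.0 for g, v in raw.items()}          # one constant per sample

print(orderings(raw) == orderings(logged))  # True
print(orderings(raw) == orderings(scaled))  # True
```

Per-gene adjustments, such as batch correction, are a different matter: they shift genes by different amounts and can change these within-sample orderings.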
TPM is not comparable across samples, and applying additional transformations only tortures the data further. Is it possible to email the sources and request count data? That would be the best way to deal with this. TCGA uses FPKM-UQ, but it's all buyer beware, and I am highly skeptical of these impossible-burger, beyond-meat, ultra-processed metrics.
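One way to see the comparability issue: TPM values in each sample sum to one million, so each value is a share of that sample's total rather than an absolute amount. In the made-up example below, GENE_X has the same length-normalized abundance in both samples, yet its TPM differs fourfold because GENE_Z dominates the second sample. The helper `to_tpm` and all numbers are illustrative assumptions.

```python
def to_tpm(rates):
    """Convert length-normalized read rates (reads per kilobase) to TPM."""
    total = sum(rates.values())
    return {gene: rate * 1_000_000 / total for gene, rate in rates.items()}

# Same abundance for GENE_X and GENE_Y in both samples; only GENE_Z changes.
sample_1 = {"GENE_X": 10.0, "GENE_Y": 10.0, "GENE_Z": 80.0}
sample_2 = {"GENE_X": 10.0, "GENE_Y": 10.0, "GENE_Z": 380.0}

tpm_1 = to_tpm(sample_1)
tpm_2 = to_tpm(sample_2)
print(tpm_1["GENE_X"], tpm_2["GENE_X"])  # GENE_X drops although its abundance is unchanged
```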
I understand... but sadly, I can't ask for access to count data. I'm not using the gene expression data by itself; instead, I'm putting it into computer programs like CIBERSORT, IMPRES, TIDE, and others. Do you think using TPM with UQ normalization might not work well for this?
By the way, the TIDE and xCell programs need the data to be TPM normalized. It says that in their guide. I'm not sure about the others.
If they expect that, you should be fine with TPM, but adding further "normalization" is not recommended. I've never used TIDE, xCell, or CIBERSORT, but I think you should read through their documentation and papers to make sure they're fine with data containing batch effects.