I have TPM-normalized RNA-seq data from several different sources. I merged the datasets, applied a log2 transformation, and then corrected for batch effects. These steps brought all samples into a similar range, which I considered crucial for consistency.
I understand that differential expression analysis is typically done on raw counts, but I don't have them. The voom approach in limma works for normalized data, but is it still applicable after log2 transformation and batch effect correction?
Conducting differential expression analysis on my merged TPM data without these transformations might not yield accurate results due to the discrepancies in some sample values. What's the recommended approach?
The recommended approach is to include batch as a covariate in the model rather than performing DE on batch-corrected data, especially TPM data, which are inherently incomparable.
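To make "batch as a covariate" concrete, here is a minimal sketch in plain NumPy. It is not limma and not a substitute for it: it only shows the shape of the design matrix (intercept, the variable of interest, and one indicator column per non-reference batch) and an ordinary least-squares fit per gene on log-scale values. The data are simulated, and the names `benefit`, `batch`, `log_expr`, and `build_design` are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 12 samples, 5 genes, log2-scale expression.
n_samples, n_genes = 12, 5
benefit = np.array([0, 1] * 6)      # two-level variable of interest
batch = np.repeat([0, 1, 2], 4)     # three hypothetical source datasets

# Simulate a per-batch shift plus a true Benefit effect on gene 0 only.
batch_shift = np.array([0.0, 1.5, -1.0])[batch]
log_expr = rng.normal(5.0, 0.3, size=(n_samples, n_genes)) + batch_shift[:, None]
log_expr[:, 0] += 2.0 * benefit


def build_design(benefit, batch):
    """Intercept + Benefit + one indicator column per non-reference batch."""
    levels = np.unique(batch)
    batch_cols = [(batch == lv).astype(float) for lv in levels[1:]]
    return np.column_stack([np.ones(len(benefit)), benefit.astype(float)] + batch_cols)


design = build_design(benefit, batch)

# Fit every gene at once by ordinary least squares (samples x genes response).
coef, *_ = np.linalg.lstsq(design, log_expr, rcond=None)

# Row 1 holds the Benefit coefficient: one batch-adjusted log2 difference per gene.
benefit_effect = coef[1]
print(np.round(benefit_effect, 2))
```

The expression values are never modified here. The batch shift is absorbed by the batch columns of the design, so the Benefit coefficient is estimated within batches instead of from "corrected" data.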
I see, so currently my design looks like this (Benefit has two values only):
I'm removing the batch effect for cancer type (I have 4 cancer types). Are you saying that my design should look like this?
Your batch variable is Cancer Type?
Yes, it's a pan-cancer study.
Why are you batch correcting data where the batch is such a critical biological variable? That makes no sense.
Because I want my model to be able to classify response regardless of cancer type or anatomical location. I could use the cancer type as a predictive feature instead. You're right that it can be an important variable, so that's an option.
You seem to have a really nice machine learning background but not a great cancer background. Cancer is heterogeneous even within a specific cancer type. How do you expect your model to classify response based only on horribly mangled generic RNA-seq data? We use multi-omics on highly specific cancer subtypes and our models are not all that amazing. I don't see how removing critical biological information is going to give you anything better than a crapshoot.
Are you suggesting that I should skip batch correction for cancer type and instead incorporate it as a predictive feature in my model? I also have other variables such as gender, treatment type, and outputs from various deconvolution algorithms that estimate cell abundance for each sample. I use those cell abundances as predictive features as well.
I don't know machine learning, so I can't speak to "incorporate it as a predictive feature". My point is that treating cancer type as a mere batch variable will result in immense loss of context. Given how narrow your data is, such a broad question will not work in your favor. But I'm no expert on machine learning so you might stumble upon something. I'd recommend you consult some folks that have experience in cancer RNA-seq and make sure you understand what you're expecting from your data.
I see. You're right. We decided not to correct for cancer type and to use it as a predictor instead. Thanks for the help!
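For anyone landing here later, one common way to use a categorical variable such as cancer type as a predictor is to one-hot encode it and append the indicator columns to the other features. The sketch below only illustrates that idea on made-up inputs. It is not the poster's actual pipeline, and `expr_features`, `cancer_type`, and `one_hot` are hypothetical names.

```python
import numpy as np

# Hypothetical toy inputs: 6 samples, 3 expression-derived features.
expr_features = np.arange(18, dtype=float).reshape(6, 3)
cancer_type = np.array(["A", "B", "C", "D", "A", "B"])  # placeholder labels


def one_hot(labels):
    """Return (indicator matrix, sorted level names) for a categorical variable."""
    levels, codes = np.unique(labels, return_inverse=True)
    return np.eye(len(levels))[codes], levels


type_cols, levels = one_hot(cancer_type)

# Cancer type stays in the data as extra columns instead of being regressed out.
features = np.hstack([expr_features, type_cols])
print(features.shape)
```

Some models (unregularized linear models, for example) need one indicator column dropped to avoid collinearity with an intercept, while tree-based models generally do not.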