Hello, bioinformatics community,
I am currently working on a pan-cancer differential expression (DE) analysis and am unsure which data-processing pipeline to use. Here is what I have done so far:
Using TCGA data from TCGAbiolinks:
- I downloaded data for multiple cancer types via TCGAbiolinks (R).
- I merged all cancer types into one dataset and corrected for batch effects using ComBat-seq.
- I performed DE analysis using DESeq2.
Using TOIL-processed data from Xena:
- I downloaded the TOIL-processed expected-count data from the Xena hub.
- Since the TOIL data is log2(x+1)-transformed, I reversed the transformation and then performed DE analysis with DESeq2.
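
For reference, the back-transformation step can be sketched roughly as below. This is only an illustration in plain Python, not my actual pipeline code: the function name is hypothetical, and the rounding step is an assumption on my part, added because expected counts are generally non-integer while DESeq2 expects integer counts.

```python
import math

def expected_counts_from_log2(log_values):
    """Invert a log2(x + 1) transform and round to non-negative integers.

    Hypothetical helper for illustration; rounding is an assumption made
    because DESeq2 requires integer counts.
    """
    counts = []
    for v in log_values:
        raw = 2.0 ** v - 1.0               # undo log2(x + 1)
        counts.append(max(0, round(raw)))  # clamp tiny negatives, then round
    return counts

# Made-up values: log2(x + 1) of 0, 1, 10.4 and 1000
example = [0.0, 1.0, math.log2(10.4 + 1), math.log2(1000 + 1)]
print(expected_counts_from_log2(example))
```

Whether plain rounding is the right way to handle fractional expected counts here is part of what I am unsure about.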
Observations:
Interestingly, the differentially expressed genes (DEGs) identified from the TOIL data were closer to the DEGs I obtained from CCLE data (via DepMap), which made me lean towards using the TOIL pipeline for my study.
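
One way to put a number on "closer" is a set-overlap score between DEG lists. The sketch below uses the Jaccard index, which is just one possible choice and an assumption for illustration. The gene identifiers are made-up placeholders, not my results, and the function name is hypothetical.

```python
def jaccard(a, b):
    """Jaccard index between two collections of gene identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # define the overlap of two empty lists as 0
    return len(a & b) / len(union)

# Placeholder DEG lists for illustration only (not real results)
toil_degs = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
tcgabiolinks_degs = {"GENE_A", "GENE_F", "GENE_G", "GENE_D"}
ccle_degs = {"GENE_A", "GENE_B", "GENE_D", "GENE_E"}

print("TOIL vs CCLE:", jaccard(toil_degs, ccle_degs))
print("TCGAbiolinks vs CCLE:", jaccard(tcgabiolinks_degs, ccle_degs))
```

A higher score means a larger share of the combined gene list is common to both analyses. It ignores effect direction and size, so it is only a rough summary.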
However, something still bothers me, and I am not confident that this choice is correct.
Are there significant downsides to using TOIL-processed Xena data instead of TCGAbiolinks data for DE analysis?
I am looking for insights, or confirmation that my approach is methodologically sound.
Thank you for your help!