Hi guys,
we are working on a university project in which we want to find genes that discriminate between different cancer types. For that we are using gene expression data from the TCGA dataset.
A naive approach would be to simply run feature selection on the tumor data of each type. However, we assume that this would identify genes that are specific to the tissue of origin rather than to the tumor. For example, if we compare thyroid and lung cancer using only tumor data, we expect to find differentially expressed genes that reflect the original cell type, not the tumor itself. So we want to "normalize" the thyroid tumor data against healthy thyroid tissue first, to find genes that discriminate thyroid tumor from normal thyroid, and then compare these with the equally "normalized" genes for lung cancer.
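To make the idea concrete, here is a minimal toy sketch of this "normalization", assuming log2-scale expression values. All numbers are made up, the three gene rows are hypothetical, and the helper name `log_fold_change` is only for illustration. It is not a real differential expression analysis (no statistics, no variance modelling), just the arithmetic of the idea.

```python
import numpy as np

# Made-up log2 expression values: rows = 3 hypothetical genes, columns = samples.
# Gene 0 is tissue-specific, gene 1 changes in thyroid tumor, gene 2 in lung tumor.
thyroid_normal = np.array([[8.0, 8.2], [3.0, 3.2], [6.0, 6.2]])
thyroid_tumor = np.array([[8.1, 8.1], [5.0, 5.2], [6.1, 6.1]])
lung_normal = np.array([[2.0, 2.2], [3.0, 3.2], [6.0, 6.2]])
lung_tumor = np.array([[2.1, 2.1], [3.1, 3.1], [8.0, 8.2]])


def log_fold_change(tumor, normal):
    """Per-gene mean tumor minus mean normal, on the log2 scale."""
    return tumor.mean(axis=1) - normal.mean(axis=1)


# Naive comparison: tumor against tumor, which mixes tissue and tumor effects.
naive = thyroid_tumor.mean(axis=1) - lung_tumor.mean(axis=1)

# "Normalized" comparison: each tumor is first compared with its own healthy
# tissue, then the two tumor-vs-normal changes are compared with each other.
lfc_thyroid = log_fold_change(thyroid_tumor, thyroid_normal)
lfc_lung = log_fold_change(lung_tumor, lung_normal)
normalized = lfc_thyroid - lfc_lung

print("naive:     ", np.round(naive, 2))
print("normalized:", np.round(normalized, 2))
```

In this toy data the naive comparison is dominated by gene 0, which differs between thyroid and lung regardless of disease, while the normalized comparison leaves only genes 1 and 2.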
We have some ideas about how to do this ourselves, but we suppose that this is not an uncommon task. Has anyone heard of this "normalization" approach, and how is it usually done? We suppose that it is needed when clustering cancer types in order to see meaningful differences, but we could not find it in the literature we read.
We hope we have stated our problem in a comprehensible way; if not, feel free to ask. Thanks for your help!
What about this: "Genetic effects on gene expression across human tissues"
Can you block on tissue origin? For example, in the same way you might incorporate a batch effect (~ batch + group), you would instead incorporate tissue origin (~ tissue + tumour).
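As a rough illustration of what blocking on tissue means, here is a minimal sketch that fits the additive model ~ tissue + tumour for a single gene by ordinary least squares. The expression values are made up, and the function name `fit_blocked` and the coefficient labels are hypothetical. In practice you would hand the design to a dedicated differential expression package rather than fit it by hand.

```python
import numpy as np

# Made-up log2 expression values for ONE hypothetical gene in 8 samples.
tissue = np.array(["thyroid"] * 4 + ["lung"] * 4)
status = np.array(["normal", "normal", "tumour", "tumour"] * 2)
expr = np.array([5.0, 5.2, 7.1, 6.9, 9.0, 9.2, 11.1, 10.9])


def fit_blocked(expr, tissue, status):
    """Least-squares fit of expr ~ tissue + tumour; returns coefficients by name."""
    X = np.column_stack([
        np.ones(len(expr)),                  # intercept: normal thyroid baseline
        (tissue == "lung").astype(float),    # tissue-of-origin block
        (status == "tumour").astype(float),  # tumour effect, adjusted for tissue
    ])
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return dict(zip(["intercept", "tissue_lung", "tumour"], beta))


coef = fit_blocked(expr, tissue, status)
for name, value in coef.items():
    print(f"{name}: {value:.2f}")
```

Here the large thyroid-vs-lung difference is absorbed by the tissue term, so the tumour coefficient reflects only the tumour-vs-normal change. Note that this additive design estimates one tumour effect shared by both tissues; to ask whether the tumour effect differs between thyroid and lung, an interaction term (tissue:tumour) would be needed in addition.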