Hi
I'm investigating differential gene expression between tumour and matched-normal samples from TCGA (breast cancer). After running the differential expression analysis with DESeq2 (design: ~ patient + sample_type), I visualised the differences between sample types (tumour / matched-normal) with a heatmap of my genes of interest.
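To make explicit what I expect the paired design to do, here is a minimal sketch in plain Python (not my actual R code). It assumes an ordinary linear model on log-scale values rather than DESeq2's negative binomial GLM, and the patient labels and numbers are hypothetical. In a balanced paired layout, the sample_type coefficient of such a model is the mean within-patient difference, so each patient's baseline cancels out of the test.

```python
from statistics import mean, stdev

# Hypothetical log2 expression for a single gene: patient -> (normal, tumour).
log_expr = {
    "P1": (5.0, 6.0),
    "P2": (9.0, 10.5),
    "P3": (2.0, 2.5),
}

# Within-patient differences (tumour minus normal); the patient baseline cancels.
differences = [tumour - normal for normal, tumour in log_expr.values()]

# In a balanced paired linear model, the sample_type coefficient equals this mean.
paired_effect = mean(differences)

# Spread between patients (normals only) versus spread of the paired differences.
between_patient_sd = stdev([normal for normal, _ in log_expr.values()])
within_pair_sd = stdev(differences)

print(f"paired effect: {paired_effect:.2f}")
print(f"between-patient SD: {between_patient_sd:.2f}")
print(f"SD of paired differences: {within_pair_sd:.2f}")
```

In this toy example the baselines vary far more between patients than the tumour/normal differences do, which is the situation the patient term is meant to handle.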
Here is the heatmap showing expression of my genes of interest in the two groups (TP = tumour, NT = matched normal):
The expression of my genes of interest (which, according to the literature, are cancer-related) appears to depend strongly on patient. The order of patients is the same in the two clusters (i.e. patient 1 is column 1 in the TP group and column 1 in the NT group), and the two groups (NT and TP) share a very similar pattern ...
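One possible reading of this pattern, offered as an assumption rather than a diagnosis: the design formula only changes the model used for testing, not the transformed counts that go into the heatmap, so each patient's baseline is still present in the plotted values. The sketch below (plain Python, hypothetical gene names and values) centres each gene within each patient before plotting, so that only the tumour/normal contrast remains. In R, as far as I understand, limma's removeBatchEffect is commonly applied to transformed counts for this purpose, for visualisation only.

```python
# Hypothetical log-scale expression: gene -> patient -> {"NT": value, "TP": value}.
expr = {
    "GENE_A": {"P1": {"NT": 5.0, "TP": 6.0}, "P2": {"NT": 9.0, "TP": 10.0}},
    "GENE_B": {"P1": {"NT": 8.0, "TP": 7.0}, "P2": {"NT": 3.0, "TP": 2.0}},
}


def center_within_patient(expr):
    """Subtract each patient's mean (over NT and TP) from that patient's values, per gene."""
    centered = {}
    for gene, patients in expr.items():
        centered[gene] = {}
        for patient, pair in patients.items():
            baseline = sum(pair.values()) / len(pair)
            centered[gene][patient] = {
                sample_type: value - baseline for sample_type, value in pair.items()
            }
    return centered


centered = center_within_patient(expr)

for gene, patients in centered.items():
    for patient, pair in patients.items():
        print(gene, patient, pair)
```

After centring, the two hypothetical patients look identical for each gene even though their raw baselines differ, which is the kind of picture I would expect a paired heatmap to show. The centred values are only for plotting; the raw counts with patient in the design would still go into DESeq2.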
I wondered whether the design formula I used for DESeq2 was failing to account for the patient effect, so I tried removing patient as a batch effect first with ComBat_seq and then using the adjusted count matrix as input for differential expression, with just sample_type in the design formula.
Now the heatmap loses the pattern of similarity between patients and looks closer to what I expected, with some gene expression differences showing up between tumour and normal.
I don't think removing the effect of patient like a batch effect is a valid approach, but I'm not entirely sure what else to do, since I've already tried to account for patient in my DESeq2 design formula, which didn't seem to entirely remove the effect. Does anyone have any suggestions?
Thank you for your time!
Thank you for your input! Just an update: I used svaseq instead of ComBat_seq to remove unknown and unwanted variation, and then added the first surrogate variable to my design (i.e. ~ SV1 + patient + sample_type), as per this workflow: http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#removing-hidden-batch-effects