I am currently working with a TCGA dataset from the UCSC Xena browser. I have completed the differential gene expression analysis using DESeq2 from GenePattern for one cancer dataset. I have some questions about the results and would like to know how to proceed with the analysis across different cancer samples. I am very new to TCGA and am currently doing the analysis based on
My questions about the analysis are as follows:
1. The HTSeq count file downloaded from Xena has transcript IDs, but I want gene IDs for my analysis. How should I do this?
2. To generate a heatmap of the DEGs from different cancer datasets, should I use the log2 expression values from DESeq2?
Regarding 1) You should check exactly how this file was created. If it is indeed transcript-level, aggregate it to the gene level with tximport, whose output you can then feed seamlessly into DESeq2. Check the respective manuals; the code is given there.
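To illustrate what aggregating to the gene level means, here is a minimal base R sketch. It is not tximport itself: it only sums transcript counts per gene with rowsum(), which is a simplification. The IDs, the counts and the tx2gene mapping table (with hypothetical column names TXNAME and GENEID) are made up for the example; for real data you would build the mapping from the annotation that was used to create the file.

```r
# Toy transcript-level count matrix (hypothetical IDs and values)
tx_counts <- matrix(
  c(10, 5, 0, 20,   # sampleA
     3, 7, 1, 15),  # sampleB
  nrow = 4,
  dimnames = list(c("tx1", "tx2", "tx3", "tx4"), c("sampleA", "sampleB"))
)

# Hypothetical transcript-to-gene mapping table
tx2gene <- data.frame(
  TXNAME = c("tx1", "tx2", "tx3", "tx4"),
  GENEID = c("geneX", "geneX", "geneY", "geneY"),
  stringsAsFactors = FALSE
)

# Look up the gene for every transcript row and make sure none is unmapped
gene_ids <- tx2gene$GENEID[match(rownames(tx_counts), tx2gene$TXNAME)]
stopifnot(!anyNA(gene_ids))

# Sum the counts of all transcripts that belong to the same gene
gene_counts <- rowsum(tx_counts, group = gene_ids)
print(gene_counts)
```

The resulting matrix has one row per gene and can be used as the count input for DESeq2 in the usual way.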
2) I would use Z-transformed log2 expression values for clustering. These could be the log2-transformed values from DESeq2 itself, or you could run vst or rlog on the raw gene-level data again. The output of the latter two is already on the log2 scale. Given a data frame or matrix of these values with genes in rows and samples in columns (called fc.matrix here), you can run t(scale(t(fc.matrix))) to get the Z-scores. This focuses the clustering on the relative differences between the samples for each gene and makes it less sensitive to outliers, e.g. some genes showing extreme values, because the Z-scores are a relative measure that indicates how much each sample diverges from the mean of all samples for that gene. See e.g. the Wikipedia article on Z-scores (standardization).
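As a small sketch of the Z-transformation in base R: the matrix below is simulated and its name log_expr is just a placeholder. With real data it would hold the log2-scale values for your DEGs, for example taken from the vst or rlog output.

```r
set.seed(1)

# Simulated log2-scale expression matrix: genes in rows, samples in columns
log_expr <- matrix(
  rnorm(12, mean = 8, sd = 2),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("sample", 1:4))
)

# scale() standardizes columns, so transpose to put genes in columns,
# scale them, and transpose back to the genes x samples layout
z <- t(scale(t(log_expr)))

# After the transformation every gene has mean 0 and standard deviation 1
print(round(rowMeans(z), 10))
print(apply(z, 1, sd))
```

The matrix z can then be passed to whatever heatmap function you prefer; because each gene is centered and scaled, the colors show in which samples a gene is above or below its own average.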
There is no need to SHOUT. I have removed the excessive uppercase letters from your title.