Hello all,
I would like to ask for your opinion, so first let me explain my problem. I have about 1000 genes of interest, all of them transcription factors. I want to look at their expression profiles and run a differential expression analysis between cancer and normal samples. I downloaded the whole TCGA BRCA dataset from GDC, which gave me more than 1000 samples: around 300 normal and the rest cancer.
After running the differential expression analysis with DESeq2, I got only a few differentially expressed genes from my list of genes of interest, which seems odd to me.
Then I tried subsetting the dataset: I used only 10 normal and 10 cancer samples, chosen at random from the samples I had downloaded. I ran DESeq2 again, and the result looked much more normal, with a lot of differentially expressed genes.
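A single random 10-vs-10 draw can be lucky or unlucky, so it may help to repeat the draw with different seeds and compare the resulting gene lists. The following is only a minimal sketch of the sampling step: the sample IDs and group sizes are made-up placeholders, `draw_subset` is a hypothetical helper, and the differential expression step itself is not shown.

```python
import random


def draw_subset(sample_ids, n, seed):
    """Return n sample IDs drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(sample_ids, n))


# Hypothetical sample IDs standing in for the downloaded files (sizes are placeholders).
normal_ids = [f"normal_{i:03d}" for i in range(300)]
tumor_ids = [f"tumor_{i:03d}" for i in range(700)]

# Draw several independent 10-vs-10 subsets. Each pair would then be analysed
# separately to see how stable the list of differentially expressed genes is.
subsets = [
    (draw_subset(normal_ids, 10, seed), draw_subset(tumor_ids, 10, seed))
    for seed in range(5)
]

for normals, tumors in subsets:
    print(normals[:3], tumors[:3])
```

If the gene lists differ a lot from one draw to the next, the 10-vs-10 result is probably not a reliable reference for judging the full-dataset result.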
My questions are:
Why does using many samples give such a weird result (only a few genes are differentially expressed)? Does this mean there are "subtypes" within those 1000+ BRCA samples? My hypothesis is that, because the variance between samples is huge, DESeq2 cannot call differentially expressed genes accurately.
If I want to choose samples with similar gene expression profiles from these 1000+ samples, what is the best method? I know of K-means clustering and hierarchical clustering.
I use a |0.65| threshold for significant up/down-regulation. I agree that a larger number of samples gives a statistically stronger result, but in this case I am wondering whether variation in the tumor gene expression profiles keeps some genes from giving a strong result. That is why I think clustering the samples first might be more useful, so that I can use a smaller group of samples with the most similar expression profiles. A smaller group would also make the analysis quicker.
The TCGA data is from RNA-seq, and the files I downloaded are htseq-count results. I may try an R clustering package and see what it gives, but first I will probably look at the data with a standard method such as PCA or a heatmap.
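A sample heatmap is usually drawn from a sample-to-sample similarity matrix, so one way to get a first look is to compute pairwise correlations on log-transformed counts. This is a rough sketch under stated assumptions: the counts are invented toy values, `log_transform` and `pearson` are hypothetical helpers written for illustration, and log2(count + 1) is used as a simple stand-in for a proper variance-stabilizing transformation.

```python
import math


def log_transform(counts):
    """Apply log2(count + 1) so that highly expressed genes do not dominate."""
    return [math.log2(c + 1) for c in counts]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy raw counts (invented): one list of per-gene counts for each sample.
toy_counts = {
    "normal_1": [10, 200, 35, 0, 80],
    "normal_2": [12, 180, 40, 1, 75],
    "tumor_1": [150, 20, 30, 60, 5],
    "tumor_2": [130, 25, 45, 70, 8],
}

logged = {sample: log_transform(c) for sample, c in toy_counts.items()}
samples = list(logged)

# Full sample-by-sample correlation matrix, stored as a dict keyed by sample pair.
corr = {(a, b): pearson(logged[a], logged[b]) for a in samples for b in samples}

for a in samples:
    print(a, [round(corr[(a, b)], 2) for b in samples])
```

Samples that correlate poorly with the rest of their own group are the ones worth inspecting before deciding whether to cluster or subset.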
Visualizing the data is always a good idea. Another thing you could do is filter out some genes, for example those that can be considered not expressed.
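One common way to define "not expressed" is a minimum read count in a minimum number of samples. The sketch below shows that idea on a plain count table; the gene names, counts, and both cutoffs (`min_count`, `min_samples`) are arbitrary assumptions for illustration, not recommended values, and `filter_low_counts` is a hypothetical helper.

```python
def filter_low_counts(counts_by_gene, min_count=10, min_samples=2):
    """Keep genes with at least min_count reads in at least min_samples samples."""
    return {
        gene: counts
        for gene, counts in counts_by_gene.items()
        if sum(c >= min_count for c in counts) >= min_samples
    }


# Toy raw counts (invented): one list of per-sample counts for each gene.
toy = {
    "GENE_A": [0, 1, 0, 2],       # essentially silent everywhere
    "GENE_B": [50, 80, 10, 120],  # clearly expressed
    "GENE_C": [0, 0, 15, 3],      # passes the count cutoff in one sample only
}

kept = filter_low_counts(toy)
print(sorted(kept))
```

With a large cohort, `min_samples` would normally be scaled to the size of the smaller group instead of being a small fixed number.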
Thank you for your suggestion.