Hi!
My problem concerns categorizing patients into groups based on continuous variables. From previous studies we know that there are continuous differences in the mean expression of two signatures, which are negatively correlated. We are interested in comparing the two extreme groups in terms of differentially expressed genes. Is there a statistical method for determining the cutoff from the data? Maybe some measure of similarity we could use? Would it be reasonable to cluster patients based on those two signatures and choose the extreme groups that way?
Any advice will be appreciated.
Regards, Agata
Instead of using a single mean value for a signature you could try to cluster the samples using the expression of all genes present in the signature. This could help filter out some of the likely noise coming from genes that are part of the signature but that don't vary much in your data. The approach you described is otherwise reasonable.
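A minimal sketch of that idea in Python, assuming NumPy and SciPy are available and that the signature genes sit in a genes x samples matrix of log-expression values. The function name `cluster_samples`, the variance threshold, the choice of Ward linkage, and the toy data are all illustrative assumptions, not a recommendation for any particular dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cluster_samples(expr, min_var=0.1, n_clusters=2):
    """Cluster samples on signature genes (expr: genes x samples, log scale)."""
    expr = np.asarray(expr, dtype=float)
    # Drop signature genes that barely vary across samples (likely noise).
    keep = expr.var(axis=1) > min_var
    filtered = expr[keep]
    # z-score each gene so highly expressed genes do not dominate distances.
    centered = filtered - filtered.mean(axis=1, keepdims=True)
    z = centered / filtered.std(axis=1, keepdims=True)
    # linkage() expects observations in rows, so transpose to samples x genes.
    tree = linkage(z.T, method="ward")
    # Cut the tree into at most n_clusters groups.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return labels, keep


# Toy data (hypothetical): two anti-correlated signatures plus 3 flat genes.
rng = np.random.default_rng(0)
n = 10  # samples per simulated group
sig_a = np.hstack([rng.normal(8, 0.3, (5, n)), rng.normal(4, 0.3, (5, n))])
sig_b = np.hstack([rng.normal(4, 0.3, (5, n)), rng.normal(8, 0.3, (5, n))])
flat = rng.normal(6, 0.05, (3, 2 * n))
expr = np.vstack([sig_a, sig_b, flat])

labels, keep = cluster_samples(expr)
print(labels)      # cluster label per sample
print(keep.sum())  # number of signature genes that passed the variance filter
```

On real data the groups will rarely be this clean, so it is worth inspecting the dendrogram before settling on where to cut it.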
Yes, this was more or less my reasoning: to cluster patients based on all genes in both signatures and then set a cut-off on the branches. Would it then be reasonable to perform transcriptome-wide differential gene expression testing and co-expression analysis between the two groups on data classified this way?
Yes, I think that's the best solution you can get. Also, you don't necessarily have to bin the samples into two groups: you can perform a differential expression analysis looking for genes whose expression correlates with a continuous variable. You can model your data on the gene signature score (in DESeq2 or voom, for example).
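To illustrate what "modelling on a continuous score" means, here is a rough Python sketch that fits an ordinary least-squares line per gene against the signature score. This is not what DESeq2 or voom do internally (they add count modelling, variance moderation, and proper testing); it only shows the shape of the analysis. The gene names, the score values, and the helper `ols_slope_and_r` are hypothetical.

```python
from math import sqrt


def ols_slope_and_r(x, y):
    """Least-squares slope of y on x, plus the Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    # A gene with zero variance has no defined correlation; report 0.0.
    r = sxy / sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, r


# Hypothetical per-sample signature score and log-expression values.
score = [0.0, 1.0, 2.0, 3.0, 4.0]
log_expr = {
    "GENE_UP": [1.0, 2.1, 2.9, 4.2, 5.0],
    "GENE_DOWN": [5.0, 4.0, 3.0, 2.0, 1.0],
    "GENE_FLAT": [3.0, 3.1, 2.9, 3.0, 3.0],
}

results = {gene: ols_slope_and_r(score, values) for gene, values in log_expr.items()}
# Rank genes by strength of association with the score, strongest first.
ranked = sorted(results, key=lambda gene: abs(results[gene][1]), reverse=True)
for gene in ranked:
    slope, r = results[gene]
    print(f"{gene}: slope={slope:.2f}, r={r:.2f}")
```

The slope plays the role of a log fold change per unit of signature score. A real analysis would also need p-values and multiple-testing correction, which this sketch leaves out.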
Thanks a lot! I will definitely take a look into that. In general, the mean expression of those signatures is correlated with the level of differentiation, and what interests me is the possible underlying differences between highly and lowly differentiated tumors. That's why my first thought was to zoom in on the extreme groups.
Also, since we're discussing this, do you think performing co-expression analysis to find networks of genes would make sense? Would it have to be performed on the whole dataset (we have no control group), or rather on the extreme groups in order to compare them? I wonder whether reconstructing networks from the whole dataset would be biased by tissue-specific expression.
The clustering idea is great. If, in addition, you are interested in segregating patients based on the expression of just one gene of interest, then you could literally divide the patients into tertiles, quartiles, quintiles, et cetera, and then compare the top and bottom groups.
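A small sketch of the quantile split in Python, using only the standard library. The sample IDs, the expression values, and the function name `extreme_groups` are made up for illustration; `n_bins=3` would give tertiles, `n_bins=5` quintiles, and so on.

```python
from statistics import quantiles


def extreme_groups(values, n_bins=4):
    """Return (bottom, top) sample IDs for the lowest and highest quantile bins.

    values: dict mapping sample ID -> expression of the gene of interest.
    """
    # quantiles() returns the n_bins - 1 cut points between equal-sized bins.
    cuts = quantiles(list(values.values()), n=n_bins)
    low_cut, high_cut = cuts[0], cuts[-1]
    bottom = sorted(s for s, v in values.items() if v <= low_cut)
    top = sorted(s for s, v in values.items() if v >= high_cut)
    return bottom, top


# Hypothetical expression values for 12 patients.
expression = {f"P{i:02d}": float(i) for i in range(1, 13)}
bottom, top = extreme_groups(expression, n_bins=4)
print("bottom quartile:", bottom)
print("top quartile:", top)
```

The middle bins are simply left out of the comparison, which costs power but sharpens the contrast between the two ends.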
A co-expression network would also help, and it is possible to identify sub-groups (communities or modules) in such networks, and then to see how these sub-groups relate to your clinical variables. On the issue of tissue specificity, it's up to you to ensure that your samples are from the same tissue and that there is no bias in that sense. A good study design guards against biases like that.
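As a very simplified illustration of the module idea, the Python sketch below links genes whose Pearson correlation exceeds a threshold, takes connected components of that graph as modules, and then correlates a per-sample module score with a clinical variable. Dedicated network tools use more refined similarity measures and module detection; the threshold of 0.8, the gene names, the `grade` variable, and the plain mean as module score are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def coexpression_modules(expr, min_r=0.8):
    """Connected components of the graph linking genes with r >= min_r."""
    neighbours = {gene: set() for gene in expr}
    for g1, g2 in combinations(expr, 2):
        if pearson(expr[g1], expr[g2]) >= min_r:
            neighbours[g1].add(g2)
            neighbours[g2].add(g1)
    modules, seen = [], set()
    for gene in expr:
        if gene in seen:
            continue
        # Depth-first walk collects every gene reachable from this one.
        stack, module = [gene], set()
        while stack:
            current = stack.pop()
            if current not in module:
                module.add(current)
                stack.extend(neighbours[current] - module)
        seen |= module
        modules.append(sorted(module))
    return modules


def module_score(expr, module):
    """Per-sample mean expression of the module genes (crude summary)."""
    n_samples = len(expr[module[0]])
    return [sum(expr[g][i] for g in module) / len(module) for i in range(n_samples)]


# Hypothetical expression of four genes across six tumor samples.
expr = {
    "G1": [1, 2, 3, 4, 5, 6],
    "G2": [2, 4, 6, 8, 10, 12.5],
    "G3": [5, 1, 4, 2, 6, 3],
    "G4": [10, 2, 8, 4, 12, 6.5],
}
grade = [1, 2, 3, 4, 5, 6]  # hypothetical clinical variable per sample

modules = coexpression_modules(expr)
for module in modules:
    r = pearson(module_score(expr, module), grade)
    print(module, f"correlation with grade: {r:.2f}")
```

Here the first module tracks the clinical variable and the second does not, which is the kind of module-to-trait relationship you would then follow up with functional analysis.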
About the tissue specificity: the data I have is tumor data only, and we have no controls for it. What I mean is that some of the genes might be co-expressed because of the tissue of origin. I wonder how I could account for that. Should I look for control data in databases (the problem is that sample sizes are usually very small), or can it be considered during functional analysis?