While reading about molecular subtyping strategies for various cancers, I have come across many papers that talk about specific signatures that correlate with particular disease statuses and are defined by a collection of microarray probes.
For example, this paper defines an "EMT signature."
My question is, what is the exact nature of these signatures? Are they specific groups of probes and expected intensities for each of those probes? (btw, extra points to anyone who can direct me to the specific probe set that makes up the EMT signature; I couldn't find it anywhere in the text or the supplement!)
I often see papers compare their microarray data to a given signature and describe the process as simply calculating gene expression signature scores using averaged expression data, or average log intensities, etc. I would really like to pin down exactly what the EMT signature is and learn how to compare my microarrays to that signature to determine whether they fit.
Any help in this endeavor is much appreciated!
UPDATE: I was able to find in another paper a list of up-regulated and down-regulated genes that I guess define the EMT signature. Is that all a signature is? Anyway, I now need to be able to screen my microarray against this signature and statistically report whether it is a match, and I'm not sure how to proceed. One paper, Cristescu et al., describes doing this: "We calculated the gene expression signature scores using the average of log intensity (also known as the geometric average) of expression of genes in the signature." I want to replicate this method, but don't know what it is really saying.
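One way to read that sentence: for each sample, take the log of each signature gene's intensity and average those logs. The mean of the logs equals the log of the geometric mean of the raw intensities, which is presumably why the authors call it the geometric average. Below is a minimal Python sketch of that reading. The gene names and intensities are toy placeholders, not the published signature, and subtracting the mean of the down-regulated genes is my own assumption; the quoted sentence only mentions averaging.

```python
import math
from statistics import mean

def signature_score(intensities, up_genes, down_genes=()):
    """Score ONE sample: mean log2 intensity of the up-regulated genes,
    minus the mean log2 intensity of the down-regulated genes (if any).

    intensities: dict mapping gene symbol -> raw (non-log) intensity.
    Signature genes that are missing from the array are skipped.
    Subtracting the down-regulated mean is an assumption of this sketch.
    """
    up = [math.log2(intensities[g]) for g in up_genes if g in intensities]
    down = [math.log2(intensities[g]) for g in down_genes if g in intensities]
    if not up:
        raise ValueError("no up-regulated signature genes found on the array")
    score = mean(up)
    if down:
        score -= mean(down)
    return score

# Toy data: made-up intensities for placeholder genes.
samples = {
    "tumor_1": {"VIM": 4096.0, "ZEB1": 1024.0, "CDH1": 256.0, "ACTB": 8192.0},
    "normal_1": {"VIM": 512.0, "ZEB1": 256.0, "CDH1": 2048.0, "ACTB": 8192.0},
}
emt_up = ["VIM", "ZEB1"]   # placeholder "up" list
emt_down = ["CDH1"]        # placeholder "down" list

scores = {name: signature_score(vals, emt_up, emt_down)
          for name, vals in samples.items()}
print(scores)
```

If the arrays are already normalized and log-transformed, the `math.log2` step would be dropped and the values averaged directly.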
In the same paper, the authors later go on to explain that they used the EMT signature and another signature (called the MSI signature) to classify some microarrays. They explain, "The distribution tails of MSI and EMT signatures exhibit a mutually exclusive pattern and thus identify the groups of samples in the MSI and EMT groups, respectively." Whatever they did here is what I want to do, since I am looking to classify my samples in the same manner.
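My guess at what "distribution tails" means: compute both signature scores for every sample, look at where each sample falls in the distribution of each score across the cohort, and assign a sample to a group when it sits in the upper tail of one score but not the other. A rough Python sketch of that guess follows. The z-score standardization and the cutoff of 1.0 are hypothetical choices of mine, not anything stated in the paper.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of scores across samples (population SD)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def classify_by_tails(emt_scores, msi_scores, cutoff=1.0):
    """Label each sample by which signature score lies in its upper tail.

    emt_scores, msi_scores: dicts mapping sample name -> signature score,
    with the same sample names in both.
    cutoff: hypothetical z-score threshold defining the "tail".
    """
    names = list(emt_scores)
    emt_z = dict(zip(names, zscores([emt_scores[n] for n in names])))
    msi_z = dict(zip(names, zscores([msi_scores[n] for n in names])))
    labels = {}
    for n in names:
        emt_hi = emt_z[n] > cutoff
        msi_hi = msi_z[n] > cutoff
        if emt_hi and not msi_hi:
            labels[n] = "EMT"
        elif msi_hi and not emt_hi:
            labels[n] = "MSI"
        elif emt_hi and msi_hi:
            labels[n] = "ambiguous"
        else:
            labels[n] = "neither"
    return labels

# Toy scores for six made-up samples.
emt = {"s1": 3.0, "s2": 0.0, "s3": 0.1, "s4": -0.1, "s5": 0.0, "s6": 0.0}
msi = {"s1": 0.0, "s2": 0.1, "s3": 0.0, "s4": -0.1, "s5": 0.0, "s6": 3.0}
labels = classify_by_tails(emt, msi)
print(labels)
```

If the two tails really are mutually exclusive, the "ambiguous" label should almost never appear, and that would be a quick sanity check on my own data.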
In answer to your update, that is usually what people mean by a signature.
Do you have only one microarray, or a set of them?
I have >50 microarrays, each representing a different sample, and I want to screen all samples for the EMT signature.
Do you have any "control" samples such as normal tissue? I don't really know of tools that take raw data and perform gene set analysis without a process of identifying differentially-expressed genes.
Yes, I have normal tissue controls. My difficulty isn't calculating DGE or something, but how to compare the samples to the signature and know whether they are a 'fit' statistically.
You can upload your data to one of these programs, then upload your gene set of choice and perform a gene-set enrichment analysis.
I appreciate the links and am now looking into using their software packages. Is this such a large task that it is difficult to do in R? I figured such an analysis wouldn't take more than a few dozen lines of code. (I'd prefer to do this from scratch myself to ensure I understand everything that is happening to the data)
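To illustrate the scale I have in mind, here is a from-scratch sketch (in Python rather than R, standard library only) of one simple test: compare the mean log expression of the signature genes in a sample against random gene sets of the same size drawn from the same array. This is a plain random-gene-set permutation test, not the algorithm used by GSEA or by any particular package, and the function name, parameters and toy data are all made up for the example.

```python
import random
from statistics import mean

def random_set_pvalue(log_expr, gene_set, n_perm=1000, seed=0):
    """One-sided empirical p-value for ONE sample.

    log_expr: dict mapping gene -> log expression in this sample.
    gene_set: signature genes expected to be highly expressed.
    Returns the fraction of random same-size gene sets whose mean log
    expression is at least as high as the signature's, with the usual
    +1 correction so the p-value is never exactly zero.
    """
    rng = random.Random(seed)
    genes = sorted(log_expr)
    members = [g for g in gene_set if g in log_expr]
    observed = mean(log_expr[g] for g in members)
    hits = 0
    for _ in range(n_perm):
        draw = rng.sample(genes, len(members))
        if mean(log_expr[g] for g in draw) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy array: 100 made-up genes with log expression 0.0, 0.1, ..., 9.9.
log_expr = {f"g{i}": i / 10 for i in range(100)}

top_set = [f"g{i}" for i in range(95, 100)]   # the five highest genes
mid_set = [f"g{i}" for i in range(48, 53)]    # five middling genes

p_top = random_set_pvalue(log_expr, top_set)
p_mid = random_set_pvalue(log_expr, mid_set)
print(p_top, p_mid)
```

A set made of the highest-expressed genes should come out with a very small p-value and a middling set with a large one. A real analysis would also need to deal with down-regulated genes, correlation between genes, and multiple testing across >50 samples, none of which this sketch handles.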
Hmm... Sorry, but I'm not a big R expert. I know that the cogena package allows you to find enrichment within gene clusters, so maybe that's useful for you. I suggest you look at bioconductor.org; I assume there are some packages that can accomplish the task at hand.