I have been looking at the relationship between log-transformed estimated RNA-seq counts and microarray (HG-U133a) gene expression. Luckily, TCGA has both kinds of data for several patients. I began by comparing the preprocessed data sets from TCGA, which are already at the gene level. While pleased with the results for the most part, I decided to download the microarray CEL files and RMA-process them so that I could look at the probe set level. Interestingly, some probe sets have very different distributions even though they are mapped to the same gene.
I am curious why this happens. My first thought was that this has to do with the suffixes of the probe set IDs. I've found information about what the suffixes mean from Affymetrix's webpage. I'm a little confused by what they mean. To be more specific:
_at = all the probes hit one known transcript.
_a = all probes in the set hit alternate transcripts from the same gene
_s = all probes in the set hit transcripts from different genes
...
For HG-U133, the _a designation was not used; an _s probe set on these arrays means the same as an _a on any of the HG-U133 arrays.
This quote is from http://www.affymetrix.com/estore/support/help/IVT_glossary/index.affx, and I'm assuming that the mention of HG-U133 includes HG-U133a. The last sentence is the confusing part. Is it saying that an _s probe on HG-U133 array means the same as an _a probe for the arrays that actually have _a probes?
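To make the suffix question concrete, here is a minimal sketch that pulls the suffix off each probe set ID and groups IDs by it. It assumes IDs are written with a trailing "_at" (so the glossary's "_s" shows up as "_s_at"); the IDs in the usage lines are illustrative only, and the function names are my own.

```python
import re

# Meanings copied from the glossary quote above. Assumption: the glossary's
# "_a" and "_s" appear in real IDs as "_a_at" and "_s_at".
SUFFIX_MEANING = {
    "_at": "all the probes hit one known transcript",
    "_a_at": "all probes hit alternate transcripts from the same gene",
    "_s_at": "all probes hit transcripts from different genes",
}


def probe_set_suffix(probe_set_id):
    """Return the trailing suffix of a probe set ID (e.g. '_s_at'), or None."""
    match = re.search(r"(_[a-z])?_at$", probe_set_id)
    return match.group(0) if match else None


def group_by_suffix(probe_set_ids):
    """Group probe set IDs by their suffix so each class can be inspected."""
    groups = {}
    for probe_set_id in probe_set_ids:
        groups.setdefault(probe_set_suffix(probe_set_id), []).append(probe_set_id)
    return groups


# Illustrative IDs only; substitute the probe sets annotated to your gene.
example_ids = ["201983_s_at", "211550_at", "211607_x_at"]
for suffix, ids in group_by_suffix(example_ids).items():
    print(suffix, SUFFIX_MEANING.get(suffix, "not covered by the quote"), ids)
```

This only labels the probe sets by naming convention; it does not say which transcripts the probes actually align to.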
I suppose my main question is, if I see very different distributions of two probe sets that map to the same gene, what could that mean? If they are measuring the expression of different isoforms/transcripts of the same gene, how can I find out which ones each probe set is measuring?
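As a first check on my side, I can at least ask which probe set tracks the gene-level RNA-seq values across the shared patients. The sketch below is a rough outline of that, assuming one RMA value per patient for each probe set and one estimated count per patient for the gene, in the same patient order. The numbers are made up, the probe set names are placeholders, and the log2(count + 1) transform is just one common choice.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def probe_sets_vs_rnaseq(probe_expr, rnaseq_counts):
    """Correlate each probe set's RMA values with log2(count + 1) RNA-seq values.

    probe_expr: dict mapping probe set ID -> list of RMA values per patient.
    rnaseq_counts: list of estimated counts for the gene, same patient order.
    """
    log_counts = [math.log2(c + 1) for c in rnaseq_counts]
    return {pid: pearson(values, log_counts) for pid, values in probe_expr.items()}


# Made-up toy values for two hypothetical probe sets annotated to one gene.
toy_probe_expr = {
    "probe_set_A": [1.0, 2.0, 3.0, 4.0],
    "probe_set_B": [4.0, 3.0, 2.0, 1.0],
}
toy_counts = [1, 3, 7, 15]
for pid, r in probe_sets_vs_rnaseq(toy_probe_expr, toy_counts).items():
    print(pid, round(r, 3))
```

A probe set that correlates poorly with the gene-level counts is a candidate for measuring a different isoform (or cross-hybridizing), but the correlation alone does not say which transcript it targets.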
Thanks for any insight
Thank you for the links. They are quite helpful. I'm looking at EGFR, and it seems to make sense after looking at the UCSC custom tracks and my distributions at the same time. I think I'll be able to look at exon info from the TCGA data as well to see if the RNA-seq data is picking up the same isoform as the microarrays. However, I'm still not certain about what the "_s" suffix means in either the case of HG-U133a or otherwise.
I also found the verbal explanation confusing, but this page has a figure worth a thousand words.
I've seen this image before, but the link that I've posted made me think that HG-U133a might be an exception to this rule. Just to be clear, does the "_s" mean that it's probing a sequence that can be found in multiple genes? I can't actually find an example of a probe mapping to a gene other than the one listed on the UCSC browser. For example, this probe only mentions chr7:55086725-55275772.