Dear all,
I am using the oligo package to preprocess CEL files from the Human Exon 1.0 ST array. I have summarised the expression data to the transcript cluster level (rma(celfiles, target="core")) and end up with a total of 22011 transcript clusters. I would like to perform gene-level rather than transcript-level analysis, so I need to map the transcript clusters to genes.
I obtained the corresponding annotation with the following command: featureData(exonCore) <- getNetAffx(exonCore, "transcript")
However, when I looked at the annotation in pData(featureData(exonCore))[, c("probesetid", "geneassignment")], a few thousand transcript clusters have no gene assignment at all. That may be a minor issue, but more importantly, many transcript clusters are mapped to several gene symbols: their geneassignment column has many entries.
When I remove the transcript clusters that map to multiple gene symbols, I end up with around 12,000 to 14,000 transcript clusters that map uniquely to a gene. This number seems too low to me; for example, the TCGA exon expression data contains about 18,000 genes.
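For reference, here is a minimal sketch of the filtering step described above, written in base R on a made-up annotation table. It assumes the usual NetAffx layout for geneassignment, where entries are separated by " /// " and the fields within an entry by " // ", with the gene symbol in the second field. The IDs, symbols, and the helper name extract_symbols are hypothetical. Under that assumption, several entries can name the same gene (one per transcript accession), so it may be worth de-duplicating symbols before deciding that a transcript cluster maps to multiple genes.

```r
# Toy stand-in for pData(featureData(exonCore))[, c("probesetid", "geneassignment")].
# All IDs and annotation strings below are invented for illustration.
annot <- data.frame(
  probesetid = c("1001", "1002", "1003", "1004"),
  geneassignment = c(
    "NM_000001 // GENEA // desc // 1p36 // 11 /// ENST000001 // GENEA // desc // 1p36 // 11",
    "NM_000002 // GENEB // desc // 2q11 // 22 /// NM_000003 // GENEC // desc // 2q11 // 33",
    NA,
    "NM_000004 // GENED // desc // 3p21 // 44"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the unique gene symbols named in one
# geneassignment string (assumed layout: entries split by " /// ",
# fields split by " // ", symbol in the second field).
extract_symbols <- function(x) {
  if (is.na(x) || !nzchar(x) || x == "---") return(character(0))
  entries <- strsplit(x, " /// ", fixed = TRUE)[[1]]
  fields <- strsplit(entries, " // ", fixed = TRUE)
  symbols <- vapply(
    fields,
    function(f) if (length(f) >= 2) trimws(f[2]) else NA_character_,
    character(1)
  )
  unique(symbols[!is.na(symbols) & symbols != "---"])
}

symbols <- lapply(annot$geneassignment, extract_symbols)
n_genes <- lengths(symbols)  # distinct gene symbols per transcript cluster

# How many transcript clusters map to 0, 1, 2, ... distinct genes
print(table(n_genes))

# Keep only transcript clusters that map to exactly one distinct gene symbol
unique_map <- data.frame(
  probesetid = annot$probesetid[n_genes == 1],
  symbol = unlist(symbols[n_genes == 1]),
  stringsAsFactors = FALSE
)
print(unique_map)
```

In this toy table the first transcript cluster has two entries but only one distinct symbol, so it is kept as uniquely mapped, whereas counting raw entries would have dropped it.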
Am I using the annotation correctly, or have I already done something wrong here? Is it generally a better strategy to summarise to the probe set level and then represent each gene by its constituent probe sets somehow?
Thanks
Thanks for your insight. Do you recommend summarising from probe level to probe set level (target="probeset") before further summarising the probe set level values to gene level values (for example by taking the mean/median across all probe sets in a gene), or summarising from probe level to transcript level (target="core") before mapping transcripts to genes?
I would recommend summarising from probe set level to gene level, as seen in a comment on a previous post.
Computing Expression From Affymetrix Exon Array Data
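A minimal sketch of the probe set to gene collapse discussed in this thread, taking the median across all probe sets of a gene for each sample. This is base R on a toy matrix, not an oligo function. The expression values, the probe set names, and the ps2gene mapping are all hypothetical; in practice the mapping would have to come from the probe set level annotation.

```r
# Toy probe-set-level expression matrix (rows = probe sets, columns = samples).
# Values are invented; real input would be the output of a probe set level summary.
expr <- matrix(
  c(1, 3, 5, 10,   # sample s1
    2, 4, 6, 20),  # sample s2
  nrow = 4,
  dimnames = list(c("ps1", "ps2", "ps3", "ps4"), c("s1", "s2"))
)

# Hypothetical probe set -> gene symbol mapping (named character vector).
ps2gene <- c(ps1 = "GENEA", ps2 = "GENEA", ps3 = "GENEA", ps4 = "GENEB")

# Group probe set IDs by gene, then take the per-sample median within each group.
groups <- split(rownames(expr), ps2gene[rownames(expr)])
gene_expr <- do.call(
  rbind,
  lapply(groups, function(ps) apply(expr[ps, , drop = FALSE], 2, median))
)

print(gene_expr)  # rows = genes, columns = samples
```

Swapping median for mean gives the other option mentioned in the question; the median is less sensitive to a single outlying probe set.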
Thanks, appreciate this.