241838_at is a significant hit in a gene expression analysis that I am currently working on. Affymetrix annotation provides Gene Symbol for this probe as
"chr6:167330486-167330903 (-)"
with additional notes "This probe set was annotated using the Accession mapped clusters based pipeline to a UniGene identifier using 5 transcripts.".
There is no further annotation available for this probe in ADAPT, GATExplorer or AILUN. As this particular probe is a significant hit, I would like to know how I should report it. I would also like to know how the community is dealing with results based on such ambiguous probes. What could be the reason for Affymetrix to keep such a non-specific probe on the chip (GATExplorer says no genes are mapped to this probe)?
Why not look at the probe alignment itself on Ensembl? In this case the probe is intronic to the processed but noncoding transcript RP1-167A14.2. There are ESTs overlapping the probeset which are likely the source sequence used as evidence for inclusion of the probeset.
Affy tends to put every possible exon on the probesets and let the users puzzle out which ones are real rather than stick to a minimal canonical set of genes which may be proven wrong in the future.
You may also want to check the individual probe values for this probeset and reconcile them with any spurious mismatch alignments with other RNA species that could be causing off-target signal before proceeding further.
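One way to look at the individual probe values is sketched below in Python. It assumes you have already exported the raw per-probe intensities for the probeset into a dictionary keyed by probe; the function names, the "A"/"B" group labels and the tolerance of 1 log2 unit are all hypothetical choices for illustration, not part of any Affymetrix or Bioconductor API. The idea is simply to flag probes whose fold change disagrees with the rest of the probeset, since those are the ones worth checking for off-target alignments.

```python
import math
from statistics import median


def per_probe_log2_fc(probe_values, groups):
    """Mean log2 fold change (group B minus group A) for each probe.

    probe_values: {probe_id: [raw intensity per sample]}
    groups: list of "A"/"B" labels, one per sample, in the same order.
    """
    fold_changes = {}
    for probe_id, values in probe_values.items():
        a = [math.log2(v) for v, g in zip(values, groups) if g == "A"]
        b = [math.log2(v) for v, g in zip(values, groups) if g == "B"]
        fold_changes[probe_id] = sum(b) / len(b) - sum(a) / len(a)
    return fold_changes


def flag_discordant(fold_changes, tolerance=1.0):
    """Return probes whose log2 fold change is more than `tolerance`
    away from the probeset median (tolerance is an arbitrary assumption)."""
    centre = median(fold_changes.values())
    return sorted(p for p, fc in fold_changes.items()
                  if abs(fc - centre) > tolerance)


# Toy intensities, invented purely to show the call pattern.
toy = {
    "probe1": [100, 110, 400, 420],
    "probe2": [200, 190, 800, 780],
    "probe3": [150, 160, 600, 640],
    "probe4": [5000, 5100, 5050, 4950],  # saturated, does not follow the others
}
labels = ["A", "A", "B", "B"]
print(flag_discordant(per_probe_log2_fc(toy, labels)))
```

If most probes agree and one or two do not, it is worth aligning those specific probe sequences to see whether they hit another RNA species.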
As Daniel points out, there has been a drift between "the transcriptome as we thought we knew it" when arrays were designed and "the transcriptome as we know it today" (shameless plug to an early reference where this was called a "Dorian Gray effect").
If you are using Bioconductor to perform the analysis, do consider repeating the same analysis with remapped probes (the MBNI provides regular updates of mappings built against RefSeq and other databases - latest is from July 2010).
I have also found the "customCDFs" (linked above) to be extremely useful. In a recent study I used both the current standard Affymetrix annotations and custom annotations to identify ~100 probesets useful for a specific classification problem. Manual validation of these probesets by alignment to reference genome found that ~10% of the standard probesets no longer work given our current understanding of the transcriptome (the problem is usually ambiguous assignment of probes to multiple loci). CustomCDF annotations had an almost perfect validation rate (unambiguous alignment to expected locus).
One caveat - occasionally the customCDF probesets do not perform as expected. For example, U133A probesets for ESR1. From the standard CDF, only a single probeset out of nine (205225_at) works well for distinguishing ESR1+ from ESR1- patient samples (PMID:17329190). The single customCDF probe set for ESR1 doesn't work either, although alignment to genome doesn't reveal obvious problems. So, in this case, using customCDF will have poor results for an important gene. This experience has led me to use both custom/standard probeset annotations and sort out best probesets downstream.
Basically, I have run the HG-U133_Plus_2.probe_tab file (downloaded from Affy) through my X:Map pipeline to get probe-to-genomic-location mappings. (This is the same as I used to do for ADAPT, except that ADAPT only scanned cDNA sequences.) I take the probe tab file, extract the probes, and then run them all through Bowtie (after generating the Bowtie index for the reference genome of interest).
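For anyone wanting to try something similar, here is a rough Python sketch of the "extract the probes" step only. It is not the X:Map pipeline itself. It assumes the probe_tab file is tab-delimited with a header row containing the columns "Probe Set Name", "Probe X", "Probe Y" and "Probe Sequence"; check your own download, since I am going from memory on the header. The function name and the `probeset:x:y` naming scheme for FASTA records are my own invention.

```python
import csv
import io


def probe_tab_to_fasta(tab_handle, fasta_handle):
    """Write one FASTA record per probe and return the number written.

    Assumes a tab-delimited file with a header row that includes
    "Probe Set Name", "Probe X", "Probe Y" and "Probe Sequence".
    """
    reader = csv.DictReader(tab_handle, delimiter="\t")
    written = 0
    for row in reader:
        # Hypothetical naming scheme: probeset plus array coordinates,
        # so every probe gets a unique, traceable identifier.
        name = "{}:{}:{}".format(
            row["Probe Set Name"], row["Probe X"], row["Probe Y"])
        sequence = row["Probe Sequence"].strip().upper()
        fasta_handle.write(">{}\n{}\n".format(name, sequence))
        written += 1
    return written


# Toy input with a made-up 25-mer, only to show the expected layout.
example = io.StringIO(
    "Probe Set Name\tProbe X\tProbe Y\tProbe Sequence\n"
    "241838_at\t10\t20\tACGTACGTACGTACGTACGTACGTA\n"
)
out = io.StringIO()
probe_tab_to_fasta(example, out)
print(out.getvalue())
```

With real files you would pass open file handles instead of `StringIO` objects. The resulting FASTA can then be aligned with Bowtie using something along the lines of `bowtie -f -a -v 0 <index> probes.fa`, where `-f` marks FASTA input, `-a` reports all alignments and `-v 0` allows no mismatches; treat those settings as a starting point rather than what X:Map actually uses.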
I think the point is that when the U133plus2 chips were designed (from a quick look at NetAffx, I think this probe is from that chip), there were a number of cDNA transcripts, in this case a cluster of them and potentially of unknown function, that the probesets were designed against. Over the course of time, this cluster hasn't become a 'gene' or indeed any particular feature that we would find mapped onto a genome build.
So this boils down to two options really: either you check your probes against a new build of the genome to make sure each one maps to something we recognise as 'real', or you use a remapped CDF file for your analysis (discussed in answers passim).
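As a rough illustration of the first option, the Python sketch below takes a list of alignment hits and sorts every probe into "unique", "multi" or "unmapped", then tallies those per probeset. It assumes you have already parsed your aligner output into `(probe_id, chromosome, strand, position)` tuples and that probe IDs start with the probeset name followed by a colon; both conventions and the function names are hypothetical. It says nothing about whether a unique hit overlaps an annotated feature, which you would still need to check against the gene models.

```python
from collections import Counter, defaultdict


def classify_probes(all_probe_ids, hits):
    """Label each probe by how many genomic locations it aligned to."""
    hit_counts = Counter(probe_id for probe_id, _chrom, _strand, _pos in hits)
    status = {}
    for probe_id in all_probe_ids:
        n = hit_counts.get(probe_id, 0)
        if n == 0:
            status[probe_id] = "unmapped"
        elif n == 1:
            status[probe_id] = "unique"
        else:
            status[probe_id] = "multi"
    return status


def summarise_probesets(status):
    """Count unique/multi/unmapped probes for each probeset."""
    summary = defaultdict(Counter)
    for probe_id, label in status.items():
        probeset = probe_id.split(":", 1)[0]  # assumes "probeset:x:y" IDs
        summary[probeset][label] += 1
    return {probeset: dict(counts) for probeset, counts in summary.items()}


# Invented probe IDs and coordinates, only to show the data shapes.
probe_ids = ["psA_at:1:1", "psA_at:2:2", "psB_at:1:1"]
hits = [
    ("psA_at:1:1", "chr6", "-", 100),
    ("psA_at:2:2", "chr6", "-", 130),
    ("psA_at:2:2", "chr2", "+", 999),  # second locus makes this probe ambiguous
]
print(summarise_probesets(classify_probes(probe_ids, hits)))
```

A probeset where most probes are "multi" or "unmapped" is a candidate for dropping or for replacing with a remapped definition.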
You could check the original IMAGE clones (etc. listed on NetAffx) to see whether they have been quietly sidelined, or indeed map to where you think the probeset should on a genome build.
Personally I report Affy accessions rather than gene names when reporting data. It's up to somebody else (perhaps) to disambiguate the situation. Sometimes these arrays throw up things you would spend more time chasing down than is useful or practical.
Thanks Daniel. This probe is significantly expressed in 2 different sets of experiments with 10 replicates. I am reporting both genes and probe IDs on the results page. For this particular probe I am planning to report using the probe ID.
All the answers are nice and helped me gain new insight into the problem. I will select the answer with the most votes as the best answer by next week.