Question

How To Identify The Method Used To Reduce The Number Of Probesets In A CEL File Obtained From ArrayExpress

11.8 years ago

mtyler.jason ▴ 120

Hi all,

I was going through the gene expression data at http://www.ebi.ac.uk/arrayexpress/experiments/E-TABM-157. It has both the raw CEL data and the processed matrix data. It uses the HG-U133A chip, which has around 22125 probe set IDs. The original CEL file has around 540909 probes, but the processed matrix file has only the 22125 probe sets and their corresponding intensities. I wanted to know how the 540909 probe intensities are reduced to the corresponding 22125 values.

I am confused about how the preprocessing is done. Suggestions?

gene-expression probeset • 3.7k views

updated 11.8 years ago by Obi Griffith 20k • written 11.8 years ago by mtyler.jason ▴ 120

score 3 · Answer 1 · 2013-02-21

This is actually a big question. It is often the case for Affymetrix GeneChip data that both raw (CEL) files and pre-processed data are made available through GEO, ArrayExpress, etc. The CEL file contains intensity values calculated from the actual scanned array images (DAT files). The CEL file together with a CDF file (which describes the layout of an Affymetrix GeneChip array) can be used to assign an intensity value to each probe. However, individual probes are rarely used in downstream analysis. Instead they are usually summarized together at the probe set level.

When Affymetrix designs a GeneChip they target a certain number of specific gene loci and design a set of oligo sequences from an exemplar sequence for each target. Typically there are 11-20 unique oligomeric probes, each 25 bases in length, for each targeted gene or transcript. For each oligo probe which matches the target sequence perfectly (PM probe) there is also a corresponding probe with a single mismatch (MM probe). This design explains how you can have 540909 probes which actually represent 22125 probe sets.

However, there are many different ways to get from probe intensities to probe set summary values. Affymetrix provides algorithms (e.g., MAS5 and PLIER) which combine the values from all PM and MM probes into a single estimate of transcript level for each target. Other popular algorithms ignore MM probes (e.g., RMA) or try to account for hybridization effects related to GC content (e.g., GCRMA). To further complicate matters, several groups have redefined the original Affy probe sets using a more current reference genome and understanding of the transcriptome, producing custom CDF files with different numbers of total probe sets and probes per probe set.
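To make the probe-to-probe-set relationship concrete, here is a minimal Python sketch. It first checks the rough arithmetic using the counts quoted in the question and an assumed 11 probe pairs per probe set (the low end of the 11-20 range), then collapses a handful of made-up probe intensities into one value per probe set. The probe set names, the intensities, and the plain median-of-log2(PM) summary are all illustrative assumptions; this is not MAS5, PLIER, or RMA.

```python
import math
from collections import defaultdict
from statistics import median

# Counts quoted in the question; 11 pairs per set is an assumption.
N_PROBE_SETS = 22125
PAIRS_PER_SET = 11


def expected_probe_count(n_probe_sets, pairs_per_set):
    """Each probe pair contributes one PM probe and one MM probe."""
    return n_probe_sets * pairs_per_set * 2


def summarize_by_probe_set(probe_rows):
    """Collapse probe-level intensities to one log2 value per probe set.

    probe_rows: iterable of (probe_set_id, probe_type, intensity).
    Uses the median of log2(PM) as a stand-in summary and ignores MM probes.
    """
    pm_by_set = defaultdict(list)
    for probe_set_id, probe_type, intensity in probe_rows:
        if probe_type == "PM":
            pm_by_set[probe_set_id].append(math.log2(intensity))
    return {ps: median(values) for ps, values in pm_by_set.items()}


# Made-up intensities for two hypothetical probe sets.
toy_rows = [
    ("set_A", "PM", 256.0), ("set_A", "MM", 64.0),
    ("set_A", "PM", 1024.0), ("set_A", "MM", 128.0),
    ("set_A", "PM", 512.0), ("set_A", "MM", 90.0),
    ("set_B", "PM", 32.0), ("set_B", "MM", 30.0),
    ("set_B", "PM", 8.0), ("set_B", "MM", 16.0),
]

probe_total = expected_probe_count(N_PROBE_SETS, PAIRS_PER_SET)
summary = summarize_by_probe_set(toy_rows)
print(probe_total)  # 486750
print(summary)      # {'set_A': 9.0, 'set_B': 4.0}
```

Under these assumptions the count comes out at 486750, the same order of magnitude as the number of probes in the CEL file. Probe sets with more than 11 pairs and control features would plausibly account for the difference, though that is not checked here.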

For the specific data set you linked to (E-TABM-157), the ArrayExpress citation looks wrong to me. I believe the original paper can be found here. In their methods you can see that they processed with RMA in R/Bioconductor. This is a very common approach.
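To give a feel for what the summarization step of RMA does, here is a small Python sketch of Tukey's median polish applied to one probe set. RMA as implemented in Bioconductor (for example the `rma()` function in the `affy` package) also performs background correction and quantile normalization before this step; both are omitted here. The toy matrix, the function names, and the stopping rule are assumptions for illustration, not the paper's pipeline.

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-6):
    """Fit value = overall + row effect + column effect + residual.

    matrix: rows are probes, columns are arrays, values are log2(PM).
    Returns (overall, row_effects, col_effects, residuals).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [list(row) for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [value - row_med[i] for value in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop once a full sweep no longer moves anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


def probe_set_expression(matrix):
    """One summary value per array: overall level plus that array's effect."""
    overall, _, col_eff, _ = median_polish(matrix)
    return [overall + c for c in col_eff]


# Toy log2(PM) values: 3 probes (rows) x 3 arrays (columns).
# The middle value 15.0 is a deliberate outlier (11.0 would fit the pattern).
toy = [
    [8.0, 10.0, 9.0],
    [9.0, 15.0, 10.0],
    [7.0, 9.0, 8.0],
]
expression = probe_set_expression(toy)
residuals = median_polish(toy)[3]
print(expression)       # [8.0, 10.0, 9.0]
print(residuals[1][1])  # 4.0
```

The outlying probe ends up in the residuals instead of shifting the per-array estimate, which is the reason a median-based fit is used for this step.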

Here are some links which might help you understand more:

score 1 · Answer 2 · 2013-02-21

11.8 years ago

Sebastian Kurscheid ▴ 300

Take a look at the Affymetrix data sheets, e.g. here http://www.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf

Quote:

Comprised of more than 22,000 probe sets and 500,000 distinct oligonucleotide features.

The CEL file contains the signal intensities for these "oligonucleotide features", each summarized to a single probe-level intensity during image processing, using information captured by the Affymetrix scanner when digitizing the original array. The probe-level data are (in my opinion) the only data of interest for further analysis.