Question

Problem Understanding Affymetrix Gene Expression Data

6

14.3 years ago

Pi ▴ 520

Greetings

I am doing a bioinformatics project based on microarray gene expression data, and there are some basic issues I don't fully understand. I was hoping members of this forum might be able to help me. Could you please address the following points in turn:

1. Is there a naming convention for Affy probe sets? This is an example page from GEO, and it seems as though the probe sets have a naming convention, but I cannot figure it out. Some names end in 'at' and others end in 'st'. Many names have '-5' or '-3' or 'M' in them too.
2. How can probes distinguish between mRNA that has and has not been processed (e.g. intron splicing)? Is this possible? I expect most researchers want to know about the processed mRNA (see next point).
3. How do probes in general account for the fact that genes can have specific transcript variants? Does a probe target a sequence common to all isoforms, or do you get different probes for the different transcripts? Examples would be helpful. I presume most researchers want to know which specific transcript variants are present in a cell.
4. How do probes account for sequence variations such as SNPs? A variation within a gene shouldn't affect the level of transcription of the gene (or should it?), but it could affect the binding of a transcript to a target probe. Are probe sequences designed such that they exclude known SNPs?
5. Probe sets contain a set of overlapping probes for a target sequence. Do you expect a target mRNA sequence to bind equally to each of these probes? Does the statistical analysis take into account the 'average' binding of an mRNA to all of the probes in a probe set to give a picture of the expression level of that mRNA?

Thank you for your time

microarray gene affymetrix • 5.4k views

ADD COMMENT • link updated 14.3 years ago by Chris Evelo 10k • written 14.3 years ago by Pi ▴ 520

score 12 · Answer 1 · 2011-04-11

12

14.3 years ago

Chris Evelo 10k

1. You can find some info on probeset naming here. More details about the 5', 3' and middle (M) control probesets can, for instance, be found in this document.
2. Affymetrix has specific arrays for splice analysis. For these you need different labeling and amplification methods, since you want to detect all of the mRNA with the same sensitivity. The best way to determine which splice variants you really have is to use probes that span exon junctions. For instance, if you see a signal for the connection between exon 1 and exon 3, you know that there must be an expressed sequence that lacks exon 2.
3. I think answer 2 already answers your 3rd question as well. Normal Affymetrix expression arrays are not splice variant specific, but the full exon arrays are.
4. You are right. Being 1 base off is enough to lower the binding to the short sequences that Affymetrix uses. So normally a single probe in a probeset will not be enough to detect a variant; it will just give a lower signal for that single probe (which is statistically filtered out when the whole set is evaluated). On SNP arrays, which are designed specifically for this purpose, that same effect is used to detect the variants, and this can be done with a high level of precision, although you would normally target DNA and not RNA with those.
5. The overlapping probes definitely do not all give the same signal. And yes, standard processing and statistical evaluation do take this into account and remove a large fraction of the problems that are introduced in this way. On traditional arrays the signal differs first of all because of the amplification (the further you get along the amplification path, the weaker the signal is). The probe sequence itself also influences the signal: things like GC content and the presence of possible hairpins are important, for instance. The latter factors also have an effect on how well (how linearly) you can amplify RNA. This is important because amplification is in fact part of standard Affymetrix processing during the labeling. Let me plug one of our own publications on problems with biopsies because of this, which you can find in BMC Bioinformatics.
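To illustrate why one weak probe has little effect once the whole set is evaluated, here is a minimal sketch that summarizes a single probe set on a single array. The intensities are made up for illustration, and the plain median is only a stand-in for the robust estimators real pipelines use (for example, median polish in RMA or the Tukey biweight in MAS5).

```python
from math import log2
from statistics import mean, median


def summarize_probeset(intensities):
    """Return (mean, median) of the log2 probe intensities of one probe set."""
    logs = [log2(x) for x in intensities]
    return mean(logs), median(logs)


# Hypothetical intensities for an 11-probe set (not real data).
clean = [820, 640, 910, 700, 760, 880, 690, 730, 800, 850, 770]

# Same probe set, but one probe binds poorly, e.g. because of a mismatch.
with_weak_probe = clean.copy()
with_weak_probe[4] = 90

mean_clean, median_clean = summarize_probeset(clean)
mean_weak, median_weak = summarize_probeset(with_weak_probe)

# How far each summary moves (in log2 units) because of the single weak probe.
mean_shift = abs(mean_clean - mean_weak)
median_shift = abs(median_clean - median_weak)

print(f"shift of mean summary:   {mean_shift:.3f}")
print(f"shift of median summary: {median_shift:.3f}")
```

In this toy case the mean-based summary drops noticeably, while the median-based one does not move at all, which is the behaviour a robust summarization step is meant to provide.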

For more details, please check the FAQ page mentioned above and the other Affymetrix FAQ pages; those will probably answer most of your questions.

ADD COMMENT • link 14.3 years ago by Chris Evelo 10k

1

It actually has nothing to do with PCR amplification, because PCR amplification is not used. Rather, the 3' bias comes from the step in which mRNA is converted into cDNA by reverse transcriptase (RT), using oligo-dT to complement the poly-A tail of the mRNA. RT is not very processive, and signal decays the further one gets from the poly-A tail. The amplification step uses T7 RNA polymerase, which linearly amplifies the cDNA from a T7 promoter that was part of the original oligo-dT cDNA primer. T7 is a more processive enzyme than RT, so the bias is mainly introduced in the RT step.

ADD REPLY • link 14.3 years ago by seidel 11k

0

Thank you kindly for your response. 1) The probeset naming link does not mention the significance of the '3' or '5' or 'M'. 2) and 3) Thank you for pointing out full exon arrays. I was not aware of these. 4) I am not interested in detecting variants; I was interested in how the chips 'work around' variants.

ADD REPLY • link 14.3 years ago by Pi ▴ 520

0

5) I was not aware of the amplification issue, and I don't think I fully appreciate it. Doesn't PCR amplify the whole isolated mRNA sequence? Are you saying that PCR gives amplified fragments of different lengths, with more of the shorter fragments? I only have textbook knowledge of PCR.

ADD REPLY • link 14.3 years ago by Pi ▴ 520

0

No, it's not really a PCR problem per se. It is more that degradation of the mRNA sample, which starts at one end, causes the effect. We have a graph and explanation at www.arrayanalysis.org. Click "sample prep controls" and then look for "Overall RNA quality control: RNA degradation plot" (and please be aware that this is just our implementation of an existing Bioconductor module).
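For readers who want a feel for what such a degradation plot summarizes, here is a rough sketch, not the Bioconductor or arrayanalysis.org implementation. It assumes each probe set's probes are ordered from 5' to 3', averages the log2 intensity at each probe position across probe sets, and fits a straight line; the intensities and function names are hypothetical.

```python
from math import log2
from statistics import mean


def position_means(probesets):
    """Mean log2 intensity per probe position across probe sets.

    `probesets` is a list of equal-length lists of intensities,
    each ordered from the 5' end to the 3' end of the transcript.
    """
    n_positions = len(probesets[0])
    return [mean(log2(ps[k]) for ps in probesets) for k in range(n_positions)]


def slope(ys):
    """Least-squares slope of ys against the positions 0, 1, 2, ..."""
    xs = list(range(len(ys)))
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical arrays: signal is flat along the transcript in one,
# and rises toward the 3' end in the other.
intact = [[512, 512, 512, 512], [256, 256, 256, 256]]
biased = [[128, 256, 512, 1024], [64, 128, 256, 512]]

slope_intact = slope(position_means(intact))
slope_biased = slope(position_means(biased))

print(f"intact sample slope: {slope_intact:.2f}")
print(f"biased sample slope: {slope_biased:.2f}")
```

A steeper positive slope means the signal increases more strongly toward the 3' end, which is the pattern the comments above attribute to degradation or to the reverse transcription step.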

ADD REPLY • link 14.3 years ago by Chris Evelo 10k

0

The '3', '5' and M probesets are control sets for these specific regions. I have added a link in the answer to a text describing how you can use those.
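One common way to use these region-specific control sets is to compare the 3' signal with the 5' signal of the same control transcript. The sketch below assumes control IDs that start with `AFFX-` and carry a `5`, `M` or `3` region tag before the final `_at` or `_st`; the regular expression, the function names and the signal values are illustrative assumptions, not part of any Affymetrix tool.

```python
import re

# Assumed pattern: "AFFX-<name>" + "-" or "_" + region (5, M or 3) + "_at"/"_st".
CONTROL_RE = re.compile(
    r"^(?P<base>AFFX-.+?)[-_](?P<region>5|M|3)_(?P<suffix>at|st)$"
)


def parse_control(name):
    """Return (base, region) for a region-tagged control ID, else None."""
    match = CONTROL_RE.match(name)
    if match is None:
        return None
    return match.group("base"), match.group("region")


def three_to_five_ratios(signals):
    """Map each control transcript to its 3'/5' signal ratio (linear scale)."""
    by_base = {}
    for name, value in signals.items():
        parsed = parse_control(name)
        if parsed is None:
            continue  # not a region-tagged control probe set
        base, region = parsed
        by_base.setdefault(base, {})[region] = value
    return {
        base: regions["3"] / regions["5"]
        for base, regions in by_base.items()
        if "3" in regions and "5" in regions
    }


# Hypothetical signal values; the IDs follow the naming seen on human arrays.
signals = {
    "AFFX-HUMGAPDH/M33197_5_at": 4000.0,
    "AFFX-HUMGAPDH/M33197_M_at": 5000.0,
    "AFFX-HUMGAPDH/M33197_3_at": 6000.0,
    "1007_s_at": 350.0,
}

ratios = three_to_five_ratios(signals)
print(ratios)
```

A ratio well above 1 indicates that the 3' end of the control transcript is detected more strongly than the 5' end, consistent with the 3' bias discussed in the comments above.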

ADD REPLY • link 14.3 years ago by Chris Evelo 10k

0

Absolutely true, @Seidel. In fact, both the amplification as used in our paper and the normal labelling procedures are T7 based. From our paper, under "RNA amplification":

"Total RNA isolated from the LV biopsies was amplified for a second round using a protocol largely based on the linear T7-based procedure described by Baugh [16], with some minor modifications, and thereby resembles the current Affymetrix protocol for first round RNA amplification (GeneChip® Two-Cycle cDNA Synthesis, round 1)."

ADD REPLY • link 14.3 years ago by Chris Evelo 10k
