What is the putative mechanism behind the GC content of probes in a microarray causing waviness?
EDIT:
Here is an example of "waviness" as discussed in this paper. Basically (blatantly copying and pasting from the article): genomic waves are not platform-specific and are characterized by wavy patterns of the Log R Ratio, with identical or opposite peaks and troughs.
Log R Ratio (LRR) is defined as follows. For each SNP, let the signal intensities for the A and B alleles be denoted as X and Y, respectively. We can then calculate the R-value as Robserved = X + Y. As a normalized measure of total signal intensity, LRR is then calculated as log2(Robserved/Rexpected), where Rexpected is computed from linear interpolation of the canonical genotype clusters.
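To make the definition concrete, here is a minimal Python sketch of the LRR calculation. It assumes Rexpected is already known for each SNP (the interpolation from genotype clusters is not shown), and the function name and intensity values are made up for illustration.

```python
import math

def log_r_ratio(x, y, r_expected):
    """LRR = log2(R_observed / R_expected), where R_observed = X + Y."""
    r_observed = x + y
    if r_observed <= 0 or r_expected <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(r_observed / r_expected)

# Hypothetical (X, Y, R_expected) triples for three SNPs; not real data.
examples = [(0.5, 0.5, 1.0), (1.0, 1.0, 1.0), (0.25, 0.25, 1.0)]
lrr_values = [log_r_ratio(x, y, r) for x, y, r in examples]
# An LRR of 0 means the total signal matches expectation; positive values
# mean more signal than expected, negative values mean less.
```

A "wave" is then a slow, smooth rise and fall of these LRR values along the chromosome, rather than a flat line around 0.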
It does propose, however, that the quantity of DNA could be responsible for waviness. The role of GC content in this observation might still be an open question.
Can you define waviness? I read the little excerpt at Affy, and I still don't get it (since they don't define it either). High GC content leads to increased Tm of probe hybridization, so perhaps it has something to do with that, although since I don't know what waviness is, I'm not sure how Tm is relevant.
As per the paper referenced by Neilfws: "a genome-wide spatial autocorrelation or ‘wave’ pattern in signal intensity data was described that interferes with accurate CNV detection", which was found to be highly correlated with GC content. It probably does have something to do with Tm.
Summarizing what's already in the comments and adding a few of my own educated guesses, I think that the answer to your question is:
1. The human genome contains large regional variation in GC content -- presumably in "waves".
2. Probes with high GC content bind better to their target sequence (because of the higher Tm), so given the same amount of DNA, the probes with the higher GC content will tend to have a higher intensity.
3. If this tendency is not corrected for, it might be a problem for algorithms that detect copy number changes, because these algorithms detect copy number increases by scanning for chromosomal regions with consistently higher intensity -- but because of (2), these could simply be regions with higher than normal GC content.
"...given the same amount of DNA, the probes with the higher GC will tend to have a higher intensity." Not at equilibrium. The hybridization signal is used to reflect differences in concentration of a given sequence, and those labeled sequences are hybridized long enough to have reached an equilibrium state between the liquid and the hybridization substrate. Also, the paper cited above describes that the waves can occur in either direction for the same locus - thus higher than average GC content can lead to an inference of copy number increases or decreases.
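The answer above worries that an uncorrected GC effect could look like a copy number change. One generic way to picture a correction (a simplified sketch, not the method of any particular CNV caller) is to regress intensity on probe GC content and then look at the residuals. The function name and the numbers below are hypothetical.

```python
from statistics import mean

def gc_residuals(gc, intensity):
    """Fit intensity = a + b * gc by least squares and return the residuals."""
    gc_mean, int_mean = mean(gc), mean(intensity)
    sxx = sum((g - gc_mean) ** 2 for g in gc)
    sxy = sum((g - gc_mean) * (i - int_mean) for g, i in zip(gc, intensity))
    slope = sxy / sxx
    intercept = int_mean - slope * gc_mean
    return [i - (intercept + slope * g) for g, i in zip(gc, intensity)]

# Made-up probes: intensity rises with GC, plus one real gain at the third probe.
gc = [0.3, 0.4, 0.5, 0.6, 0.7]
intensity = [-0.10, -0.05, 0.40, 0.05, 0.10]
residuals = gc_residuals(gc, intensity)
# After removing the linear GC trend, the third probe still stands out,
# while the GC-driven slope across the other probes is gone.
```

Note that a single fixed correction like this cannot explain waves that flip direction between samples, which is the objection raised in the comment above.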
In practice this is sometimes the case, but there are two things to consider. 1.) Probe sets for mRNA expression arrays are typically designed to all be within a very narrow Tm range, and ideally they would all have the same Tm, such that they would have near identical hybridization characteristics. 2.) Array hybridizations are performed below the Tm, to favor probe hybridization, and then the washing of the arrays is designed to wash away imperfect matches. Perfect matches should be preserved, thus even if two probes differ in Tm, they should have intensity based on the original concentration of the complementary molecule, if the stringency of the wash step is not so severe that it disrupts perfect matches. In practice, it's hard to design an entire probe set with a really tight Tm range. To compensate for this, the assays are typically comparative, so that relative intensities can be compared (between differentially labeled samples), rather than absolute intensities between spots, or between separate single channel hybridizations. And of course with genomic tiling arrays, the Tm's vary quite a bit, because the probes are typically picked without bioinformatic optimization for Tm, thus the resulting range is too wide to pick hybridization and wash conditions optimal for the entire probe set without adding non-standard salts to the mix.
Hi seidel, thanks for your comments and I agree completely. The example I was thinking of is junction arrays for detecting alternative splice junctions in which there is limited scope for Tm optimization. Tiling arrays, by virtue of the fact that the probes need to target particular regions, are also under similar constraints and Tm optimization is not always possible which is why I think that the issue is the variation in probe Tm. I agree that there can be other factors at work -- including labeling. Sometimes only one of the four bases is labeled, leading to some rather obvious relationships between a probe's sequence and its intensity bias.
I don't follow your arguments and there's a good chance I don't understand the chemistry but the observation that higher Tm leads to higher probe intensity given the same concentration of the target is certainly true for mRNA expression arrays.
If we're offering up guesses, I'll offer one: I think the biases are introduced at the labeling or amplification step, prior to array hybridization. Because GC rich regions have higher Tm's, the off-rate of oligos used for labeling or amplification of the genome is lower in these regions, and thus subtle differences or bias may be introduced depending on degree of amplification or other factors. I can imagine that the degree of bias between GC-rich and GC-poor regions of the genome will differ between amplification/labeling reactions based on numerous factors, and thus it can occur in either a sample of interest, or the reference sample, and thus it is possible to observe "waves" that can go in either direction at a given locus. The way I think of it is: given a low signal (i.e. some form of bias), and an amplification process, it's easy to see the bias, but hard to always reproduce it the same way.
I find the overall discussion on this matter very interesting! But I think the key problem in my understanding is how a high GC-content region can result in both a low copy number inference and a high copy number inference (Tm alone cannot produce this effect, in my opinion), and I don't understand your proposal about the effect of labeling or amplification. Could you elaborate on that hypothesis?
Perhaps seidel can correct me but from what I understand of seidel's answer, a high GC content can go both ways depending on whether it's the reference or the test sample with high GC content. So, if the reference sample has a higher GC content than the test, we see - all other things being equal - an underestimation of copy number in the test. By contrast, if the test sample has a higher GC content than the reference, we see - all other things being equal - an overestimation of copy number in the test. @seidel am I correct?
Yes, that's almost correct, at least in terms of a larger bias coming from either the test or reference sample depending on a given condition. I found the observation that the waves could go in either direction at a given locus confusing as well, but that's what they say in the paper, and the data in the figure seems to prove it. Since there's a test and a reference sample, it's a comparative assay. The sequences being compared are identical, except for the SNP itself, thus the GC content is the same between them overall. What I meant was this: suppose there is some process that causes bias in labeling or representation of GC rich sequences over GC poor sequences, and the magnitude of that bias varies, let's just say stochastically since we don't know the mechanism. Then if the bias factor for the reference has a different magnitude than for the test sample, the direction and size of the waves would be a reflection of this difference. For example, with a bias factor of 1.2 (GC rich/poor) in one sample and 1.4 in another, after normalization and comparison I would expect the GC rich regions to show the difference.
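The 1.2 versus 1.4 example can be worked through in a few lines of Python. This is only an illustration of the arithmetic, assuming GC-poor regions serve as the baseline (log ratio 0) after normalization; the function name is made up.

```python
import math

def gc_rich_log_ratio(test_bias, reference_bias):
    """log2 ratio expected in GC-rich regions, relative to GC-poor regions.

    Each bias factor is the GC-rich/GC-poor representation ratio in one sample.
    """
    return math.log2(test_bias / reference_bias)

up = gc_rich_log_ratio(1.4, 1.2)    # test more biased: GC-rich regions look gained
down = gc_rich_log_ratio(1.2, 1.4)  # reference more biased: they look lost
same = gc_rich_log_ratio(1.3, 1.3)  # equal bias cancels out in the comparison
```

With these numbers the shift in GC-rich regions is about 0.22 in log2 units, upward or downward depending only on which sample happened to carry the larger bias.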
This 'waviness' is a systematic bias on the whole sample, resulting in waves being able to switch direction on the whole sample and being able to change their variance in the whole sample (i.e. their width on the y-axis). Reading how Log R Ratio is computed, I have the feeling this might come from the normalization by Rexpected. Has anyone tried plotting Robserved and Rexpected separately to see how they behave?
Ok this has been interesting and I think seidel has the most plausible explanation in terms of the off-rate of the oligos introducing bias through amplification or labeling, and that this can occur to either the test or to the reference.
We can borrow some of what has been learned about GC effects in DNA sequencing.
There, fragment abundance in the sample depends greatly on GC content, but in a non-trivial way:
abundance is very high for genomic regions with intermediate GC content, but low for regions with low or high GC content.
Moreover, these curves are different for each sample.
(These effects are believed to be largely caused by the preprocessing, and so might be relevant for library prep of arrays as well.)
This phenomenon could cause waves when you take the ratio of two samples: if the shapes of the two curves differ, the ratio between them will change as a function of GC. Because GC changes gradually along the genome, we get these waves.
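A toy simulation shows how this would play out. Everything here is assumed for illustration: the Gaussian shape of the abundance curves, their peaks and widths, and the sinusoidal GC track along a made-up chromosome.

```python
import math

def abundance(gc, peak, width):
    """Hypothetical unimodal GC-abundance curve (Gaussian shape, assumed)."""
    return math.exp(-((gc - peak) ** 2) / (2 * width ** 2))

# GC content drifting smoothly along a made-up chromosome (period: 50 probes).
gc_track = [0.45 + 0.10 * math.sin(2 * math.pi * p / 50) for p in range(200)]

# Two samples whose curves peak at slightly different GC values.
sample_a = [abundance(g, peak=0.45, width=0.10) for g in gc_track]
sample_b = [abundance(g, peak=0.50, width=0.10) for g in gc_track]

# The log ratio oscillates along the chromosome, tracking the GC wave.
log_ratio = [math.log2(a / b) for a, b in zip(sample_a, sample_b)]

# Swapping which sample is the reference flips every peak into a trough.
flipped = [math.log2(b / a) for a, b in zip(sample_a, sample_b)]

# Identical curves give a flat log ratio: no waves at all.
flat = [math.log2(a / a) for a in sample_a]
```

In this sketch the same GC-rich stretch shows up as a trough in `log_ratio` and as a peak in `flipped`, which matches the observation that waves can go in either direction at a given locus.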
A quick Google search for "GC waviness" turns up plenty of information; in particular http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2577347/.
This article is good, but it doesn't propose a possible mechanism by which GC content causes waviness. Thanks though.
Where did you read this, and why is it relevant?
Where: as an example see http://www.affymetrix.com/support/help/faqs/genotyping_console/copy_number_analysis/faq_4.jsp Why: I'm interested in knowing why DNA with a high GC content would have this effect. In short, interest.