Hi:
I have an Affymetrix gene-level expression matrix (genes in the rows, sample IDs in the columns), and I tried to quantify the variation of the expressed genes using the coefficient of variation (CV). However, when I plotted it I found some pretty unusual CV values, which made me think something is wrong with the gene expression data. Here is what I did in R to compute CV:
SD <- apply(eset_HTA20,1, sd)
CV <- base::sqrt(exp(SD^2)-1)
But when I looked at the range of CV, I found something strange:
> summary(CV)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04753 0.12946 0.16494 0.20181 0.22925 15.00777
Apparently, max(CV) should be less than or equal to 1, but I got 15.00777, which suggests that something is wrong with the gene expression data. The gene expression data was already preprocessed (normalized and background-corrected). I don't know where this problem comes from.
Why I use CV:
I use CV to measure the variation of the expressed genes, and I want to keep the genes that show high variation, but the range of CV values is not reasonable here.
How can I track down this problem? Why do I get CV values greater than one? How can I correct this irregularity? Any strategy or idea to fix the potential problem here?
The Coefficient of Variation is the ratio of the standard deviation to the mean (See Wikipedia).
Thus, I don't really understand what is going on in the code you quote, other than that it looks like someone is trying to get the SD on the original scale (i.e. not the log scale).
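For what it is worth, the expression sqrt(exp(SD^2) - 1) matches the textbook CV of a log-normal variable whose natural log has standard deviation SD. Whether that is what the quoted code intended is an assumption. The sketch below uses simulated data (the values of sigma and meanlog are made up) to compare that formula with sd/mean on the original scale, and shows the rescaling that would be needed if the values were log2 rather than natural log.

```r
set.seed(1)

sigma <- 0.5                                   # assumed sd on the natural-log scale
x <- rlnorm(1e5, meanlog = 2, sdlog = sigma)   # simulated positive intensities

cv_formula   <- sqrt(exp(sigma^2) - 1)         # log-normal CV from the log-scale sd
cv_empirical <- sd(x) / mean(x)                # plain sd/mean on the original scale

# If the data are log2-transformed, the sd must first be rescaled to the
# natural-log scale (multiply by log(2)) before using the same formula.
sd_log2      <- sd(log2(x))
cv_from_log2 <- sqrt(exp((sd_log2 * log(2))^2) - 1)

# Nothing in the formula caps the result at 1: a larger log-scale sd gives CV > 1.
cv_large_sigma <- sqrt(exp(1.5^2) - 1)

print(c(formula = cv_formula, empirical = cv_empirical,
        from_log2 = cv_from_log2, large_sigma = cv_large_sigma))
```

Under these assumptions the first three numbers should agree closely, and the last one is well above 1.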
If your expression data were approximately normally distributed, then you would not expect to have a CV greater than 1, because that would imply that at least about 16% of your data was <0 (which would be extremely low signal on a microarray). This is assuming your data is on a log scale. Otherwise, the data is definitely not normally distributed and all bets are off.
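The link between CV and the share of negative values under a normal model can be checked directly: if X is normal with a positive mean, then P(X < 0) = pnorm(-mean/sd) = pnorm(-1/CV). A minimal sketch follows; the helper name frac_below_zero is made up for illustration.

```r
# Fraction of a normal distribution (with positive mean) that lies below zero,
# expressed as a function of its coefficient of variation.
frac_below_zero <- function(cv) pnorm(-1 / cv)

cvs <- c(0.5, 1, 2, 15)
print(data.frame(cv = cvs, frac_below_zero = frac_below_zero(cvs)))
```

At CV = 1 this gives roughly 0.16, and the fraction approaches but never reaches 0.5 as CV grows.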
Apparently it can be greater than 1. Can you prove any mathematical restriction on the CV you computed? The distribution, however, does not seem to be normal.
I am not sure my understanding is correct. As far as I know, the value of CV should be in (0,1); please correct me if I am wrong. How do people in computational biology use and interpret CV in genomic analysis? Do you have any concrete idea?
I just checked the max and min values in the gene expression data:
I just used the CV formula in R and did the raw computation. I am open to hearing your suggestions. Thanks.
Could you elaborate on the point you are making here? How can I correctly apply CV for gene filtering?
Plus, is there any better way to look at the distribution of the gene expression data? I used PCA and boxplots, and also tried the density plot from limma, but couldn't get a better picture of the gene expression data. Any guidance or ideas from you?
Did you try a histogram of the CV you computed? My point above was that CV is not bounded above by 1.
Yes, I did. I tried a barplot, which kind of makes sense to me. Here is the plot:
Please see How to add images to a Biostars post to add your images properly. You need the direct link to the image, not the link to the webpage that has the image embedded (which is what you have used here).
I've fixed it for you this time.
How many replicates do you have?
I have an Affymetrix gene-level expression matrix (32830 genes, 735 samples, a.k.a. patients). Do you have any idea about this problem? Is there a better strategy to quantify the variation of expressed genes for the sake of gene filtering?
Isn’t CV = sd/mean? Where did your equation come from?
I learned it from this thread.
I want to confirm one important thing: is CV always restricted to (0,1), or can it be greater than one? How does CV work for Affymetrix microarray data? Can you point it out if I am wrong?
It’s just the sd/mean - it could be anything. I think you should use that equation unless your data has been normalized and transformed.
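As a rough illustration of the plain sd/mean definition applied gene by gene, here is a minimal sketch on a small simulated matrix. The names expr_log2 and row_cv, the assumption that the values are log2 intensities, and the top-quartile cutoff are all made up for the example and are not taken from the data in this thread.

```r
# Row-wise coefficient of variation: sd / mean for each gene (row).
row_cv <- function(mat) {
  apply(mat, 1, sd) / rowMeans(mat)
}

set.seed(42)
# Simulated stand-in for a log2 expression matrix: 20 genes x 6 samples.
expr_log2 <- matrix(rnorm(20 * 6, mean = 7, sd = 1), nrow = 20,
                    dimnames = list(paste0("gene", 1:20), paste0("s", 1:6)))

# Assumption: values are log2 intensities, so back-transform to the
# original (strictly positive) scale before taking sd / mean.
expr_linear <- 2^expr_log2
cv <- row_cv(expr_linear)

# Hypothetical filter: keep the most variable quarter of the genes.
keep <- cv >= quantile(cv, 0.75)
filtered <- expr_log2[keep, , drop = FALSE]

print(summary(cv))
print(dim(filtered))
```

The cutoff is arbitrary here; on real data it would need to be chosen to suit the downstream analysis.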
So the value of CV is not restricted to (0,1); could you confirm that?
Yes, the coefficient of variation can take any positive value. As explained before, by definition, the coefficient of variation is the standard deviation divided by the mean of the sample. It is a descriptive statistic, and because it is unit-free it allows comparison of variability between variables on different scales/units. However, in principle, it should only be applied to variables with all positive values, which precludes its use for log data with a mixture of positive and negative values.
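A small made-up example may show why mixed-sign values are a problem for this statistic: when the mean is close to zero the ratio blows up, and adding a constant to the data changes the CV even though the spread is unchanged. The numbers below are invented purely for illustration.

```r
# Made-up log-scale values with mixed signs; their mean is 0.02.
x <- c(-1.0, -0.5, 0.2, 0.6, 0.8)

cv_raw <- sd(x) / mean(x)                  # huge, because the mean is near zero

# Shifting by a constant leaves sd unchanged but changes the mean,
# so the CV changes as well: it is not meaningful on an arbitrary-origin scale.
cv_shifted <- sd(x + 10) / mean(x + 10)

print(c(raw = cv_raw, shifted = cv_shifted))
```

The same spread gives two very different CVs, which is why the statistic is usually reserved for strictly positive, ratio-scale measurements.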
Is there any way to calculate CV for an expression matrix that contains negative values?