Any strategy to find out why the coefficient of variation (CV) of gene expression data is unusual?
5.4 years ago

Hi:

I have an Affymetrix gene-level expression matrix (genes in the rows and sample IDs in the columns), and I tried to quantify the variation of the expressed genes using the coefficient of variation (CV). However, when I plotted the CV I found some pretty unusual values, and I suspect something is wrong with the gene expression data. Here is what I did in R to compute the CV:

SD <- apply(eset_HTA20,1, sd)
CV <- base::sqrt(exp(SD^2)-1)

But when I looked at the range of the CV values, I found something strange:

> summary(CV)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.04753  0.12946  0.16494  0.20181  0.22925 15.00777

Apparently, max(CV) should be less than or equal to 1, but I got 15.00777, which suggests that something is wrong with the gene expression data. The gene expression data were already preprocessed (normalized and background corrected). I don't know where this problem comes from.

Why I use CV:

I use CV to measure the variation of expressed genes and want to keep the genes that show high variation, but the range of CV values is not reasonable here.

How can I track down this problem? Why do I get CV values greater than one? How can I correct this irregularity? Is there any strategy or idea for fixing the potential problem here?

microarray gene-expression CV error • 3.2k views

The Coefficient of Variation is the ratio of the standard deviation to the mean (See Wikipedia).

Thus

M <- apply(eset_HTA20, 1, mean)
SD <- apply(eset_HTA20, 1, sd)
CV <- SD/M

I don't really understand what is going on in the code you quote, other than that it looks like someone is trying to get the SD on the original scale (i.e. not the log scale).
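For context, sqrt(exp(s^2) - 1) is the CV of a log-normal variable whose natural logarithm has standard deviation s. The following is a minimal sketch on simulated data (the names `sigma_ln` and `x` and all parameter values are illustrative assumptions, not taken from the original post). If the matrix is stored on a log2 scale, as is common for processed microarray data, the SD would first need to be multiplied by log(2):

```r
# Simulated log-normal data; sigma_ln is the SD on the natural-log scale (assumed value).
set.seed(1)
sigma_ln <- 0.5
x <- rlnorm(1e5, meanlog = 3, sdlog = sigma_ln)

# CV computed directly on the original (linear) scale.
cv_direct <- sd(x) / mean(x)

# CV recovered from the SD of the natural-log values.
cv_from_ln <- sqrt(exp(sd(log(x))^2) - 1)

# Same data held as log2 values: rescale the SD by log(2) before applying the formula.
cv_from_log2 <- sqrt(exp((sd(log2(x)) * log(2))^2) - 1)

# Theoretical log-normal CV for comparison.
cv_theory <- sqrt(exp(sigma_ln^2) - 1)
```

All four quantities should agree closely here. Applying the formula to log2 SDs without the log(2) factor, or to data that are far from log-normal, would give a different number.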

If your expression data were approximately normally distributed, then you would not expect a CV greater than 1, because that would imply that more than about 16% of your data was < 0 (which would be extremely low signal on a microarray). This assumes your data are on a log scale. Otherwise, the data are definitely not normally distributed and all bets are off.
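For a normal distribution with positive mean, the share of values below zero depends only on the CV: P(X < 0) = pnorm(-1/CV). A quick check, written as a sketch with a hypothetical helper `frac_below_zero`:

```r
# Fraction of a normal distribution (positive mean) that lies below zero,
# expressed as a function of its CV = sd / mean.
frac_below_zero <- function(cv) pnorm(-1 / cv)

# Evaluate at a few illustrative CV values.
frac_below_zero(c(0.2, 0.5, 1, 2))
```

At CV = 1 roughly 16% of the distribution is negative, and the fraction approaches (but never reaches) 50% as the CV grows.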


Apparently it can be greater than 1. Can you prove any mathematical restrictions for the CV you computed? The distribution, however, does not seem to be normal.


I am not sure my understanding is correct. As far as I know, the value of CV should lie in (0,1); please correct me if I am wrong. How do people in computational biology use and interpret CV in genomic analysis? Do you have any concrete idea?

Regarding "Can you prove any mathematical restrictions for the CV you computed?":

I just checked the max and min values in the gene expression data:

> max(expr_mat)
[1] 14.28363
> min(expr_mat)
[1] 0.9365626

I just used the CV formula in R and did the raw computation. I am open to hearing your suggestions. Thanks

> a <- rnorm(1000,0,10)
> sqrt(exp(sd(a)^2)-1)
[1] 1.433914e+23

Could you elaborate on the point you are making here? How can I correctly apply CV for gene filtering?

Also, is there a better way to look at the distribution of gene expression data? I used PCA and boxplots, and also tried a density plot from limma, but couldn't get a better picture of the gene expression data here. Do you have any guidance or ideas?


Did you try a histogram of the CV you computed? My point above was that the CV is not bounded above by 1.
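A small counterexample makes this concrete. As a sketch with made-up numbers (not taken from the expression matrix), here is a sample of non-negative values whose sd/mean exceeds 1:

```r
# Non-negative toy sample: four zeros and one large value.
x <- c(0, 0, 0, 0, 10)

# mean(x) = 2 and sd(x) = 10 / sqrt(5), so the ratio is sqrt(5).
cv <- sd(x) / mean(x)
cv  # about 2.236
```

More generally, a sample of size n with a single non-zero value has CV equal to sqrt(n) when the sample standard deviation (n - 1 denominator) is used.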


Yes, I did. I tried a barplot, which kind of makes sense to me. Here is the barplot


Please see How to add images to a Biostars post to add your images properly. You need the direct link to the image, not the link to the webpage that has the image embedded (which is what you have used here).

I've fixed it for you this time.


How many replicates do you have?


I have an Affymetrix gene-level expression data matrix (32830 genes, 735 samples, i.e. patients). Do you have any idea about this problem? Is there a better strategy to quantify the variation of expressed genes for the sake of gene filtering?


Isn't CV = sd / mean? Where did your equation come from?


I learned it from this thread.


I want to confirm one important thing: is the CV always restricted to (0,1), or can it be greater than one? How does CV work for Affymetrix microarray data? Can you point out where I am wrong?


It’s just the sd/mean - it could be anything. I think you should use that equation unless your data has been normalized and transformed.


So the value of CV is not restricted to (0,1); could you confirm that?


Yes, the coefficient of variation can take any positive value. As explained before, by definition, the coefficient of variation is the ratio of the standard deviation to the mean of the sample. It is a descriptive statistic, and because it is unit free it allows comparison of variability between variables on different scales/units. However, in principle, it should only be applied to variables whose values are all positive, which precludes its use for log data with a mixture of positive and negative values.
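To illustrate why the CV breaks down on log-scale data, here is a sketch on a simulated log2 matrix (the object `log_expr`, its dimensions, and the 25% cutoff are assumptions for illustration, not from this thread). It compares the CV computed on log values with two alternatives one might consider: the row SD on the log scale, or the CV after back-transforming with 2^x.

```r
set.seed(42)

# Simulated log2 expression: 100 genes x 20 samples, with row means from -2 to 8.
gene_means <- seq(-2, 8, length.out = 100)
log_expr <- matrix(rnorm(100 * 20, mean = rep(gene_means, times = 20), sd = 0.5),
                   nrow = 100)

# CV on log values: unstable, and negative whenever the row mean is below zero.
cv_log <- apply(log_expr, 1, sd) / rowMeans(log_expr)

# Alternative 1: SD on the log scale, which is unaffected by multiplicative
# rescaling of the original intensities.
sd_log <- apply(log_expr, 1, sd)

# Alternative 2: CV on the back-transformed (linear) scale, where all values are positive.
lin_expr <- 2^log_expr
cv_lin <- apply(lin_expr, 1, sd) / rowMeans(lin_expr)

# Keep the most variable quarter of genes by log-scale SD (arbitrary cutoff).
keep <- sd_log >= quantile(sd_log, 0.75)
```

In this simulation `range(cv_log)` spans negative and very large values for genes whose mean is close to zero, while `cv_lin` stays positive. Which measure is appropriate for filtering depends on the data and the downstream analysis.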


Is there any way to calculate the CV for an expression matrix that contains negative values?
