Question

How to quantify positive and negative correlated genes from gene expression data?

0

Entering edit mode

6.0 years ago

Jurat Shahidin ▴ 100

I am working with Affymetric microarray gene expression data for gene filtering. Essentially, I am interested in to extract highly correlated genes by measuring its Pearson correlation, and I got a correlation matrix by doing that. However, for filtering high correlated genes (a.k.a, features), I selected correlation coefficient arbitrarily, I don't know how to choose the optimal threshold for filtering. I am thinking about to quantify positive and negative correlated features first, then get credible figures to set up a threshold for filtering genes. Can anyone point me out how to quantify positive and negative correlated features from the correlation matrix? Is there any efficient way to select the optimal threshold for filtering highly correlated features?

reproducible data

Here is the reproducible data that I used whereas row is the number of samples, the column in the number of raw features.

  > dput(my_df)
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", 
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01", 
"Tarca_051_P1E03", "Tarca_063_P1F03"), GA = c(11, 15.3, 21.7, 
26.7, 31.3, 32.1, 19.7, 23.6), `1_at` = c(6.06221469449721, 5.8755020052495, 
6.12613148162098, 6.1345548976595, 6.28953417729806, 6.08561779473768, 
6.25857984382111, 6.22016811759586), `10_at` = c(3.79648446367096, 
3.45024474095539, 3.62841140410044, 3.51232455992681, 3.56819306931016, 
3.54911765491621, 3.59024881523945, 3.69553021972333), `100_at` = c(5.84933778267459, 
6.55052475296263, 6.42187743053935, 6.15489279092855, 6.34807354206396, 
6.11780116002087, 6.24635169763079, 6.25479583503303), `1000_at` = c(3.5677794435745, 
3.31613364795286, 3.43245075704917, 3.63813996294905, 3.39904385276621, 
3.54214650423219, 3.51532853598111, 3.50451431462302), `10000_at` = c(6.16681461038468, 
6.18505928400759, 5.6337568741831, 5.14814946571171, 5.64064316609978, 
6.25755205471611, 5.68110995701518, 5.14171528059565), `100009613_at` = c(4.44302662142323, 
4.3934877055859, 4.6237834519809, 4.66743523288194, 4.97483476597509, 
4.78673497541689, 4.77791032146269, 4.64089637146557), `100009676_at` = c(5.83652223195279, 
5.89836406552412, 6.01979203584278, 5.98400432133011, 6.1149144301085, 
5.74573650612351, 6.04564052289621, 6.10594091413241)), class = "data.frame", row.names = c("Tarca_001_P1A01", 
"Tarca_013_P1B01", "Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", 
"Tarca_061_P1F01", "Tarca_051_P1E03", "Tarca_063_P1F03"))

my attempt:

Here is my attempt to get the Pearson correlation matrix and filter out highly correlated features (here I just used correlation coefficient which was chosen arbitrarily):

target <-  my_df$GA
raw_feats <- my_df[,-c(1:2)]

corr_df = do.call(rbind,
                  apply(raw_feats, 2, function(x){
                      temp = cor.test(target, as.numeric(x),
                                      alternative = "two.sided",method = "pearson")
                      data.frame(t = temp$statistic, p = temp$p.value,
                                 cor_coef=temp$estimate)
                  }))

then I selected correlation coefficient arbitrarily as the default threshold for filtering highly correlated features.

indx <- which(corr_df$cor_coef > 0.0785 | corr_df$cor_coef<=-0.01)
mydf_new <- my_df[indx,]

I think doing this way is not accurate. any idea?

I am curious about how to quantify positive and negative correlated features, then find out optimal threshold value for filtering. How can I make this happen? any efficient way to quantify pos/neg correlated features? How can I select optimal correlation coefficient values as the threshold for filtering? any thought? Thanks

R correlation gene filtering gene-expression • 2.5k views

ADD COMMENT • link 6.0 years ago by Jurat Shahidin ▴ 100

score 2 · Answer 1 · 2019-07-20

This question has come up before - have you taken a look at the "Similar posts" list on the right hand side of this post? Any threshold you pick will ultimately be arbitrary, but what you want to do is select a threshold that provides you an estimate and bound on the the false-positive rate (type I error) among pairs of genes that exceed the threshold. Your cor.test() call provides a p-value that estimates that - the probability of seeing a Pearson correlation as extreme under a null hypothesis. So filtering on corr_df$p < X where X is the false-postive rate that you are willing to tolerate would be one option to find either positive or negatively correlated pairs that are unlikely to be due to chance variation. Filtering according to the corr_df$cor_coef value as you've done above doesn't provide any clear estimate of the false positive rate among pairs that pass your filter, and you should not select upper and lower filter bounds like (0.0785, -0.01) that are not symmetrical around zero unless you have some very specific reason for that. Other approaches mentioned in the "Similar posts" list including doing a randomization test to estimate a threshold that would give you a false-positive rate that you choose: that's an alternative to using cor.test() that would make sense if your data don't match the assumptions of the cor.test method.