I am working with Affymetric microarray gene expression data for gene filtering. Essentially, I am interested in to extract highly correlated genes by measuring its Pearson correlation, and I got a correlation matrix by doing that. However, for filtering high correlated genes (a.k.a, features), I selected correlation coefficient arbitrarily, I don't know how to choose the optimal threshold for filtering. I am thinking about to quantify positive and negative correlated features first, then get credible figures to set up a threshold for filtering genes. Can anyone point me out how to quantify positive and negative correlated features from the correlation matrix? Is there any efficient way to select the optimal threshold for filtering highly correlated features?
reproducible data
Here is the reproducible data that I used whereas row is the number of samples, the column in the number of raw features.
> dput(my_df)
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01",
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01",
"Tarca_051_P1E03", "Tarca_063_P1F03"), GA = c(11, 15.3, 21.7,
26.7, 31.3, 32.1, 19.7, 23.6), `1_at` = c(6.06221469449721, 5.8755020052495,
6.12613148162098, 6.1345548976595, 6.28953417729806, 6.08561779473768,
6.25857984382111, 6.22016811759586), `10_at` = c(3.79648446367096,
3.45024474095539, 3.62841140410044, 3.51232455992681, 3.56819306931016,
3.54911765491621, 3.59024881523945, 3.69553021972333), `100_at` = c(5.84933778267459,
6.55052475296263, 6.42187743053935, 6.15489279092855, 6.34807354206396,
6.11780116002087, 6.24635169763079, 6.25479583503303), `1000_at` = c(3.5677794435745,
3.31613364795286, 3.43245075704917, 3.63813996294905, 3.39904385276621,
3.54214650423219, 3.51532853598111, 3.50451431462302), `10000_at` = c(6.16681461038468,
6.18505928400759, 5.6337568741831, 5.14814946571171, 5.64064316609978,
6.25755205471611, 5.68110995701518, 5.14171528059565), `100009613_at` = c(4.44302662142323,
4.3934877055859, 4.6237834519809, 4.66743523288194, 4.97483476597509,
4.78673497541689, 4.77791032146269, 4.64089637146557), `100009676_at` = c(5.83652223195279,
5.89836406552412, 6.01979203584278, 5.98400432133011, 6.1149144301085,
5.74573650612351, 6.04564052289621, 6.10594091413241)), class = "data.frame", row.names = c("Tarca_001_P1A01",
"Tarca_013_P1B01", "Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01",
"Tarca_061_P1F01", "Tarca_051_P1E03", "Tarca_063_P1F03"))
my attempt:
Here is my attempt to get the Pearson correlation matrix and filter out highly correlated features (here I just used correlation coefficient which was chosen arbitrarily):
target <- my_df$GA
raw_feats <- my_df[,-c(1:2)]
corr_df = do.call(rbind,
apply(raw_feats, 2, function(x){
temp = cor.test(target, as.numeric(x),
alternative = "two.sided",method = "pearson")
data.frame(t = temp$statistic, p = temp$p.value,
cor_coef=temp$estimate)
}))
then I selected correlation coefficient arbitrarily as the default threshold for filtering highly correlated features.
indx <- which(corr_df$cor_coef > 0.0785 | corr_df$cor_coef<=-0.01)
mydf_new <- my_df[indx,]
I think doing this way is not accurate. any idea?
I am curious about how to quantify positive and negative correlated features, then find out optimal threshold value for filtering. How can I make this happen? any efficient way to quantify pos/neg correlated features? How can I select optimal correlation coefficient values as the threshold for filtering? any thought? Thanks
@Ahill :
Could you provide possible coding demonstration what you were pointing out here? Any possible elaboration? How can I quantify a number of pos/neg correlated features? which posts were you referring to me? Can you specify
how to select a threshold that provides an estimate and bound on the false-positive rate
? Thanks a lot