I have an Affymetrix gene expression matrix on which I intend to do gene filtering. I computed the correlation between the gene expression matrix and the target pheno data, and tried different thresholds to keep the highly correlated genes, but I have not found a solution that works well.
Is there an efficient way to select a threshold for gene filtering? Any ideas would be appreciated.
Reproducible data:
I produced a reproducible example for the gene expression data and pheno data below:
persons_df <- data.frame(person1 = sample(1:20, 10, replace = FALSE),
                         person2 = as.factor(sample(10)),
                         person3 = sample(1:25, 10, replace = FALSE),
                         person4 = sample(1:30, 10, replace = FALSE),
                         person5 = as.factor(sample(10)),
                         person6 = as.factor(sample(10)))
row.names(persons_df) <- letters[1:10]
In persons_df, the features (genes) are in the rows and the persons are in the columns.
The pheno metadata is below:
age_df <- data.frame(personID = colnames(persons_df),
                     age = sample(1:50, 6, replace = FALSE))
My objective:
I want to keep the features (genes, in the rows) that show a high correlation with age from age_df.
My solution for filtering:
corr_df = do.call(rbind,
                  apply(persons_df, 1, function(x){
                    temp = cor.test(age_df$age, as.numeric(x))
                    data.frame(t = temp$statistic, p = temp$p.value,
                               cor_coef = temp$estimate)
                  }))
indx <- which(abs(corr_df$p)>0.15 &(upper.tri(corr_df$cor_coef)), arr.ind = TRUE)
indx <- unique(c(indx[,1], indx[,2]))
corr_genes <- eset_HTA20[indx,]
But when I subset the original gene expression matrix, I am left with empty output. Is there a problem with my indexing and subsetting of the gene expression matrix? Can anyone point out my mistake, if there is one?
Question:
What is the best strategy to keep highly correlated genes? How can I pick a decent threshold for filtering, for example the p-value alone, or both the t-value and the p-value? Can anyone guide me on how to set a reasonable threshold for gene filtering? Thanks a lot.
Thanks for your reply. I am experimenting with a public gene expression matrix dataset from here. Could you point me to any concrete strategies or approaches I could try for the gene filtering task? Do you think my way of indexing and subsetting the expression matrix is problematic? What is the proper way of setting a threshold (the p-value, the t-value, both, or the correlation coefficient)? Is there any feasible approach to do this? Thanks.
I think that you may want to look at this line... it does not seem to do what you are expecting it to do(?)
Evaluate each part separately and you'll see. Also, with this, abs(corr_df$p), you are taking absolute p-values > 0.15 - are you sure that you want to do that?
Dear Kevin:
Thanks for your help. What I want to do is to keep the genes which have a high correlation with age. Are there any corrections to what I've done with the correlation analysis, the indexing, and the subsetting of the original gene expression matrix? Any guidance would be much appreciated.
Well, just take a look at that line to which I pointed you... it is not doing what you want to do.
Just on the first part of it (abs(corr_df$p)>0.15), there is no need to obtain the absolute value of a p-value because one can never have a negative p-value. Also, by filtering p-value > 0.15, you are including the correlations that are not statistically significant.
The second part of it, upper.tri(), is a function usually applied to a data matrix of, e.g., correlation values, and not to a summary table of stats values like you are using it (?).
To get the statistically significant correlates, you just need to do:
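A minimal sketch of such a filter, assuming corr_df has one row per gene as in the question. The toy matrix expr, the age vector, the planted gene "a" and the seed are illustrative assumptions, not the poster's real data, and this is not necessarily the exact code that was originally posted:

```r
set.seed(1)  # assumption: fixed seed so the toy data is reproducible

# Toy expression matrix: 10 genes (rows) x 6 persons (columns), all numeric
expr <- matrix(rnorm(60), nrow = 10,
               dimnames = list(letters[1:10], paste0("person", 1:6)))
age <- c(23, 35, 41, 52, 60, 68)

# Make gene "a" strongly age-dependent so that at least one gene passes
expr["a", ] <- age / 10 + rnorm(6, sd = 0.1)

# One cor.test per gene, collected into a table with one row per gene
corr_df <- do.call(rbind, lapply(rownames(expr), function(g) {
  res <- cor.test(age, expr[g, ])
  data.frame(gene = g,
             t = unname(res$statistic),
             p = res$p.value,
             cor_coef = unname(res$estimate),
             stringsAsFactors = FALSE)
}))

# Evaluating the second part of the original condition on its own:
# upper.tri() treats the vector as a one-column matrix, so nothing is selected
any(upper.tri(corr_df$cor_coef))   # FALSE

# Keep the statistically significant correlates; no abs() or upper.tri() needed
keep <- corr_df$gene[corr_df$p <= 0.05]

# Subset the expression matrix by gene name
corr_genes <- expr[keep, , drop = FALSE]
```

Subsetting by gene name rather than by position avoids mismatches if the stats table and the expression matrix are ever ordered differently.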
Otherwise good coding, as pointed out by Genomax.
Dear Kevin:
Thanks for pointing that out. Could you give me your opinion on my correlation analysis for gene filtering above? What do you think is a decent way to do gene filtering here (I used Affymetrix gene expression data from here)? What else can I try for gene filtering? Thank you again for your help.
There really are no standards.... no standards in anything in bioinformatics.
I do not know what your end goal is, so I cannot comment much further. However, we have addressed your question:
Just use p-value <= 0.05 for starting off. These represent the statistically significant correlates.
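On whether to threshold on the p-value, the t-value, or the correlation coefficient: for the default Pearson test in cor.test(), with the same number of samples n for every gene, the statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2) and the two-sided p-value follows from |t|, so all three rank the genes identically. The sketch below checks this on toy data; the matrix, the ages and the seed are assumptions for illustration only:

```r
set.seed(42)  # assumption: fixed seed for a reproducible toy example
n <- 6
age <- sample(20:70, n)
expr <- matrix(rnorm(10 * n), nrow = 10,
               dimnames = list(letters[1:10], NULL))

# Per-gene Pearson test: one row per gene with t, p and r
res_mat <- t(apply(expr, 1, function(x) {
  res <- cor.test(age, x)
  c(t = unname(res$statistic), p = res$p.value, r = unname(res$estimate))
}))

# t recomputed from r matches the t reported by cor.test
t_from_r <- res_mat[, "r"] * sqrt(n - 2) / sqrt(1 - res_mat[, "r"]^2)
same_t <- isTRUE(all.equal(unname(t_from_r), unname(res_mat[, "t"])))

# Ranking by ascending p is the same as ranking by descending |r|
same_rank <- identical(order(res_mat[, "p"]), order(-abs(res_mat[, "r"])))

# A two-sided p <= 0.05 cutoff corresponds to a fixed |r| cutoff for this n
t_crit <- qt(1 - 0.05 / 2, df = n - 2)
r_crit <- t_crit / sqrt(n - 2 + t_crit^2)
same_set <- all((res_mat[, "p"] <= 0.05) == (abs(res_mat[, "r"]) >= r_crit))
```

Under these assumptions, combining a p-value cutoff with a t-value cutoff adds nothing beyond the stricter of the two, so a single threshold is enough to start with.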