The permuted data is real data: just your real data, subsampled/shuffled with N samples for Y iterations to estimate the parameters of the distribution you would see if the process were random with N samples. Effectively, that is the baseline (null) distribution for a random process.
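To make that concrete, here is a minimal sketch in R (the vectors x and y are simulated stand-ins for two expression profiles): shuffle one profile many times, recompute the correlation each time, and the shuffled values form your null distribution.

# permutation null for the correlation of two profiles
set.seed(1)
x <- rnorm(20); y <- rnorm(20)               # stand-ins for two real profiles
obs <- cor(x, y)                             # observed correlation
null <- replicate(2000, cor(x, sample(y)))   # 2000 shuffles of one profile
pval <- mean(abs(null) >= abs(obs))          # empirical two-sided p-value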
I recently wrote an R program that does just this. You'll have to tinker with it so it works for your data, but it's somewhere to start.
library(rsgcc)          # cor.pair(): correlation with permutation-based p-values
library(SNFtool)        # standardNormalization()
library(org.Hs.eg.db)   # Ensembl/Entrez/symbol ID mapping
args <- commandArgs(TRUE)
mrna_file  <- args[1]
omics_file <- args[2]
group_name <- args[3]
query      <- args[4]
# args[5] is an optional miRNA file; indexing past the end of args yields NA,
# so no try() wrapper is needed
mirna_file <- args[5]
if(is.na(mirna_file)){
  n_permute <- args[5]
}else{
  n_permute <- args[6]
}
if(is.na(n_permute)){n_permute <- 2000}
n_permute <- as.numeric(n_permute)   # command-line args arrive as character strings
# Map Ensembl row names (version suffix stripped) to Entrez IDs, then collapse
# duplicate Entrez IDs by keeping the most variable row (highest CV).
make_unique <- function(input.data) {
  require(org.Hs.eg.db)
  input.data <- data.frame(cbind(input.data,
      entrez=as.numeric(mapIds(org.Hs.eg.db,
      keys=gsub("\\..*","",rownames(input.data)),
      column="ENTREZID", keytype="ENSEMBL", multiVals="first"))))
  input.data <- input.data[!is.na(input.data$entrez),]   # drop unmapped genes
  # rows whose Entrez ID occurs more than once
  id.dup <- input.data[duplicated(input.data$entrez) |
      duplicated(input.data$entrez, fromLast = TRUE), ncol(input.data)]
  data.dup <- as.matrix(input.data[duplicated(input.data$entrez) |
      duplicated(input.data$entrez, fromLast = TRUE),])
  ezid.dup <- unique(id.dup)
  # rows whose Entrez ID is already unique
  data.unique <- input.data[!input.data$entrez %in% id.dup,]
  rownames(data.unique) <- data.unique$entrez
  data.unique$entrez <- NULL
  # for each duplicated Entrez ID, keep the row with the highest coefficient
  # of variation (CV = sd/mean) across the sample columns
  data.dup2 <- lapply(ezid.dup, function(x) {
    expr <- data.dup[id.dup == x,]
    if(is.matrix(expr)){
      sd.values <- apply(expr[, 1:(ncol(input.data)-1)], 1, sd)
      mean.values <- apply(expr[, 1:(ncol(input.data)-1)], 1, mean)
      CV.values <- sd.values/mean.values
      CV.values[is.nan(CV.values)] <- 0   # 0/0 guard; gsub() coerced the values
                                          # to character, which broke max()
      expr <- expr[which(CV.values == max(CV.values))[[1]],]
    } else {
      expr
    }
  })
  # recombine the de-duplicated rows with the already-unique rows
  merger <- data.frame(do.call(rbind, data.dup2))
  rownames(merger) <- merger$entrez
  merger$entrez <- NULL
  eset <- as.matrix(rbind(data.unique, merger))
  return(eset)
}
# Correlate row 1 (the query gene) against every other row using rsgcc's
# cor.pair(), which reports a permutation-based two-sided p-value.
# Note: the output file name relies on the globals gene_name_query and group_name.
run_tests <- function(data, feature_type, n){
  output <- NULL
  for (x in 2:dim(data)[[1]]){
    #print(paste("current step",x-1,"out of",dim(data)[[1]]-1,sep=" "))
    corpair <- cor.pair(c(1,x), GEMatrix = as.matrix(data),
        rowORcol = "row", cormethod = "PCC", pernum = n,
        sigmethod = "two.sided")
    direction_corr <- if(corpair$cor > 0) "pos" else "neg"
    output <- rbind(output, c(rownames(data)[[1]],
        rownames(data)[[x]], feature_type, corpair$cor,
        abs(corpair$cor), direction_corr, corpair$pvalue))
  }
  colnames(output) <- c("gene","feature","feature_type",
      "correl", "abs_correl", "direction",
      paste0("pvalue_", n, "_permutations"))   # was hard-coded to "2k"
  # tab-separated despite the .csv extension
  write.table(output, file=paste(gene_name_query, "_", feature_type, "_",
      group_name, ".csv", sep=""), row.names = FALSE, sep="\t")
}
# expression matrix: genes in rows (Ensembl IDs), samples in columns
mrna <- read.table(mrna_file, header=TRUE, row.names=1, sep="\t")
mrna <- make_unique(mrna)
genes <- mapIds(org.Hs.eg.db, keys=rownames(mrna),
    column="SYMBOL", keytype="ENTREZID", multiVals="first")
rownames(mrna) <- genes
# translate the Ensembl query ID into a gene symbol and find its row
gene_name_query <- mapIds(org.Hs.eg.db, keys=gsub("\\..*","",query),
    column="SYMBOL", keytype="ENSEMBL", multiVals="first")
query_row_num <- which(rownames(mrna) == gene_name_query)
# one group label per sample, used to subset columns below
group <- as.factor(unlist(read.table(omics_file, header=FALSE, sep="\t")))
# keep only the samples belonging to the requested group
if(group_name != "all"){
  mrna <- mrna[,which(group == group_name)]
}
# place the query gene's profile as the first row of the test matrix
query_data <- data.frame(mrna[query_row_num,])
colnames(query_data) <- rownames(mrna)[[query_row_num]]
query_data <- t(query_data)
if(is.na(mirna_file)){
  # mRNA-only mode: correlate the query gene against all other genes
  data <- rbind(query_data, mrna[-c(query_row_num),])
  run_tests(data, "mrna", n_permute)
}else{
  mirna <- read.table(mirna_file, header=TRUE, row.names=1, sep="\t")
  if(group_name != "all"){
    mirna <- mirna[,which(group == group_name)]
  }
  data <- rbind(query_data, mirna)
  # normalize across samples so mRNA and miRNA values are on comparable scales
  tdata <- t(data)
  ntdata <- standardNormalization(tdata)
  data <- t(ntdata)
  run_tests(data, "mirna", n_permute)   # n_permute is already numeric
}
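For reference, the script expects its arguments in this order (the script and file names below are hypothetical; the Ensembl ID is just an example):

# Rscript correlate_query.R <mrna_file> <omics_file> <group_name> <query> [mirna_file] [n_permute]
# e.g.: Rscript correlate_query.R mrna.tsv groups.tsv all ENSG00000141510 mirna.tsv 2000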
If this isn't helpful, I can put a working example up on GitHub, but that won't be until next week. Just let me know.
Thanks for your answer. I'm not sure, however, that this answers my question, so I'll try to be clearer. I do not have different conditions, i.e. control vs disease. My dataset is actually a time course of 20 samples, and I don't intend to compare correlations between different sets of time points. If I generate a matrix of correlation coefficients by correlating all gene pairs (i.e. a matrix of ~20,000 columns x ~20,000 rows), I can see that the mean of the correlation distribution is 0.5. Now, I have been performing network analysis on this matrix to identify clusters of co-expressed genes, and prior to this I'd like to threshold the data to include only the 'significant' correlations. So, at which point should I consider a correlation 'significant', given that the majority of the genes seem to be correlated with an r of 0.5? I've been advised to permute the data and compare the distribution of the permuted data vs the real one to identify an appropriate threshold. Is this the way to go? Or should I instead calculate p-values based on the distribution of the real data? I hope I made myself clear, and thanks again for your answer.
Cheers
So if you wanted to know how to threshold your correlation matrix, why didn't you ask about that first? You were asking about p-values for correlations instead. For clustering purposes, thresholding the matrix is meant to reinforce the cluster structure. In this case, what you want to do is remove correlations close to 0. What "close to 0" means is up to you; one possibility is to keep those correlations with a low p-value. Permutations can tell you the likelihood of a given correlation value if the genes were randomly assigned profiles from your data. Here, p-values are of little help for assessing relevance: if you have large clusters, the probability of getting high correlation values in the permutations will be high, and if you remove high correlations that have high p-values, you could end up destroying the very cluster structure you're trying to identify. I think you should be guided by the biology behind your data. If you think there are too many genes that are highly correlated, try to understand why.
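If you do try the permutation route, here is a minimal sketch of that baseline (toy dimensions, not your ~20,000 x 20 matrix): shuffle each gene's profile independently to break the gene-gene structure, recompute all pairwise correlations, and take a cutoff from the tail of the resulting null, keeping in mind the caveat above that this null ignores any cluster structure.

# permutation-derived threshold for a correlation matrix
set.seed(1)
expr <- matrix(rnorm(200 * 20), nrow = 200)   # toy stand-in: 200 genes x 20 time points
null_expr <- t(apply(expr, 1, sample))        # shuffle each gene's profile
null_r <- cor(t(null_expr))                   # gene-gene correlations under the null
null_r <- null_r[upper.tri(null_r)]
threshold <- quantile(abs(null_r), 0.99)      # e.g. keep pairs with |r| above this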
OK, thanks. So I guess permutation is the way to go. Cheers.