Hello,
I have a dataset of correlations between genes and OTUs. I want to plot these correlations with the igraph library in R in order to see which genes are correlated with which OTUs. Then I will extract the connected components (each component should represent a genome).
My dataset is huge: the full set of correlations (in the range [-1, 1]) would be 817,000 × 817,000 values, so I can't keep them all and want to apply a threshold. Is there a good way to choose one? For example, is it meaningful to keep only the correlations > 0.9? Doing so leaves more than 58 million correlations and creates 9152 components.
Another question is whether I should keep only the OTU-gene correlations. Is it still meaningful to keep the OTU-OTU and gene-gene correlations? If I keep only the OTU-gene correlations > 0.8, I am left with more than 1.1 million correlations, and that creates only 89 components.
Thanks
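To make the threshold question concrete, here is a minimal sketch of how the number of connected components could be compared across cutoffs, with and without the OTU-OTU and gene-gene edges. The toy edge list, the helper name count_components_at, and the assumption that every OTU name starts with "OTU" are illustrative only; the sketch does not say which cutoff is appropriate for the real data.

```r
library(igraph)

# Toy edge list standing in for the real correlation table (made-up values).
edges <- data.frame(
  var1 = c("OTU1", "OTU1", "OTU2", "OTU2", "OTU3", "OTU1"),
  var2 = c("UniRef90_A", "UniRef90_B", "UniRef90_C", "UniRef90_D", "UniRef90_E", "OTU2"),
  corr = c(0.95, 0.85, 0.92, 0.60, 0.97, 0.88),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep edges above a cutoff, optionally only OTU-gene
# edges, and return the number of connected components of the resulting graph.
count_components_at <- function(edges, cutoff, otu_gene_only = FALSE) {
  keep <- edges$corr > cutoff
  if (otu_gene_only) {
    # An edge is OTU-gene when exactly one endpoint is an OTU
    # (assumes OTU names start with "OTU").
    keep <- keep & (grepl("^OTU", edges$var1) != grepl("^OTU", edges$var2))
  }
  g <- graph_from_data_frame(edges[keep, ], directed = FALSE)
  components(g)$no
}

cutoffs <- c(0.8, 0.9)
all_edges <- sapply(cutoffs, function(x) count_components_at(edges, x))
otu_gene  <- sapply(cutoffs, function(x) count_components_at(edges, x, otu_gene_only = TRUE))
data.frame(cutoff = cutoffs, all_edges = all_edges, otu_gene_only = otu_gene)
```

Scanning a few cutoffs this way shows how sensitive the component count is to the threshold before committing to one value.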
What are you trying to achieve? Even after stringent filtering you will most likely still have too much data for useful visualization. Graph visualizations quickly degenerate into useless hairballs when the number of nodes grows. To get genes correlated with OTUs, you could try a (bi)clustering approach.
Actually, I don't really need a useful visualization. I only want the components: I expect each of them to be composed of one or more OTUs together with several genes. After that, I will treat these components as genomes.
So this can be framed as a clustering problem.
I know what you mean. Another point concerns the components found by igraph. The first one is much bigger than the others (no matter which data.frame of correlations I import): is it meaningful to set this component aside in order to recluster it?
Yes. Connected components are the first level of structure in the graph, but within each one you may have weak connections due to noise, so it is common to apply clustering to each connected component separately.
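As a rough illustration of reclustering the giant component, the sketch below picks the largest component by size and runs a community detection method on it. The toy graph (two tight groups joined by one weaker edge) is made up, and cluster_fast_greedy is only one possible choice of algorithm, not something recommended in this thread.

```r
library(igraph)

# Toy data: two tight groups joined by one weaker edge, plus a small
# separate component (made-up values).
edges <- data.frame(
  var1 = c("OTU1", "OTU1", "UniRef90_A", "OTU2", "OTU2", "UniRef90_C",
           "UniRef90_B", "OTU3"),
  var2 = c("UniRef90_A", "UniRef90_B", "UniRef90_B", "UniRef90_C",
           "UniRef90_D", "UniRef90_D", "OTU2", "UniRef90_E"),
  corr = c(0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.81, 0.99),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Select the largest connected component by size.
comp <- components(g)
giant_id <- which.max(comp$csize)
giant <- induced_subgraph(g, which(comp$membership == giant_id))

# Cluster inside the giant component, using the correlations as edge weights.
cl <- cluster_fast_greedy(giant, weights = E(giant)$corr)
groups <- split(V(giant)$name, as.integer(membership(cl)))
groups
```

On this toy graph the weak OTU2-UniRef90_B edge holds the giant component together, and the clustering step separates the two groups it connects.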
Is there an igraph function that applies clustering to connected components separately?
Just extract the submatrix corresponding to each connected component and use it as input to your clustering algorithm of choice.
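The submatrix extraction suggested above could look roughly like this, assuming the correlations are available as a symmetric matrix with row and column names. The matrix cor_mat, its values, and the 0.9 cutoff are made up for the example.

```r
library(igraph)

# Toy symmetric correlation matrix (made-up values).
ids <- c("OTU1", "OTU2", "UniRef90_A", "UniRef90_B", "UniRef90_C")
cor_mat <- matrix(0.1, 5, 5, dimnames = list(ids, ids))
diag(cor_mat) <- 1
cor_mat["OTU1", "UniRef90_A"] <- cor_mat["UniRef90_A", "OTU1"] <- 0.95
cor_mat["OTU1", "UniRef90_B"] <- cor_mat["UniRef90_B", "OTU1"] <- 0.92
cor_mat["OTU2", "UniRef90_C"] <- cor_mat["UniRef90_C", "OTU2"] <- 0.97

# Threshold into a 0/1 adjacency matrix and find the connected components.
adj <- (cor_mat > 0.9) * 1
g <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)
comp <- components(g)

# One correlation submatrix per connected component.
sub_mats <- lapply(
  split(V(g)$name, comp$membership),
  function(members) cor_mat[members, members, drop = FALSE]
)
sapply(sub_mats, dim)
```

Each element of sub_mats can then be passed to whatever clustering function is chosen for that component.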
Actually, I extract the giant component with
dg <- decompose.graph(g)
l <- get.data.frame(dg[[1]])
It returns the data.frame of correlations between my OTUs and genes for the giant component. If I build a graph from this data.frame, will it necessarily give me back the same connected component (the giant component)?
I found the package biclust, which could be interesting for clustering my giant component. The problem is that my data is in the form of a data.frame rather than a matrix. Do you know this package and, if so, whether it would suit my dataset and my analysis?
I don't know this package, but it looks like a good place to start.
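On the question above of whether a graph rebuilt from the extracted data.frame is the same connected component, a small check along these lines may help. It uses decompose and as_data_frame, the current igraph names for decompose.graph and get.data.frame, and it selects the giant component by size rather than relying on it being the first element of the list, since I am not aware of a guarantee on that ordering. The toy edges are made up.

```r
library(igraph)

# Toy edge list; the small component is deliberately listed first (made-up values).
edges <- data.frame(
  var1 = c("OTU3", "OTU1", "OTU1", "OTU2"),
  var2 = c("UniRef90_E", "UniRef90_A", "UniRef90_B", "UniRef90_B"),
  corr = c(0.99, 0.95, 0.92, 0.91),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Split into components and pick the largest one explicitly.
dg <- decompose(g)
giant <- dg[[which.max(sapply(dg, vcount))]]

# Round trip: edge data.frame of the giant component -> new graph.
l <- as_data_frame(giant, what = "edges")
g2 <- graph_from_data_frame(l, directed = FALSE)

# The rebuilt graph should be a single component with the same vertices and edges.
is_connected(g2)
setequal(V(g2)$name, V(giant)$name)
ecount(g2) == ecount(giant)
```

Because every vertex of a multi-vertex component appears in at least one of its edges, the edge list alone is enough to rebuild it as one connected graph.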
The dataframe I extract from the giant component looks like this (it is only the head) :
I used library(reshape2) to put it in the form of a matrix:
matrix=acast(df, var1~var2, value.var="corr")
The rows correspond to var1 and the columns to var2. Is it meaningful to proceed like this (or should I create a matrix with all the OTUXXX and UniRef90_XXX for both rows and columns)?
By the way, when I apply different clustering methods (biclustering, k-means), I get only one cluster, which corresponds to my giant component. I don't know how to apply a "real" clustering to it.
You have to apply clustering to each connected component separately. You can't get only one cluster with k-means because it will always find the number of clusters provided as input parameter.
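The point about k-means can be checked on a tiny example: the number of clusters is whatever is passed as centers. The matrix below is made up (rows as OTUs, columns as genes), and set.seed and nstart are only there to make the toy run repeatable.

```r
# Toy OTU-by-gene correlation matrix with two obvious groups of rows (made-up values).
m <- rbind(
  OTU1 = c(0.95, 0.92, 0.10, 0.05),
  OTU2 = c(0.93, 0.90, 0.08, 0.11),
  OTU3 = c(0.07, 0.12, 0.96, 0.91),
  OTU4 = c(0.10, 0.06, 0.94, 0.97)
)
colnames(m) <- c("UniRef90_A", "UniRef90_B", "UniRef90_C", "UniRef90_D")

set.seed(1)
km <- kmeans(m, centers = 2, nstart = 10)

# k-means returns exactly as many clusters as requested via `centers`.
length(unique(km$cluster))
km$cluster
```

If a run reports a single cluster, it is worth checking what was passed as centers and whether the output being inspected is the k-means assignment or the connected-component membership.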
That's what I meant. I applied clustering to the matrix that corresponds to my giant component, and I got back only one cluster. Maybe I don't understand your point. I can extract each connected component, build the matrix for each of them, and then apply clustering. When you say "You have to apply clustering to each connected component separately": that's what I did. I applied clustering to the giant component and not to the others, because I don't want to cluster the others (but I could).
Yes, this is the matrix you should be using as input. I thought your data was already like this.
No, unless you want to consider OTUs and Unirefs as equivalent.
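For reference, the rectangular matrix discussed above (var1 as rows, var2 as columns) can also be built in base R without reshape2. This is a sketch with a made-up data.frame that reuses the column names from the acast call; pairs missing from the data.frame are left as NA here, and how to fill them before clustering is a separate decision.

```r
# Toy long-format table with the same column names as in the acast call (made-up values).
df <- data.frame(
  var1 = c("OTU1", "OTU1", "OTU2"),
  var2 = c("UniRef90_A", "UniRef90_B", "UniRef90_B"),
  corr = c(0.95, 0.91, 0.93),
  stringsAsFactors = FALSE
)

# Rows are the distinct var1 values, columns the distinct var2 values.
rows <- sort(unique(df$var1))
cols <- sort(unique(df$var2))
m <- matrix(NA_real_, nrow = length(rows), ncol = length(cols),
            dimnames = list(rows, cols))

# Fill each (var1, var2) cell by name using a two-column character index.
m[cbind(df$var1, df$var2)] <- df$corr
m
```

Keeping OTUs on one axis and UniRef90 entries on the other matches the answer above: a square matrix with both on each axis would treat OTUs and UniRefs as the same kind of object.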