I have a data.frame with the correlations between OTUs and genes. These correlations will allow me to construct genomes. This data.frame has 1105854 rows.
      var1                var2  corr
1  OTU3978 UniRef90_A0A010P3Z8 0.846
2  OTU4011 UniRef90_A0A010P3Z8 0.855
3  OTU4929 UniRef90_A0A010P3Z8 0.829
4  OTU4317 UniRef90_A0A011P550 0.850
5  OTU4816 UniRef90_A0A011P550 0.807
6  OTU3902 UniRef90_A0A011QPQ2 0.836
7  OTU3339 UniRef90_A0A011RKI6 0.835
8  OTU1359 UniRef90_A0A011RLA7 0.801
9  OTU2085 UniRef90_A0A011RLA7 0.843
10 OTU3542 UniRef90_A0A011RLA7 0.866
11 OTU0473 UniRef90_A0A011TDE1 0.807
I use the igraph library to build a graph object.
g<-graph.data.frame(df)
Then I want to extract the components of this graph in order to construct genomes: each component would correspond to one genome.
I tried this command: genomes<-split(names(V(g)), components(g)$membership)
It returns several components, for example:
> genomes[[4]]
[1] "OTU2417" "UniRef90_A0A076H0Q4" "UniRef90_A0A2E8T3F8"
[4] "UniRef90_G5ZY43"
I check the OTU and the genes of each component against my OTU table and the EMBL-EBI database, which lets me determine whether each reconstructed genome is meaningful.
I also checked the documentation and found many other community detection methods: edge-betweenness, Louvain, multi-level, and so on. What is the main difference between the command I used (which gives me fairly meaningful components) and these algorithms (which also give me components)?
Thanks
Could the connected components (subgraphs) of the graph give me reconstructed genomes? In your opinion, should I first extract each subgraph and then apply clustering algorithms to each of them in order to "improve" the accuracy of the reconstructed genomes?
In your context, connected components represent groups whose members have no recorded correlation with members of any other group. Whether that meets your requirements for calling a group a genome is for you to decide. However, if you want to partition the connected components further (for example, because you think some of them represent more than one genome), you can apply a clustering algorithm to each one to try to reveal further structure. My point was that applying almost any clustering algorithm to the whole graph is largely pointless as a first step, because the clusters it finds will not span different connected components.
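To make the relationship between components and communities concrete, here is a minimal sketch on a made-up edge list (the OTU and gene names are hypothetical, not taken from the question). It assumes an undirected graph, which cluster_louvain() requires, and compares the component membership with the Louvain membership. The communities may subdivide a component, but none of them should mix vertices from two components.

```r
library(igraph)

# Hypothetical toy edge list with two disconnected groups
df <- data.frame(
  var1 = c("OTU1", "OTU1", "OTU2", "OTU3", "OTU3", "OTU4"),
  var2 = c("geneA", "geneB", "geneB", "geneC", "geneD", "geneD"),
  corr = c(0.85, 0.82, 0.81, 0.84, 0.83, 0.86)
)

# Build an undirected graph; "corr" is kept as an edge attribute
g <- graph_from_data_frame(df, directed = FALSE)

# Connected components: purely structural, no optimisation involved
comp <- components(g)$membership

# Louvain communities: modularity optimisation (unweighted here,
# because the edge attribute is named "corr", not "weight")
lou <- cluster_louvain(g)$membership

# Cross-tabulate: each community column should have non-zero counts
# in exactly one component row
print(table(component = comp, community = lou))

# For each community, count how many distinct components it touches
comps_per_community <- tapply(comp, lou, function(x) length(unique(x)))
```

If every entry of comps_per_community is 1, the clustering only refines the components, so running it on each component separately loses nothing and is usually cheaper on a large graph.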
Thanks for your reply. The biggest component I get is always the first one, whatever dataset I import, so I am going to apply clustering to that one.
I use genomes<-split(names(V(g)), components(g)$membership) and I extract the first component with big_one<-genomes[[1]]. Is there a way to get back an igraph object only for this component?
Check the decompose() function.
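As a rough sketch of how decompose() could be used here (the toy data below is hypothetical): it returns a list of igraph objects, one per connected component, so the largest one can be picked by vertex count instead of relying on it being first in the list. An alternative using induced_subgraph() is shown as well.

```r
library(igraph)

# Hypothetical toy edge list: one component of 4 vertices, one of 2
df <- data.frame(
  var1 = c("OTU1", "OTU1", "OTU2", "OTU3"),
  var2 = c("geneA", "geneB", "geneB", "geneC"),
  corr = c(0.85, 0.82, 0.81, 0.84)
)
g <- graph_from_data_frame(df, directed = FALSE)

# One igraph object per connected component
subgraphs <- decompose(g)

# Select the largest component by number of vertices
sizes <- sapply(subgraphs, vcount)
big_one <- subgraphs[[which.max(sizes)]]

# Alternative: build the subgraph directly from the membership vector
comp <- components(g)
big_two <- induced_subgraph(g, which(comp$membership == which.max(comp$csize)))

print(V(big_one)$name)
```

big_one is a regular igraph object, so a clustering function such as cluster_louvain(big_one) can be applied to it directly.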