Question

How To Turn A Two Column Data File Containing Pairs Into A series of clusters

6.5 years ago

stefano.campanaro • 0

Hi all, I intersected two data sets and I have a tab-delimited file with two columns, like:

AA      BB
BB      CC
CC      DD
BB      AA
EE      FF
FF      GG
GG      HH
II      JJ
JJ      II
II      KK
...     ...

and I would like to convert these one-to-one interactions into clusters: AA interacts with BB, BB with CC, and CC with DD, so AA, BB, CC and DD form one cluster. Similarly, EE, FF, GG and HH form another cluster, and none of its elements interact with elements of the first cluster, and so on. I would like to obtain something like:

AA BB CC DD 
EE FF GG HH
II JJ KK

...

Could you please help me work out how to do that?

software error • 1.5k views

updated 6.5 years ago by JC 13k • written 6.5 years ago by stefano.campanaro • 0

Question: if somewhere in the file you have the pair BB GG, do you want to get a single cluster (AA, BB, CC, DD, EE, FF, GG and HH)?

"I would like to obtain something like AA BB CC DD EE FF GG HH II JJ KK"

I don't understand this line.

6.5 years ago by Bastien Hervé 5.9k

Why is this tagged as a software error question?

6.5 years ago by Jean-Karim Heriche 27k

score 1 · Answer 1 · 2018-06-01

The data represents a graph in edge list format, i.e. each line is an edge of the graph specifying the two nodes that are connected. What you call clusters seems to be the connected components of this graph. So read the data into a graph structure then extract the connected components, e.g. in R with the igraph package, something like this (untested):

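library(igraph)
edge.list <- as.matrix(read.table("edge_list.txt",...)) # read the file as appropriate, turn data into a two-column matrix for use by igraph
G <- graph_from_edgelist(edge.list, directed = FALSE) # entries of a character matrix are used as vertex names
clusters <- components(G) # clusters$membership holds the component id of each vertex
split(V(G)$name, clusters$membership) # vertex names grouped by connected component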
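
For anyone who would rather not depend on a graph library, the same connected-components idea can be sketched in plain Python with a breadth-first search. This is only an illustration: the function name `connected_components` and the hard-coded `pairs` list (copied from the question) are my own choices, and a real script would read the pairs from the file instead.

```python
from collections import defaultdict, deque


def connected_components(pairs):
    """Return the connected components of an undirected graph given as an edge list."""
    # Build an adjacency map; every edge is stored in both directions.
    adjacency = defaultdict(set)
    for left, right in pairs:
        adjacency[left].add(right)
        adjacency[right].add(left)

    seen = set()
    components = []
    for start in adjacency:  # dicts preserve insertion order
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Pairs taken from the question (the "..." line is left out).
pairs = [("AA", "BB"), ("BB", "CC"), ("CC", "DD"), ("BB", "AA"),
         ("EE", "FF"), ("FF", "GG"), ("GG", "HH"),
         ("II", "JJ"), ("JJ", "II"), ("II", "KK")]

for component in connected_components(pairs):
    print(" ".join(component))
# prints:
# AA BB CC DD
# EE FF GG HH
# II JJ KK
```

Duplicate and reversed pairs such as BB AA need no special handling, because the adjacency sets simply absorb them.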

score 0 · Answer 2 · 2018-06-01

In pseudocode, if you want to try writing it yourself:

Initialize a 2D array: 2Darray

Import your file into a dataframe

Sort the dataframe by column one, then by column two, to get something like:

AA BB
BB AA
BB CC
CC DD

For each line of your dataframe:

If it is the first line of the dataframe, create an array and append the first and second elements of the line
Else, does the first element exist in array?
- If yes, append the second element to array (unless it is already there)
- Else, append array to 2Darray, reinitialize array, and append the first and second elements of the line to it

After the last line, append the final array to 2Darray. At the end, 2Darray will hold your clusters.

Untested
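
A note on the single pass above: it relies on the sort order bringing linked pairs next to each other, so a pair whose first element has not been seen yet (for example AA CC followed by BB CC) would open a new cluster even though CC links the two. One way around that, swapped in here as a different technique rather than a translation of the pseudocode, is a disjoint-set (union-find) structure. A minimal Python sketch follows; the name `cluster_pairs` and the example pairs are made up for illustration.

```python
def cluster_pairs(pairs):
    """Group elements linked by pairs, regardless of the order of the pairs."""
    parent = {}  # element -> parent element; a root is its own parent

    def find(node):
        # Register unseen elements as their own root.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every element on the path at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # union the two groups

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())


# AA and BB never appear on the same line, but both are linked to CC.
print(cluster_pairs([("AA", "CC"), ("BB", "CC"), ("EE", "FF")]))
# prints: [['AA', 'BB', 'CC'], ['EE', 'FF']]
```

No sorting of the input is needed with this approach, since each pair is merged into whatever groups its two elements already belong to.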

score 0 · Answer 3 · 2018-06-01

You can use Perl or Python for that:

#!/usr/bin/perl

use strict;
use warnings;

my %net_of  = ();  # node   => id of the net (cluster) it belongs to
my %members = ();  # net id => { node => 1, ... }
my $next_id = 0;

while (<>) {
  chomp;
  next unless /\S/;               # skip blank lines
  my ($x, $y) = split ' ', $_;    # split on any whitespace
  next unless defined $y;         # skip lines without two columns
  my $nx = $net_of{$x};
  my $ny = $net_of{$y};
  if (!defined($nx) and !defined($ny)) {
    # neither node seen before: open a new net
    my $id = $next_id++;
    $net_of{$x} = $id;
    $net_of{$y} = $id;
    $members{$id}{$x} = 1;
    $members{$id}{$y} = 1;
  }
  elsif (defined($nx) and !defined($ny)) {
    # $x already belongs to a net: add $y to it
    $net_of{$y} = $nx;
    $members{$nx}{$y} = 1;
  }
  elsif (!defined($nx) and defined($ny)) {
    # $y already belongs to a net: add $x to it
    $net_of{$x} = $ny;
    $members{$ny}{$x} = 1;
  }
  elsif ($nx != $ny) {
    # the pair links two existing nets: merge net $ny into net $nx
    foreach my $node (keys %{ $members{$ny} }) {
      $net_of{$node} = $nx;
      $members{$nx}{$node} = 1;
    }
    delete $members{$ny};
  }
}

# write one net per line, in order of creation
foreach my $id (sort { $a <=> $b } keys %members) {
  print join("\t", sort keys %{ $members{$id} }), "\n";
}

Example:

$ perl graph.pl < data.txt
AA      BB      CC      DD
EE      FF      GG      HH
II      JJ      KK
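
Since Python was mentioned as an alternative, here is a rough sketch of the same incremental idea in Python. It keeps one shared set per group and folds two groups together when a pair bridges them (for example the hypothetical BB GG line raised in the comments). The function name `merge_pairs` and the demo lines are assumptions for illustration; in practice you would pass an open file handle instead of a list.

```python
def merge_pairs(lines):
    """Build groups incrementally from whitespace-separated two-column lines."""
    group_of = {}  # node -> the set object holding that node's group
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        left, right = fields[0], fields[1]
        group_left = group_of.setdefault(left, {left})
        group_right = group_of.setdefault(right, {right})
        if group_left is not group_right:
            # Fold the smaller group into the larger one.
            if len(group_left) < len(group_right):
                group_left, group_right = group_right, group_left
            group_left |= group_right
            for node in group_right:
                group_of[node] = group_left
    # Several nodes share one set object, so de-duplicate by identity.
    unique = {id(group): group for group in group_of.values()}
    return sorted(sorted(group) for group in unique.values())


demo = ["AA BB", "BB CC", "EE FF", "FF GG", "BB GG", "II JJ"]
for group in merge_pairs(demo):
    print("\t".join(group))
```

With the bridging BB GG line present, the AA/BB/CC and EE/FF/GG groups come out as one line; without it they stay separate.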