You have probably figured this out by now. In case you still need a starting point the next time you face a similar problem, here you go.
If I got the problem statement right, you want all unordered pairs of values of column B for each value of column A, keeping everything in the same order as they appear.
This involves 2 steps:
- Gather all the B values for each A entry
- Given a list of B values, build the list of pairs using two indices, the second always starting one past the first:
- (b_1,b_2), (b_1,b_3), ... , (b_1,b_n) ; (b_2,b_3), ... , (b_2,b_n) ; ... ; (b_n-1,b_n)
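As a quick illustration of the second step, here is a minimal Python sketch (Python is used only for illustration; the function name `ordered_pairs` is made up for this example). The inner index always starts one past the outer one, so each pair is produced once and in input order:

```python
def ordered_pairs(values):
    """Return every pair (values[i], values[j]) with i < j, in input order."""
    pairs = []
    for i in range(len(values) - 1):
        # j starts at i + 1, so (b_m, b_n) is only produced for m < n
        for j in range(i + 1, len(values)):
            pairs.append((values[i], values[j]))
    return pairs


print(ordered_pairs(["b1", "b2", "b3", "b4"]))
```

A list of n values yields n(n-1)/2 pairs, so the four values above give six. In Python, `itertools.combinations(values, 2)` produces the same sequence.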
Step 1 is easily achieved with an associative array (as AWK aptly names them), also known as a hash (in Perl), dictionary (in Python) or map (in most functional languages).
Step 2 amounts to three nested loops: one over the unique A entries, and two over the B values to get all pairs (the inner index starts one past the outer one, which avoids producing (b_m,b_n) for m >= n).
If the number of records were huge and the keys (column A values) were guaranteed to be grouped in sequence, steps 1 and 2 could be done in a single pass over the data. I won't assume that the A entries are grouped, though.
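For the grouped case, a single pass could look like the following Python sketch. It assumes whitespace-separated two-column input in which all rows sharing an A value are adjacent; `stream_pairs` is a hypothetical name. Only the B values of the current key are held in memory at any time.

```python
from itertools import combinations, groupby


def stream_pairs(lines):
    """Yield (a, b_i, b_j) with i < j, assuming rows with equal A are adjacent."""
    # Skip blank lines; every remaining line is assumed to have two columns
    rows = (line.split() for line in lines if line.strip())
    # groupby only merges consecutive rows with the same key, which is
    # exactly why the input must already be grouped by column A
    for head, group in groupby(rows, key=lambda row: row[0]):
        tails = [row[1] for row in group]
        for left, right in combinations(tails, 2):
            yield head, left, right


sample = ["a 1", "a 2", "a 3", "b 7", "b 8"]
for head, left, right in stream_pairs(sample):
    print(head, left, right, sep="\t")
```

If a key reappears later in the input, `groupby` treats it as a new group and the pairs spanning the two runs are missed, which is why the full solutions do not rely on grouping.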
The code below should be pretty easy to follow. Should you need comments on the syntax, just ask in the comments.
Perl take
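#!/usr/bin/env perl
use strict; use warnings;
my (%recs, @heads);    # B values per A key; A keys in order of first appearance
while (<>) {
    next unless /\S/;                               # skip blank lines
    my ($head, $tail) = split ' ';                  # awk-style split, ignores leading whitespace
    push @heads, $head unless exists $recs{$head};  # first time this key is seen
    push @{ $recs{$head} }, $tail;
}
for my $head (@heads) {
    my $vals = $recs{$head};
    for my $i (0 .. $#$vals - 1) {
        for my $j ($i + 1 .. $#$vals) {             # $j > $i: each pair is printed once
            print $head, "\t", $vals->[$i], "\t", $vals->[$j], "\n";
        }
    }
}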
You may run and modify it online with your sample data, if you want.
Note that because Perl hashes are unordered, we need to keep the A entries in an array (@heads) in the order they appear.
AWK take
AWK is more elegant in my opinion. Note that it requires GNU awk (gawk 4.0 or later) for true multidimensional arrays (arrays of arrays).
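#!/usr/bin/gawk -f
# The -f is required for the script to be executable; adjust the gawk path if needed.
BEGIN { OFS = "\t" }
NF {
    if (!($1 in count)) heads[++n] = $1   # remember A keys in order of first appearance
    rec[$1][++count[$1]] = $2
}
END {
    # "for (head in rec)" visits keys in an unspecified order, so walk heads[] instead
    for (h = 1; h <= n; h++) {
        head = heads[h]
        for (i = 1; i < count[head]; i++)
            for (j = i + 1; j <= count[head]; j++)
                print head, rec[head][i], rec[head][j]
    }
}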
Again, try it online.