How to make combinations of the values in column 2
4.1 years ago

Hello, I need to make random, non-redundant combinations of the strings in column 2 for each value in column 1. The input and output should look as follows. Note: for simplicity, each column 1 value here has only three string values.

INPUT:

Uniprot_ID_A Uniprot_ID_B
P00001 Q00001
P00001 Q00002
P00001 Q00003
P00002 R00001
P00002 R00002
P00002 R00003

OUTPUT:

Uniprot_ID_A Combinations_of_Uniprot_ID_B
P00001 <tab> Q00001 <tab> Q00002
P00001 <tab> Q00001 <tab> Q00003
P00001 <tab> Q00002 <tab> Q00003
P00002 <tab> R00001 <tab> R00002
P00002 <tab> R00001 <tab> R00003
P00002 <tab> R00002 <tab> R00003

The combinations should be tab-separated, and the first column should be printed in the output. As I am not a coding expert, simple solutions will be highly appreciated. Thanks in advance.

combinations random combinations perl awk python • 857 views

What have you already tried? Please share with us

4.0 years ago
i-blis • 0

You probably figured it out (or had it figured out) by now. In case you'd still need some enlightenment to get started the next time you face a similar problem, here you go.

If I got the problem statement right, you want all unordered pairs of column B values for each value of column A, keeping everything in the order in which it appears in the input.

This involves 2 steps:

  1. Gather all the B values for each A entry.
  2. Given a list of B values, build the list of pairs using two indices, the second always ahead of the first:
     (b_1,b_2), (b_1,b_3), ... , (b_1,b_n) ; (b_2,b_3), ... , (b_2,b_n) ; ... ; (b_n-1,b_n)

Step 1 is easily achieved with an associative array (as AWK aptly names them), also known as a hash (in Perl), dictionary (in Python) or map (in most functional languages).
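
As a rough illustration of step 1 in Python (one of the languages the question mentions), a plain dict can collect the B values for each A entry. This is only a sketch: the function name `group_by_first_column` and the sample lines are made up for the example, and it relies on dicts preserving insertion order (guaranteed since Python 3.7), so no separate list of keys is needed.

```python
def group_by_first_column(lines):
    """Map each column-A value to the list of its column-B values."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or single-column lines
        head, tail = fields[0], fields[1]
        # setdefault creates the list the first time a key is seen
        groups.setdefault(head, []).append(tail)
    return groups


# Hypothetical sample; note that the P00001 lines are not all adjacent.
sample = [
    "P00001 Q00001",
    "P00001 Q00002",
    "P00002 R00001",
    "",
    "P00001 Q00003",
]
groups = group_by_first_column(sample)
for head, tails in groups.items():
    print(head, tails)
```

The keys come back in first-seen order even though the P00001 lines are interleaved with another key.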

Step 2 amounts to three nested loops: one over the unique A entries, and two over the B values of each entry, with the inner index always starting one past the outer one so that (b_m,b_n) is never produced for m >= n.
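
The two inner loops might look like this in Python. The helper name `unordered_pairs` is hypothetical, and the B values are taken from the sample input. The standard library's `itertools.combinations(values, 2)` produces the same pairs in the same order, which the last line checks.

```python
from itertools import combinations


def unordered_pairs(values):
    """Return every pair (values[i], values[j]) with i < j, in input order."""
    pairs = []
    for i in range(len(values) - 1):
        # start j one past i so no pair is repeated or reversed
        for j in range(i + 1, len(values)):
            pairs.append((values[i], values[j]))
    return pairs


values = ["Q00001", "Q00002", "Q00003"]
pairs = unordered_pairs(values)
for first, second in pairs:
    print("P00001", first, second, sep="\t")

# The hand-written loops agree with the standard library helper.
assert pairs == list(combinations(values, 2))
```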

If the number of records were huge, and provided that the keys (column A values) are grouped in sequence, steps 1 and 2 could be done in a single pass over the data. But I won't assume here that the A entries are grouped.
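
For completeness, here is a sketch of that single-pass variant in Python. It is valid only under the assumption stated above, namely that all lines sharing a column A value are adjacent. The generator name `stream_pairs` is invented for the example; `itertools.groupby` and `itertools.combinations` are standard library functions.

```python
from itertools import combinations, groupby


def stream_pairs(lines):
    """Yield (head, b_i, b_j) tuples, keeping only one group in memory.

    Assumes all lines with the same column-A value are adjacent.
    """
    rows = (line.split() for line in lines)
    rows = (row for row in rows if len(row) >= 2)  # skip blank lines
    for head, group in groupby(rows, key=lambda row: row[0]):
        tails = [row[1] for row in group]
        for first, second in combinations(tails, 2):
            yield head, first, second


sample = [
    "P00001 Q00001",
    "P00001 Q00002",
    "P00001 Q00003",
    "",
    "P00002 R00001",
    "P00002 R00002",
    "P00002 R00003",
]
result = list(stream_pairs(sample))
for record in result:
    print("\t".join(record))
```

If the same key shows up again later in the input, `groupby` starts a new group for it, so pairs that span the two runs would be missed. That is why the grouping assumption matters.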

The code below should be pretty easy to follow. Should you need comments on the syntax, just ask in the comments.

Perl take

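#!/usr/bin/env perl
use strict; use warnings;

my (%recs, @heads);
while (<>) {
    my ($head, $tail) = split ' ';    # split on whitespace, ignoring leading blanks
    next unless defined $tail;        # skip blank or single-column lines
    push @heads, $head unless exists $recs{$head};   # first-seen order of A values
    push @{$recs{$head}}, $tail;      # collect the B values of each A value
}

for my $head (@heads) {
    my $vals = $recs{$head};
    for my $i (0 .. $#$vals - 1) {
        for my $j ($i + 1 .. $#$vals) {   # $j > $i, so each pair is printed once
            print join("\t", $head, $vals->[$i], $vals->[$j]), "\n";
        }
    }
}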

You may run and modify it online with your sample data, if you want.

Note that because Perl hashes are unordered, we need to keep the A entries in an array (@heads) in the order they appear.

AWK take

AWK is more elegant in my opinion. Note that it requires GNU awk (version 4.0 or later) for its arrays of arrays.

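#!/usr/bin/gawk -f
# Adjust the path above if gawk lives elsewhere, or run: gawk -f <script> <input>

BEGIN { OFS = "\t" }

# Skip blank lines; record each A value the first time it is seen,
# then append the B value to that A value's subarray.
NF >= 2 {
    if (!($1 in count)) heads[++n] = $1
    rec[$1][++count[$1]] = $2
}

END {
    # "for (head in rec)" visits keys in an unspecified order,
    # so walk heads[] to keep the input order.
    for (k = 1; k <= n; k++) {
        head = heads[k]
        for (i = 1; i < count[head]; i++)
            for (j = i + 1; j <= count[head]; j++)
                print head, rec[head][i], rec[head][j]
    }
}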

Again, try it online.
