! /usr/bin/perl -w

Question

Consolidating Two Columns

3

Entering edit mode

13.3 years ago

Kiriya ▴ 100

I have list two columns like this:

Col1    Col2
Name1    AB, AC, CF
Name1    AF, AV, CG, HG
Name2    BB, BF, CD, CK, JK
Name2    BC

I want to consolidate the two columns so they would like the following:

Col1    Col2
Name1    AB, AC, CF, AF, AV, CG, HG
Name2    BB, BF, CD, CK, JK, BC

Does anyone has script to do this?

programming python perl awk • 3.6k views

ADD COMMENT • link updated 13.3 years ago by Rm 8.3k • written 13.3 years ago by Kiriya ▴ 100

1

Entering edit mode

I have KEGG results and I want to put all the sequences that are in one pathway together. I didn't want to make it too complicated. Here is an example: C5-Branched dibasic acid metabolism KK_Contig_49268 C5-Branched dibasic acid metabolism KK_Contig_12740, KK_Contig_52938, KK_Contig_51604, KK_Contig_9479, KK_Contig_49400, KK_Contig_28354 Glycolysis / Gluconeogenesis KK_Contig_50816, KK_Contig_8607, KK_Contig_15245, KK_Contig_22682 Glycolysis / Gluconeogenesis KK_Contig_27393

ADD REPLY • link 13.3 years ago by Kiriya ▴ 100

0

Entering edit mode

Hi, can you please explain what this has to do with bioinformatics? Pure programming questions are discouraged, please read the FAQ.

ADD REPLY • link 13.3 years ago by Michael 55k

0

Entering edit mode

Are the 2 columns separated by a tab?

ADD REPLY • link 13.3 years ago by Aleksandr Levchuk 3.2k

0

Entering edit mode

Yes, they are separated by a tab.

ADD REPLY • link 13.3 years ago by Kiriya ▴ 100

0

Entering edit mode

Do you care about the order of the entries being consolidated? Should duplicates be removed or not?

ADD REPLY • link 13.3 years ago by Peter 6.0k

0

Entering edit mode

No, I don't care about the order and duplicates should be removed in the second column.

ADD REPLY • link 13.3 years ago by Kiriya ▴ 100

score 2 · Answer 1 · 2011-08-21

A script to do this would normally consist of two steps (irrespective of which programming language you use):

Read the input file and collect all data in key-value data structure (this would be called a hash table in Perl, a dictionary in Python, or an unordered map in C++). The key would be what is in column 1 and the value would be collecting all results pertaining to the key in question.
Iterate over all the (sorted) keys in the your key-value data structure and print out all the pairs to produce the output file.

score 2 · Answer 2 · 2011-08-22

2

Entering edit mode

13.3 years ago

Aaronquinlan 12k

Check out the "groupBy" command in my filo package. It is designed to handle exactly this problem. GroupBy only requires that 1) your input data be sorted by the columns that you want to group/consolidate, and 2), your data are tab-delimited. Here is an example with your data:

$ cat data.txt 
#Col1   Col2
Name1   AB, AC, CF
Name1   AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK
Name2   BC

$ groupBy -i data.txt -grp 1 -opCols 2 -ops collapse
Name1   AB, AC, CF,AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK,BC

Or, more briefly:

$ groupBy -i data.txt -g 1 -c 2 -o collapse
Name1   AB, AC, CF,AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK,BC

Here are some other usage examples:

ADD COMMENT • link 13.3 years ago by Aaronquinlan 12k

0

Entering edit mode

Yes, this seems like a very useful tool although I didn't have to use it. One thing that is not clear is what do I need to do to install the package? I already downloaded it. Sorry if this sounds like a silly question - I am more of a biologist!

ADD REPLY • link 13.3 years ago by Kiriya ▴ 100

0

Entering edit mode

One should merely need to download the .tar.gz file (or clone it with git), unpack the tarball, cd into the newly created directory, and type "make".

ADD REPLY • link 13.3 years ago by Aaronquinlan 12k

score 2 · Answer 3 · 2011-08-23

2

Entering edit mode

13.3 years ago

Rm 8.3k

awk -F"t" '{if(NR==1){print} else hsh[$1]=hsh[$1]$2", "}END{for (i in hsh){print i" "hsh[i]}}' input_file | sed 's/, $//'

Output:

Col1    Col2
Name1 AB, AC, CF, AF, AV, CG, HG
Name2 BB, BF, CD, CK, JK, BC

ADD COMMENT • link 13.3 years ago by Rm 8.3k

score 1 · Answer 4 · 2011-08-21

Assuming the 2 columns separated by a tab, sample input file will be:

echo -e "Col1\tCol2
Name1\tAB, AC, CF
Name1\tAF, AV, CG, HG
Name2\tBB, BF, CD, CK, JK
Name2\tBC" > sample1

[?]

One-liner-solution:

cat sample1 | ruby -e 'c = Hash.new([]); while l = STDIN.gets; next if STDIN.lineno == 1; key, values = l.chomp.split "\t"; values = values.split ", "; c[key] += values; end; puts "Col1\tCol2"; c.each do |key, values|; vjoined=values.join ", "; puts "#{key}\t#{vjoined}"; end'

[?]

Output:

Col1    Col2
Name1   AB, AC, CF, AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK, BC

[?]

The code in readable format (filename: consolidate.rb):

#!/usr/bin/env ruby

c = Hash.new []

while l = STDIN.gets
  next if STDIN.lineno == 1 # Cut the header off (first line)
  key, values = l.chomp.split "\t" 
  values = values.split ", "
  c[key] += values
end

puts "Col1\tCol2"
c.each do |key, values|
  puts "#{key}\t#{values.join ', '}"
end

Aaronquinlan · Answer 5 · 2011-08-22

1

Entering edit mode

13.3 years ago

Anjan ▴ 840

! /usr/bin/perl -w

use strict;

open (F, "file") || die "cannot open file" ; #file has data in two tab-delimited columns.

my %hoa; # a hash of arrays.

while (<F>){
   chomp;
   my @line = split/\t/;
   push (@{$hoa{$line[0]}}, $line[1]);
}

foreach my $k (sort keys %hoa){
   my $data = join(",", @{$hoa{$k}});
   print("$k\t$data\n");
}
#amen.

have not checked script...but you get the idea.

ADD COMMENT • link updated 13.3 years ago by Aaronquinlan 12k • written 13.3 years ago by Anjan ▴ 840

0

Entering edit mode

I just tweaked your "join" for correctness.

ADD REPLY • link 13.3 years ago by Aaronquinlan 12k

0

Entering edit mode

I guess it's a perl thing to end your script in "amen"?

ADD REPLY • link 13.3 years ago by brentp 24k

0

Entering edit mode

thanks @brentp, goofed my editing of the code example.

ADD REPLY • link 13.3 years ago by Aaronquinlan 12k

0

Entering edit mode

The script is working well and thanks a lot!

ADD REPLY • link 13.3 years ago by Kiriya ▴ 100

0

Entering edit mode

@aaronQuinlan: thanks for the correction.

ADD REPLY • link 13.3 years ago by Anjan ▴ 840