Assuming the two columns are separated by a tab, a sample input file can be created with:
echo -e "Col1\tCol2
Name1\tAB, AC, CF
Name1\tAF, AV, CG, HG
Name2\tBB, BF, CD, CK, JK
Name2\tBC" > sample1
One-liner solution:
cat sample1 | ruby -e 'c = Hash.new([]); while l = STDIN.gets; next if STDIN.lineno == 1; key, values = l.chomp.split "\t"; values = values.split ", "; c[key] += values; end; puts "Col1\tCol2"; c.each do |key, values|; vjoined=values.join ", "; puts "#{key}\t#{vjoined}"; end'
Output:
Col1 Col2
Name1 AB, AC, CF, AF, AV, CG, HG
Name2 BB, BF, CD, CK, JK, BC
The code in readable format (filename: consolidate.rb):
#!/usr/bin/env ruby
c = Hash.new []
while l = STDIN.gets
  next if STDIN.lineno == 1 # Cut the header off (first line)
  key, values = l.chomp.split "\t"
  values = values.split ", "
  c[key] += values
end
puts "Col1\tCol2"
c.each do |key, values|
  puts "#{key}\t#{values.join ', '}"
end
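A side note on `Hash.new []` in the script above: the empty array is a single default object shared by every missing key. The script is safe only because `+=` builds a new array and assigns it to the key. A small sketch of the difference, with illustrative variable names that are not part of the original script:

```ruby
# Safe: += creates a new array and stores it under the key.
assigned = Hash.new([])
assigned["Name1"] += ["AB"]
assigned["Name2"] += ["BB"]

# Unsafe: << mutates the shared default array and never stores a key.
mutated = Hash.new([])
mutated["Name1"] << "AB"
mutated["Name2"] << "BB"

p assigned            # each key has its own array
p mutated             # the hash itself is still empty
p mutated["Name3"]    # the shared default now holds both values
```

If the loop were ever changed to append with `<<`, the block form `Hash.new { |h, k| h[k] = [] }` would avoid the shared default.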
I have KEGG results and I want to put all the sequences that are in one pathway together. I didn't want to make it too complicated. Here is an example:
C5-Branched dibasic acid metabolism	KK_Contig_49268
C5-Branched dibasic acid metabolism	KK_Contig_12740, KK_Contig_52938, KK_Contig_51604, KK_Contig_9479, KK_Contig_49400, KK_Contig_28354
Glycolysis / Gluconeogenesis	KK_Contig_50816, KK_Contig_8607, KK_Contig_15245, KK_Contig_22682
Glycolysis / Gluconeogenesis	KK_Contig_27393
Hi, can you please explain what this has to do with bioinformatics? Pure programming questions are discouraged, please read the FAQ.
Are the 2 columns separated by a tab?
Yes, they are separated by a tab.
Do you care about the order of the entries being consolidated? Should duplicates be removed or not?
No, I don't care about the order and duplicates should be removed in the second column.
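Since duplicates should be removed and order does not matter, consolidate.rb can be adapted to collect each key's values in a `Set` instead of an array. This is a minimal sketch, not a tested pipeline: the `consolidate` method name and the inline sample rows are made up for illustration, and it assumes a header line plus tab-separated columns as confirmed above.

```ruby
#!/usr/bin/env ruby
require "set"

# Group the second column by the first column, dropping duplicate entries.
# `lines` is any enumerable of strings (an Array, STDIN, File.foreach(...)).
def consolidate(lines)
  groups = Hash.new { |hash, key| hash[key] = Set.new }
  lines.each_with_index do |line, index|
    next if index.zero?                      # skip the header line
    key, values = line.chomp.split("\t", 2)
    next if values.nil?                      # skip blank or malformed lines
    groups[key].merge(values.split(/,\s*/))  # a Set ignores repeated entries
  end
  groups
end

# Hypothetical sample rows containing repeated entries.
sample = [
  "Col1\tCol2\n",
  "Name1\tAB, AC, CF\n",
  "Name1\tAF, AB, CG\n",
  "Name2\tBB, BC\n",
  "Name2\tBC\n"
]

puts "Col1\tCol2"
consolidate(sample).each do |key, values|
  puts "#{key}\t#{values.to_a.join(', ')}"
end
```

To run it on a real file, replace `sample` with `STDIN` (and pipe the file in) or with `File.foreach("sample1")`; both yield lines one at a time, which is all the method needs.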