Question

Difference Analysis by shell scripting

0

6.3 years ago

waqarlodhi93 • 0

I have a dataset of 61 lists. I want to compare each list with the remaining 60 and extract (for example with grep) the lines that differ between each list and the others.

This is what a list looks like:

C 00010 Glycolysis / Gluconeogenesis [PATH:hca00010] 
C 00020 Citrate cycle (TCA cycle) [PATH:hca00020] 
C 00030 Pentose phosphate pathway [PATH:hca00030] 
C 00040 Pentose and glucuronate interconversions [PATH:hca00040] 
C 00051 Fructose and mannose metabolism [PATH:hca00051] 
C 00052 Galactose metabolism [PATH:hca00052] 
C 00500 Starch and sucrose metabolism [PATH:hca00500] 
C 00520 Amino sugar and nucleotide sugar metabolism [PATH:hca00520]

The lists contain nearly the same data. I want the output to show only the lines that differ among the lists, together with the name of the list each line comes from.

shell scripting Linux • 1.1k views

updated 6.3 years ago by cpad0112 21k • written 6.3 years ago by waqarlodhi93 • 0

3

What about just:

cat your_files* | sort | uniq -c

?

6.3 years ago by Pierre Lindenbaum 164k

0

uniq -c or uniq -cu? The OP wants the dissimilar lines. Since those occur only once, the count printed by -c is redundant; uniq -u should be sufficient.

ADD REPLY • link 6.3 years ago by cpad0112 21k
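To make the difference between the two suggestions above concrete, here is a minimal sketch on two throwaway files (x.txt and y.txt are hypothetical names, not the OP's data). It assumes no line is repeated inside a single file; a line duplicated within one file would be counted twice and would therefore be dropped by uniq -u.

```shell
# Two tiny demo files in a temporary directory (hypothetical data).
tmp=$(mktemp -d)
printf 'A\nB\n' > "$tmp/x.txt"
printf 'B\nC\n' > "$tmp/y.txt"

# uniq -c: every distinct line, prefixed with the number of times it occurs.
cat "$tmp"/*.txt | sort | uniq -c

# uniq -u: only the lines that occur exactly once across all files.
singletons=$(cat "$tmp"/*.txt | sort | uniq -u)
printf '%s\n' "$singletons"

rm -rf "$tmp"
```

Note that neither form reports which file a line came from, which the OP also asked for.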

score 1 · Answer 1 · 2018-08-04

Output (adjust the code to match the delimiters in your original files, and make sure the directory contains only the input text files):

$ grep . *.txt | sed "s/:/\t/" | sort -k3,3 | uniq -f3 -u

or

$ awk '{print FILENAME,$0}' *.txt | sort -k3,3 | uniq -f3 -u
a.txt   C 00030 Pentose phosphate pathway [PATH:hca00030]

Input (from a.txt, b.txt, c.txt):

$ grep -H . *.txt 
a.txt:C 00010 Glycolysis / Gluconeogenesis [PATH:hca00010] 
a.txt:C 00020 Citrate cycle (TCA cycle) [PATH:hca00020] 
a.txt:C 00030 Pentose phosphate pathway [PATH:hca00030] 
a.txt:C 00040 Pentose and glucuronate interconversions [PATH:hca00040] 
a.txt:C 00051 Fructose and mannose metabolism [PATH:hca00051] 
a.txt:C 00052 Galactose metabolism [PATH:hca00052] 
a.txt:C 00500 Starch and sucrose metabolism [PATH:hca00500] 
a.txt:C 00520 Amino sugar and nucleotide sugar metabolism [PATH:hca00520]
b.txt:C 00010 Glycolysis / Gluconeogenesis [PATH:hca00010] 
b.txt:C 00020 Citrate cycle (TCA cycle) [PATH:hca00020] 
b.txt:C 00040 Pentose and glucuronate interconversions [PATH:hca00040] 
b.txt:C 00051 Fructose and mannose metabolism [PATH:hca00051] 
b.txt:C 00052 Galactose metabolism [PATH:hca00052] 
b.txt:C 00520 Amino sugar and nucleotide sugar metabolism [PATH:hca00520]
c.txt:C 00052 Galactose metabolism [PATH:hca00052] 
c.txt:C 00500 Starch and sucrose metabolism [PATH:hca00500] 
c.txt:C 00520 Amino sugar and nucleotide sugar metabolism [PATH:hca00520]
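The commands above report lines that occur in exactly one file. If "dissimilar" is instead taken to mean "missing from at least one list" (an assumption about what the OP wants), a sketch along the following lines may help. The function name not_in_all is hypothetical, the demo data is a shortened version of the lists in this thread, and the file count assumes that no input file is empty.

```shell
# not_in_all FILE...: print each distinct line that is missing from at least
# one FILE, followed by a tab and the comma-separated files that contain it.
not_in_all() {
    awk '
        FNR == 1 { nfiles++ }              # count input files (assumes none is empty)
        { sub(/[ \t\r]+$/, "") }           # ignore trailing whitespace
        $0 == "" { next }                  # skip blank lines
        !(($0, FILENAME) in seen) {        # count each line once per file
            seen[$0, FILENAME] = 1
            count[$0]++
            files[$0] = (files[$0] == "" ? FILENAME : files[$0] "," FILENAME)
        }
        END {
            for (line in count)
                if (count[line] < nfiles)  # absent from at least one file
                    print line "\t" files[line]
        }
    ' "$@" | sort
}

# Demo on shortened, illustrative data.
tmp=$(mktemp -d)
printf 'C 00010 Glycolysis\nC 00030 Pentose phosphate pathway\nC 00052 Galactose metabolism\n' > "$tmp/a.txt"
printf 'C 00010 Glycolysis\nC 00052 Galactose metabolism\n' > "$tmp/b.txt"
printf 'C 00052 Galactose metabolism\n' > "$tmp/c.txt"
result=$(cd "$tmp" && not_in_all a.txt b.txt c.txt)
printf '%s\n' "$result"
rm -rf "$tmp"
```

With 61 real lists the call would be something like not_in_all *.txt, run from the directory that holds only the input files. Stripping trailing whitespace first matters here because the sample lines in the question end inconsistently, some with a trailing space and some without.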