Difference Analysis by shell scripting
1
0
Entering edit mode
6.3 years ago

I have a dataset of 61 lists. I want to compare each list with the remaining 60 and extract the lines that differ between each list and the others.

This is what a list looks like:

C 00010 Glycolysis / Gluconeogenesis [PATH:hca00010] 
C 00020 Citrate cycle (TCA cycle) [PATH:hca00020] 
C 00030 Pentose phosphate pathway [PATH:hca00030] 
C 00040 Pentose and glucuronate interconversions [PATH:hca00040] 
C 00051 Fructose and mannose metabolism [PATH:hca00051] 
C 00052 Galactose metabolism [PATH:hca00052] 
C 00500 Starch and sucrose metabolism [PATH:hca00500] 
C 00520 Amino sugar and nucleotide sugar metabolism [PATH:hca00520]

The lists contain nearly identical data. I want the output to show only the lines that differ among the lists, together with the name of the list each line comes from.

what about just:

cat your_files* | sort | uniq -c

?

uniq -c or uniq -cu? The OP wants the dissimilar lines. Since a dissimilar line occurs only once, the count printed by -c is redundant; uniq -u should be sufficient.
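
To make the difference concrete, here is a minimal sketch on two hypothetical files (the file contents and the temporary directory are made up for illustration, not taken from the OP's data). It shows that uniq -c lists every distinct line with a count, while uniq -u keeps only the lines that occur exactly once:

```shell
#!/bin/sh
# Hypothetical sample data: two tiny lists that share one line.
workdir=$(mktemp -d)
printf 'C 00010 Glycolysis\nC 00030 Pentose phosphate pathway\n' > "$workdir/a.txt"
printf 'C 00010 Glycolysis\n' > "$workdir/b.txt"

# uniq -c: every distinct line, prefixed with how often it occurs.
counted=$(cat "$workdir"/*.txt | sort | uniq -c)

# uniq -u: only the lines that occur exactly once across all files.
unique_only=$(cat "$workdir"/*.txt | sort | uniq -u)

printf 'uniq -c:\n%s\n' "$counted"
printf 'uniq -u:\n%s\n' "$unique_only"

rm -rf "$workdir"
```

Note that this approach drops the file name, so it tells you which lines are unique but not which list they came from.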

6.3 years ago

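Output (adjust the code to match the delimiters in your original text, and make sure the directory contains only the input text files). Both pipelines sort on the third field (the pathway ID); uniq -f3 -u then skips the first three fields (file name, C, ID) and keeps only the lines whose remaining text occurs once: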

$ grep . *.txt | sed "s/:/\t/" | sort -k3,3 | uniq -f3 -u

or

$ awk '{print FILENAME,$0}' *.txt | sort -k3,3  | uniq -f3 -u
a.txt   C 00030 Pentose phosphate pathway [PATH:hca00030]

input (from a.txt, b.txt, c.txt):

$ grep -H . *.txt 
a.txt:C 00010 Glycolysis / Gluconeogenesis [PATH:hca00010] 
a.txt:C 00020 Citrate cycle (TCA cycle) [PATH:hca00020] 
a.txt:C 00030 Pentose phosphate pathway [PATH:hca00030] 
a.txt:C 00040 Pentose and glucuronate interconversions [PATH:hca00040] 
a.txt:C 00051 Fructose and mannose metabolism [PATH:hca00051] 
a.txt:C 00052 Galactose metabolism [PATH:hca00052] 
a.txt:C 00500 Starch and sucrose metabolism [PATH:hca00500] 
a.txt:C 00520 Amino sugar and nucleotide sugar metabolism [PATH:hca00520]
b.txt:C 00010 Glycolysis / Gluconeogenesis [PATH:hca00010] 
b.txt:C 00020 Citrate cycle (TCA cycle) [PATH:hca00020] 
b.txt:C 00040 Pentose and glucuronate interconversions [PATH:hca00040] 
b.txt:C 00051 Fructose and mannose metabolism [PATH:hca00051] 
b.txt:C 00052 Galactose metabolism [PATH:hca00052] 
b.txt:C 00520 Amino sugar and nucleotide sugar metabolism [PATH:hca00520]
c.txt:C 00052 Galactose metabolism [PATH:hca00052] 
c.txt:C 00500 Starch and sucrose metabolism [PATH:hca00500] 
c.txt:C 00520 Amino sugar and nucleotide sugar metabolism [PATH:hca00520]
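
Some of the input lines above end with a trailing space and some do not, which can make otherwise identical lines compare as different. As an alternative, here is a rough awk-only sketch that strips trailing whitespace, counts in how many files each line appears, and prints the lines found in exactly one file together with that file's name. The function name unique_lines and the tab-separated output format are my own choices for illustration, and it assumes "dissimilar" means "present in only one list":

```shell
#!/bin/sh
# unique_lines FILE...
# Prints "<file>TAB<line>" for every line that occurs in exactly one of the files.
# (Hypothetical helper name; not part of any standard tool.)
unique_lines() {
    awk '
        {
            line = $0
            sub(/[[:space:]]+$/, "", line)       # ignore trailing whitespace
            if (line == "") next                 # skip empty lines
            if (!((FILENAME, line) in seen)) {   # count each file once per line
                seen[FILENAME, line] = 1
                count[line]++
                owner[line] = FILENAME
            }
        }
        END {
            for (line in count)
                if (count[line] == 1)
                    printf "%s\t%s\n", owner[line], line
        }
    ' "$@" | sort
}
```

It would be called as unique_lines *.txt from the directory holding the lists. Because it compares whole lines, it does not depend on the position of the ID field, but it does hold all distinct lines in memory, which should be fine for 61 short lists.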