How to count unique occurrences of lines in Linux
2.3 years ago
Alex S ▴ 20

I have a file that looks like this:

C.Chr1:75500000-95000000:1029180-1029225
C.Chr1:75500000-95000000:1033800-1033847
C.Chr1:75500000-95000000:1035240-1035285
C.Chr1:75500000-95000000:1035460-1035505
C.Chr2:584000000-610000000:17911000-17911047
C.Chr2:584000000-610000000:17911000-17911047
C.Chr2:584000000-610000000:17911000-17911047
C.Chr3:30000000-130000000:21437320-21437367
C.Chr3:30000000-130000000:21437380-21437425
C.Chr3:30000000-130000000:21437700-21437747
C.Chr3:30000000-130000000:21438080-21438127

I need to count how many lines are unique, not considering the repeated lines.

I've tried uniq -c | sort -bgr, but the number of lines is much smaller than expected, and I think it may be a problem with the uniq command.

Does anyone know another command or function that would help?

uniq Linux ubuntu • 5.4k views
2.3 years ago
sort <file> | uniq -u | wc -l 

(Nearly) always pass sorted input to uniq, then use uniq -u (to report the unique lines), then pipe to wc -l for the count.

(Keep in mind this counts the lines that occur only once in your original file, NOT the number of lines after the file has been made non-redundant.)
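
To make the two counts concrete, here is a minimal sketch run on a shortened copy of the data from the question. The temporary file and the variable names are placeholders chosen for illustration, and it assumes a POSIX shell with mktemp available.

```shell
#!/bin/sh
# Shortened copy of the question's data, written to a throwaway temp file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
C.Chr1:75500000-95000000:1029180-1029225
C.Chr1:75500000-95000000:1033800-1033847
C.Chr2:584000000-610000000:17911000-17911047
C.Chr2:584000000-610000000:17911000-17911047
C.Chr2:584000000-610000000:17911000-17911047
C.Chr3:30000000-130000000:21437320-21437367
EOF

# Lines that occur exactly once: the repeated Chr2 line is dropped entirely.
once=$(sort "$tmp" | uniq -u | wc -l)

# Distinct lines: the repeated Chr2 line is counted a single time.
distinct=$(sort -u "$tmp" | wc -l)

echo "occur exactly once: $once"
echo "distinct lines:     $distinct"
rm -f "$tmp"
```

On this sample the first count is 3 and the second is 4, so which pipeline is right depends on whether a repeated line should contribute zero or one to the total.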


I like sort -u followed by uniq. (Had a situation recently where uniq did not work on its own, it is probably redundant here).


"I like sort -u followed by uniq."

sort -u does not need to be followed by uniq, as it already reduces the file to its unique subset.

sort -u <file> | wc -l

sort -u will result in a non-redundant subset, not a unique one. For anything unique you will need to use uniq (with the -u option).

Yes, it's a bit of semantics, but it is crucial in certain circumstances.


I am trying to understand whether this is a distinction without a difference or something that can matter in practice. What would be an example of a multi-line file for which sort -u <file> | wc -l and sort <file> | uniq -u | wc -l give different output?

In man uniq it says:

Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'.
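
The adjacency note can be checked with a small sketch. The three-line input below is made up for illustration: "a" appears twice, but never on neighbouring lines.

```shell
#!/bin/sh
# Hypothetical input: the two "a" lines are separated by "b".
unsorted=$(printf 'a\nb\na\n' | uniq | wc -l)

# Sorting first makes the duplicates adjacent, so uniq can collapse them.
sorted=$(printf 'a\nb\na\n' | sort | uniq | wc -l)

echo "uniq alone:  $unsorted lines"
echo "sort | uniq: $sorted lines"
```

Without sorting, uniq leaves all 3 lines in place; after sorting, 2 remain.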


well,

uniq -u prints only the unique lines in the file (== those that are present exactly once); it is the opposite of uniq -d (== print only the lines that are repeated in the input file)

sort -u makes the file non-redundant (== one representative of each repeated line is kept)

Of course, this all only applies when files are correctly sorted (though running uniq on unsorted files is sometimes pretty useful to get a desired result).

sort <file> | uniq will give the exact same output as sort -u <file> (and the same as sort -u <file> | uniq -u for that matter, but that's just a waste of option usage :) )
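
As a rough check of the equivalences described here, the sketch below runs each pipeline on a made-up five-line input of plain ASCII letters. The variable names are illustrative only.

```shell
#!/bin/sh
# Hypothetical input: "b" occurs three times, "a" and "c" once each.
input=$(printf 'b\na\nb\nc\nb\n')

# Three ways of producing the non-redundant set.
v1=$(printf '%s\n' "$input" | sort | uniq)
v2=$(printf '%s\n' "$input" | sort -u)
v3=$(printf '%s\n' "$input" | sort -u | uniq -u)

if [ "$v1" = "$v2" ] && [ "$v2" = "$v3" ]; then
    echo "all three pipelines agree"
fi

# uniq -u and uniq -d split the sorted input into complementary groups.
only_once=$(printf '%s\n' "$input" | sort | uniq -u)   # lines seen exactly once
repeated=$(printf '%s\n' "$input" | sort | uniq -d)    # one copy of each repeated line

echo "once:     $(echo $only_once)"
echo "repeated: $repeated"
```

Here the non-redundant set is a, b, c, while uniq -u keeps only a and c and uniq -d keeps only b.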


It works!! Thanks a lot.
