Everywhere I see someone needing to get a sorted, unique list, they always pipe to sort | uniq. I've never seen any examples where someone uses sort -u instead. Why not? What's the difference, and why is it better to use uniq than the unique flag to sort?
http://aplawrence.com/Unixart/sort-vs-uniq.html – Lesmana May 16 '13 at 11:35
5 Answers
sort | uniq existed before sort -u, and is compatible with a wider range of systems, although almost all modern systems do support -u (it's POSIX). It's mostly a throwback to the days when sort -u didn't exist. People don't tend to change their methods if the way they know continues to work; just look at ifconfig vs. ip adoption.
The two were likely merged because removing duplicates within a file requires sorting (at least, in the standard case), and is an extremely common use case of sort. sort -u is also faster, because it can do both operations at the same time and needs no IPC (inter-process communication) between sort and uniq. Especially if the file is big, sort -u will likely use fewer intermediate files to sort the data.
On my system I consistently get results like this:
$ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
$ time sort -u /dev/shm/file >/dev/null
real 0m0.500s
user 0m0.767s
sys 0m0.167s
$ time sort /dev/shm/file | uniq >/dev/null
real 0m0.772s
user 0m1.137s
sys 0m0.273s
sort -u also doesn't mask the return code of sort, which may be important. In a pipeline the shell reports the exit status of the last command, here uniq. Modern shells have ways to get at the status of sort (for example, bash's PIPESTATUS array), but this wasn't always true.
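To make the return code point concrete, here is a minimal sketch. It assumes /nonexistent/input is a path that does not exist on your system (a made-up name for illustration), so sort fails in both cases. Each command runs in a fresh sh -c so that options of the calling shell, such as pipefail, don't change what is shown.

```shell
# Assumption: /nonexistent/input does not exist, so sort fails.

# Pipeline: the reported status is that of uniq, which succeeds on empty input.
pipeline_status=0
sh -c 'sort /nonexistent/input 2>/dev/null | uniq' || pipeline_status=$?

# Single command: the reported status is sort's own failure.
direct_status=0
sh -c 'sort -u /nonexistent/input 2>/dev/null' || direct_status=$?

echo "sort | uniq exit status: $pipeline_status"
echo "sort -u exit status:     $direct_status"
```

The first status should be 0 and the second non-zero, so a script that checks $? only notices the failure in the sort -u form.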
I tend to use `sort | uniq` because 9 times out of 10, I'm actually piping to `uniq -c`. – Plutor May 16 '13 at 14:16
Note that `sort -u` was part of 7th Edition UNIX, circa 1979. Versions of `sort` without support for `-u` are truly archaic, or were written without attention to the de facto standard before POSIX's de jure standard. See also Stack Overflow [Sort & uniq in Linux shell](http://stackoverflow.com/questions/3382936/sort-uniq-in-linux-shell) from 2010. – Jonathan Leffler Feb 18 '15 at 16:34
+1 because of `ip`. It's 2016 and this post is from 2013, but I only know about the `ip` command now. – dieend May 27 '16 at 02:22
+1 for "9 times out of 10, I'm actually piping to `uniq -c`" (and maybe piping once more to `sort -nr | head`). I was wondering what the equivalent of `sort | uniq` is in Vim when I found out that Vim has a `:sort u` command. And TIL `sort -u` exists as well. – Zhuoyun Wei Oct 13 '17 at 07:09
Note that there is a difference when using `sort -n | uniq` vs. `sort -n -u`. For example, lines that differ only in trailing and leading whitespace will be seen as duplicates by `sort -n -u` but not by the former! `echo -e 'test \n test' | sort -n -u` returns `test`, but `echo -e 'test \n test' | sort -n | uniq` returns both lines. – mxmlnkn Jan 10 '18 at 23:05
Another problem with `sort -n -u` becomes apparent with this: `echo -e '14a-foo\n14b-bar\n15' | sort -n -u` ... i.e. the `14b-bar` will be deleted! Not sure if this is a bug or not, though. This does not happen with `sort -n | uniq`. Imo you should never use `sort -n -u`, it only leads to trouble. – mxmlnkn Mar 19 '18 at 16:38
@stephanmg `$ pacman -Qo ifconfig` -> "/usr/bin/ifconfig is owned by net-tools 2.10-1"; `$ pacman -Qo ip` -> "/usr/bin/ip is owned by iproute2 5.10.0-2". – kelvin Feb 17 '21 at 01:39
With POSIX compliant sorts and uniqs (GNU uniq is currently not compliant in that regard), there's a difference in that sort uses the locale's collating algorithm to compare strings (will typically use strcoll() to compare strings) while uniq checks for byte-value identity (will typically use strcmp())¹.
That matters for at least two reasons.
In some locales, especially on GNU systems, there are different characters that sort the same. For instance, in the en_US.UTF-8 locale on a GNU system, all the ①②③④⑤⑥⑦⑧⑨⑩... characters² and many others sort the same because their sort order is not defined. The 0123456789 arabic digits sort the same as their Eastern Arabic Indic counterparts (٠١٢٣٤٥٦٧٨٩).
For sort -u, ① sorts the same as ② and 0123 the same as ٠١٢٣, so sort -u would retain only one of each, while for uniq (not GNU uniq, which uses strcoll() (except with -f)), ① is different from ② and 0123 different from ٠١٢٣, so uniq would consider all 4 unique.
strcoll() can only compare strings of valid characters (the behaviour is undefined as per POSIX when the input has sequences of bytes that don't form valid characters), while strcmp() doesn't care about characters since it only does byte-to-byte comparison. So that's another reason why sort -u may not give you all the unique lines if some of them don't form valid text. sort | uniq, while still unspecified on non-text input, in practice is more likely to give you unique lines for that reason.
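If you want sort -u to decide uniqueness by byte value regardless of how the locale collates, one common approach is to pin the locale to C for that command. A minimal sketch, assuming a sort implementation (such as GNU sort) that compares raw bytes in the C locale, and a script file saved as UTF-8:

```shell
# ① (U+2460) and ② (U+2461) are different byte sequences in UTF-8.
# With LC_ALL=C, sort compares bytes, so -u keeps both and drops only
# the true duplicate.
bytewise=$(printf '%s\n' '①' '②' '①' | LC_ALL=C sort -u)
printf '%s\n' "$bytewise"

# The pipeline form under the same locale, for comparison.
piped=$(printf '%s\n' '①' '②' '①' | LC_ALL=C sort | LC_ALL=C uniq)
printf '%s\n' "$piped"
```

Both forms should print the two distinct characters, one per line. The trade-off is that the output order is then byte order, not the order a user of that locale would expect.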
Besides those subtleties, one thing that hasn't been noted so far is that uniq compares whole lines lexically, while sort's -u compares based on the sort specification (keys and ordering options) given on the command line.
$ printf '%s\n' 'a b' 'a c' | sort -uk 1,1
a b
$ printf '%s\n' 'a b' 'a c' | sort -k 1,1 | uniq
a b
a c
$ printf '%s\n' 0 -0 +0 00 '' | sort -n | uniq
0
-0
+0
00
$ printf '%s\n' 0 -0 +0 00 '' | sort -nu
0
¹ Prior versions of the POSIX spec were causing confusion, however, by listing the LC_COLLATE variable as one affecting uniq. That was removed in the 2018 edition and the behaviour clarified following that discussion mentioned above. See the corresponding Austin group bug
² 2019 edit. Those have since been fixed, but over 95% of Unicode code points still have an undefined order as of version 2.30 of the GNU libc. You can test with instead for instance in newer versions
One difference is that uniq has a number of useful additional options, such as skipping fields for comparison and counting the number of repetitions of a value. sort's -u flag only implements the functionality of the unadorned uniq command.
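A few of those uniq options in action, as a minimal sketch. The word list is made-up sample data, and awk is only there to strip the padding that uniq -c puts in front of the counts:

```shell
# Hypothetical sample data: one word per line.
words=$(printf '%s\n' pear apple pear fig apple pear)

# Count occurrences, most frequent first.
counts=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn | awk '{print $1, $2}')

# Only the lines that occur exactly once.
singles=$(printf '%s\n' "$words" | sort | uniq -u)

# One copy of each line that occurs more than once.
repeated=$(printf '%s\n' "$words" | sort | uniq -d)

printf 'counts:\n%s\n' "$counts"
printf 'singles:\n%s\n' "$singles"
printf 'repeated:\n%s\n' "$repeated"
```

None of these three results can be produced by sort -u on its own, which is why the pipeline form stays in people's muscle memory.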
+0.49 for a useful answer, but I would phrase it something like "The output of `sort -u` can't be passed to `uniq` to use some of the latter's useful options, such as skipping fields for comparison and counting the number of repetitions." – l0b0 May 16 '13 at 14:10
+1 to offset the naysayers because "there's no way to do this directly from sort" _does_ answer the question... – Izkata May 16 '13 at 15:28
Landed here because I want `uniq -u` (_only_ unique rows) behaviour and I can't seem to get it from sort (also want GNU `uniq -w` and can't get that from BSD). So yes, this answer is important. – sh1 Mar 14 '22 at 17:59
I prefer to use sort | uniq because when I try to use the -u (eliminate duplicates) option to remove duplicates involving mixed case strings, it is not that easy to understand the result.
Note: before you can run the examples below, you need to simulate the standard C collating sequence by doing the following:
LC_ALL=C
export LC_ALL
For example, suppose I want to sort a file and remove duplicates while keeping the different cases of strings distinct.
$ cat short #file to sort
Pear
Pear
apple
pear
Apple
$ sort short #normal sort (in normal C collating sequence)
Apple #the lower case words are at the end
Pear
Pear
apple
pear
$ sort -f short #correctly sorts ignoring the C collating order
Apple #but duplicates are still there
apple
Pear
Pear
pear
$ sort -fu short #By adding the -u option to remove duplicates it is
apple #difficult to ascertain the logic that sort uses to remove
Pear #duplicates(i.e., why did it remove pear instead of Pear?)
This confusion is solved by not using the -u option to remove duplicates. Using uniq is more predictable. The command below first sorts case-insensitively and then passes the result to uniq, which removes only lines that are exactly identical.
$ sort -f short | uniq
Apple
apple
Pear
pear
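For what it's worth, the choice sort -fu makes can be probed by feeding it the same words in a different order. This is a small sketch, assuming GNU sort, whose documentation says -u outputs only the first of an equal run. POSIX does not say which line of a set of equal lines is kept, so other implementations may differ.

```shell
# Same two words, differing only in input order. The locale is pinned so
# the result does not depend on the caller's environment.
first_upper=$(printf '%s\n' Pear pear | LC_ALL=C sort -fu)
first_lower=$(printf '%s\n' pear Pear | LC_ALL=C sort -fu)

echo "Input Pear,pear keeps: $first_upper"
echo "Input pear,Pear keeps: $first_lower"
```

With GNU sort, each run should keep whichever spelling came first in the input, which would explain why Pear rather than pear survived in the example above.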
The `-u` option of `sort` outputs the **first** of an equal run (see man page). Thus `sort -fu` picks up the first occurrence of every case-insensitive unique line. The logic that `sort` uses to remove duplicates is predictable. – pallxk Oct 09 '15 at 15:33
Another difference I found out today: when sorting on a delimited column, sort -u applies the unique flag only to the column that you sort by.
$ cat input.csv
3,World,1
1,Hello,1
2,Hello,1
$ cat input.csv | sort -t',' -k2 -u
1,Hello,1
3,World,1
$ cat input.csv | sort -t',' -k2 | uniq
1,Hello,1
2,Hello,1
3,World,1
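One related detail: -k2 means "from field 2 to the end of the line", while -k2,2 means "field 2 only", and with -u that changes what counts as a duplicate. A minimal sketch with made-up data where the third column differs:

```shell
# Hypothetical data: field 2 matches, field 3 differs.
data=$(printf '%s\n' '1,Hello,1' '2,Hello,2')

# -k2: the key runs to end of line ("Hello,1" vs "Hello,2"),
# so the keys differ and both lines survive.
open_key=$(printf '%s\n' "$data" | sort -t, -k2 -u)

# -k2,2: the key is just "Hello" for both lines, so only one survives.
closed_key=$(printf '%s\n' "$data" | sort -t, -k2,2 -u)

printf 'with -k2:\n%s\n' "$open_key"
printf 'with -k2,2:\n%s\n' "$closed_key"
```

In the input.csv example above the third column happens to be 1 on every line, so -k2 and -k2,2 would give the same result there.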
This is mentioned in an answer from Stéphane Chazelas but I like your example so +1 – roaima Jan 06 '17 at 09:16
Thanks for pointing out @roaima, it wasn't very clear in that answer – Stefanos Chrs Jan 06 '17 at 09:19