Sort is sorting differently than I would expect. I have this file, call it text.txt:

a   1
A   1
a   11

(the gap between the columns is always a single \t)

I want to sort them alphabetically by the first column. However, when I do

sort -k 1 text.txt

all I get is the contents of text.txt, unsorted. If I use the deprecated + - notation instead, meaning

sort +0 -1 text.txt

it works as it should, meaning that I get this output:

a   1
a   11
A   1

This strange behaviour occurs only when I have lines that differ only by case. What am I doing wrong?

Karel Bílek
  • In the default locale, `A` would be before `a`. Most locales have a weird collation order; it's often best to keep `LC_COLLATE=C`. On this issue, see [Does (should) LC_COLLATE affect character ranges?](http://unix.stackexchange.com/q/15980) [Why are capital letters included in a range of lower-case letters in an awk regex?](http://unix.stackexchange.com/q/19322) – Gilles 'SO- stop being evil' Sep 01 '11 at 22:45
  • This issue, however, stands with LC_COLLATE=C (I think, I don't have a shell here now), the problem was really in the columns... however.... it probably should not be. I don't know. I will test it tomorrow. Thanks for the tips. – Karel Bílek Sep 03 '11 at 02:57
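
As a quick check of the locale suggestion in the comments above, the sample can be re-sorted with byte-order collation forced. This is only a minimal sketch, assuming a POSIX shell; it uses `LC_ALL=C` rather than `LC_COLLATE=C` because `LC_ALL`, if it is set in the environment, overrides `LC_COLLATE`. The variable names are illustrative.

```shell
# Sample data from the question; columns are separated by a single tab.
sample=$(printf 'a\t1\nA\t1\na\t11')

# Force byte-order collation for this one command only.
c_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort -k1,1)

printf '%s\n' "$c_sorted"
# In the C locale the uppercase line comes first:
# A   1
# a   1
# a   11
```

If the output changes when `LC_ALL=C` is dropped, the difference comes from the collation rules of the active locale.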

2 Answers

You have to specify the end column, too:

$ sort -k1,1 text.txt
a       1
a       11
A       1

To quote the GNU sort man page:

   -k, --key=POS1[,POS2]
          start a key at POS1 (origin 1), end it at POS2 (default  end  of
          line)
maxschlepzig
  • Oh. I guess I should have RTFM more carefully :) Thanks. – Karel Bílek Aug 31 '11 at 20:38
  • It is really strange that this was needed - the `POS2` is evidently **optional**. Maybe a bug in `sort`? – rozcietrzewiacz Sep 01 '11 at 08:15
  • @rozcietrzewiacz, yes, it is optional. No, it is not a bug. POS2 is optional, but when you don't specify it then the default is used, i.e. the key then contains all columns up to the next newline. – maxschlepzig Sep 01 '11 at 17:20
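
The effect of leaving out POS2 can be made visible independently of the locale. The following is a rough sketch under a few assumptions: the two-line sample is made up for illustration (it is not the file from the question), `LC_ALL=C` is forced so that collation plays no role, and `-s` (stable sort, available in GNU and BSD `sort`) is used to switch off the final whole-line comparison that would otherwise break ties.

```shell
# Hypothetical sample: same first column, different second column (tab-separated).
sample=$(printf 'a\tz\na\tb')

# POS2 omitted: the key runs to the end of the line,
# so the second column takes part in the comparison.
whole_line=$(printf '%s\n' "$sample" | LC_ALL=C sort -s -k1)

# POS2 given: the key is the first field only. Both keys are "a",
# and -s keeps tied lines in their input order.
first_field=$(printf '%s\n' "$sample" | LC_ALL=C sort -s -k1,1)

printf 'sort -s -k1:\n%s\n' "$whole_line"
printf 'sort -s -k1,1:\n%s\n' "$first_field"
```

With `-k1` the line ending in `b` moves to the front; with `-k1,1` the input order is preserved, which shows that only the first field was compared.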
You most certainly hit upon a bug in sort! If you had no spaces in the file, there would be no way to sort it properly:

$ cat aaa
a1
A1
a11

$ sort aaa
a1
A1
a11

$ sort -k1,1 aaa
a1
A1
a11

Even more visible with the following:

$ cat bbb
A B b 0
a B b 0
A b b 1

$ sort bbb
a B b 0
A B b 0
A b b 1

$ sort -k1,2 bbb
a B b 0
A b b 1
A B b 0
rozcietrzewiacz
  • That behaviour is correct, though. I think. – Karel Bílek Sep 01 '11 at 20:35
  • Not a bug, but a collation locale issue: Karel clearly has an `LC_COLLATE` setting that is not `POSIX`, since `A` isn't before `a`. On this issue, see [Does (should) LC_COLLATE affect character ranges?](http://unix.stackexchange.com/q/15980) [Why are capital letters included in a range of lower-case letters in an awk regex?](http://unix.stackexchange.com/q/19322) – Gilles 'SO- stop being evil' Sep 01 '11 at 22:43
  • @Gilles The problem isn't just that `A` is before or after `a` - it is that in *some cases* it is treated as before, in others as after. – rozcietrzewiacz Sep 02 '11 at 06:17
  • @rozcietrzewiacz Exactly, this is consistent with a locale where `a` and `A` are equivalent. – Gilles 'SO- stop being evil' Sep 02 '11 at 06:25
  • @Gilles Ok, so I understand now that the problem is with locale, not `sort` itself - but two letters being defined as *equivalent*? This is absurd! Who came up with this, and why? – rozcietrzewiacz Sep 02 '11 at 07:04
  • @rozcietrzewiacz Think case insensitive sorts. And `à` = `à` (U+00E0 LATIN SMALL LETTER A WITH GRAVE = U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT). And in some languages (e.g. French), `là` is sorted after `la` but before `label`. `sort` doesn't get everything right because it performs a lexicographic sort (which, despite the etymology, is not what dictionaries do), but it tries to go a little beyond the basics. The results aren't so great (I keep `LC_COLLATE=C` myself). – Gilles 'SO- stop being evil' Sep 02 '11 at 07:12
  • @Gilles Case insensitive sorts would be nice as an option in some cases - but not as the default for locales like `en_US.UTF-8`! Furthermore, the case is even more complicated than this. The simple test suggested [in this answer](http://unix.stackexchange.com/questions/19322/why-are-capital-letters-included-in-a-range-of-lower-case-letters-in-an-awk-regex/19327#19327) prints, depending on locale, either `A > a` or `a < A` but never `A = a`. Yet, the `sort` aliasing... – rozcietrzewiacz Sep 02 '11 at 07:22
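
The "equivalent letters" effect discussed in these comments can be imitated without relying on any particular locale being installed. The sketch below is only an analogy: it uses the `-f` option (fold lowercase to uppercase) in the C locale, which is an assumption made for illustration and does not reproduce the actual collation tables of a locale such as `en_US.UTF-8`. The lines are those of the `aaa` file above, deliberately given in a different input order.

```shell
# Lines of the "aaa" file from the answer, in a shuffled input order.
sample=$(printf 'a11\na1\nA1')

# Plain byte order: every uppercase letter sorts before every lowercase one.
byte_order=$(printf '%s\n' "$sample" | LC_ALL=C sort)

# Case folded (-f): "a1" and "A1" compare as equal, and -s keeps
# tied lines in their input order instead of breaking the tie by bytes.
folded=$(printf '%s\n' "$sample" | LC_ALL=C sort -s -f)

printf 'byte order:\n%s\n' "$byte_order"
printf 'case folded:\n%s\n' "$folded"
```

In the folded run `A1` ends up between `a1` and `a11`, so `A` appears neither strictly before nor strictly after `a`, which resembles the ordering reported in the answer.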