Can not use `cut -c` (`--characters`) with UTF-8?

Question

The command cut has an option -c to work on characters, instead of bytes with the option -b. But that does not seem to work, in en_US.UTF-8 locale:

The second byte gives the second ASCII character (which is encoded just the same in UTF-8):

$ printf 'ABC' | cut -b 2          
B

but does not give the second of three Greek non-ASCII characters in a UTF-8 locale:

$ printf 'αβγ' | cut -b 2         
�

That's alright - it's the second byte.
So we look at the second character instead:

$ printf 'αβγ' | cut -c 2 
�

That looks broken.
With some experiments, it turns out that the range 3-4 shows the second character:

$ printf 'αβγ' | cut -c 3-4
β

But that's just the same as the bytes 3 to 4:

$ printf 'αβγ' | cut -b 3-4
β

So -c does nothing more than -b for UTF-8.

I suspected that the locale setup was not right for UTF-8, but wc, by comparison, works as expected.
It is often used to count bytes, with option -c (--bytes). (Note the confusing option names.)

$ printf 'αβγ' | wc -c
6

But it can also count characters with option -m (--chars), which just works:

$ printf 'αβγ' | wc -m
3

So my configuration seems to be ok - but something is special about cut.

Maybe it does not support UTF-8 at all? But it does seem to support multi-byte characters, otherwise it would not need to support -b and -c.

So, what's wrong? And why?

The locale setup looks right for utf8, as far as I can tell:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

The input, byte by byte:

$ printf 'αβγ' | hd 
00000000  ce b1 ce b2 ce b3                                 |......|
00000006

Interesting! It looks like `-c` is using the same code as `-b`. Did you have a look at the source code? Maybe you can find a hint what `-c` is actually meant for. — michas, Oct 23 '14 at 06:11

Michael Homer · Accepted Answer · 2014-10-23T06:28:27.637

You haven't said which cut you're using, but since you've mentioned the GNU long option --characters I'll assume it's that one. In that case, note this passage from info coreutils 'cut invocation':

‘-c character-list’
‘--characters=character-list’
Select for printing only the characters in positions listed in character-list. The same as -b for now, but internationalization will change that.

(emphasis added)

For the moment, GNU cut always works in terms of single-byte "characters", so the behaviour you see is expected.

Supporting both the -b and -c options is required by POSIX. They weren't added to GNU cut because it had multi-byte support and they worked properly, but to avoid giving errors on POSIX-compliant input. The same has been done in some other cut implementations, although not FreeBSD's and OS X's at least.

This is the historic behaviour of -c. -b was newly added to take over the byte role so that -c can work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU cut doesn't even implement the -n option yet, even though it is orthogonal and intended to help the transition. There are potential compatibility problems with old scripts, which may be a concern, although I don't know definitively what the reason is.
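
To see which behaviour a particular cut has, one rough check is to feed it a single two-byte UTF-8 character and count the bytes that cut -c 1 returns. This is only a sketch: the function name cut_c_width is made up for illustration, and the result also depends on the locale the command runs in.

```shell
# Hypothetical helper (not part of any tool): report how many bytes
# `cut -c 1` emits for the two-byte UTF-8 character α (bytes 0xCE 0xB1).
#   1 -> cut -c is working on bytes here
#   2 -> cut -c is working on multibyte characters here
# The character is written as octal escapes so the script itself stays ASCII.
# The newline is dropped before counting, and the final tr removes the
# padding that some wc implementations add.
cut_c_width() {
    printf '\316\261\n' | cut -c 1 | tr -d '\n' | wc -c | tr -d ' '
}

cut_c_width
```

A result of 1 matches the byte-oriented behaviour described above. A result of 2 would suggest a cut that treats -c as real characters in the current locale.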

good work. youll find the same kind of comments in GNU's `tr` docs as well. and even `tar` unless i misremember. i guess its a big project. — mikeserv, Oct 23 '14 at 08:42
Is there any workaround for the Unicode problem in `cut`? For example, where is it possible to download the sources for a patched `cut`? Or would it be easier to use another utility? (The `grep` solution below does not work smoothly with ranges, e.g. `5-8,44-49`.) — dma_k, Jan 31 '18 at 00:11
see this 2017 article, sub-titled *”Random notes and pointers regarding the on-going effort to add multibyte and unicode support in GNU Coreutils“*: https://crashcourse.housegordon.org/coreutils-multibyte-support.html — myrdd, Dec 12 '18 at 14:29
you can find some alternatives to `cut -c` here: https://superuser.com/questions/506164/using-grep-to-display-second-character-in-string — myrdd, Dec 12 '18 at 14:32

score 13 · Answer 2 · answered Mar 22 '19 at 14:13

colrm (part of util-linux, should be already installed on most distributions) seems to handle internationalization much better :

$ echo 'αβγ' | colrm 3
αβ
$ echo 'αβγ' | colrm 2
α

Beware of the numbering : colrm N will remove columns from N, printing characters up to N-1.
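
Because colrm removes columns rather than keeping them, a cut-like "keep columns FIRST to LAST" needs two passes: one to drop everything after LAST, and one to drop everything before FIRST. A minimal sketch, assuming the util-linux colrm with its documented `colrm [first [last]]` form; the function name keep_cols is invented for this example.

```shell
# Hypothetical helper: keep only columns FIRST..LAST of each input line.
# Assumes util-linux colrm, where `colrm A` removes column A onwards
# and `colrm A B` removes columns A through B.
keep_cols() {
    first=$1
    last=$2
    if [ "$first" -le 1 ]; then
        # Nothing to remove on the left, so one pass is enough.
        colrm "$((last + 1))"
    else
        # Pass 1 drops everything after LAST, pass 2 drops 1..FIRST-1.
        colrm "$((last + 1))" | colrm 1 "$((first - 1))"
    fi
}
```

In a UTF-8 locale, `echo 'αβγ' | keep_cols 2 2` should print β. Note that colrm counts columns, not characters, so characters that occupy more than one column may shift the positions.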

(credits)

Skippy le Grand Gourou

colrm doesn't seem to handle emojis well: `echo 'removethis' | colrm 2` returns nothing for me. – frabjous Jun 13 '22 at 14:43
@frabjous They seem to count for two characters, try `echo 'removethis' | colrm 3`. ;) – Skippy le Grand Gourou Jun 13 '22 at 15:48
@SkippyleGrandGourou no that's wrong. UTF-8, UTF-16 and UTF-32 are just different encodings of Unicode, and all can [represent characters up to U+10FFFF?](https://stackoverflow.com/q/52203351/995714). Characters outside the BMP are represented by 4 bytes in both UTF-8 and UTF-16 – phuclv Jul 11 '23 at 03:51
@phuclv Right, comment removed. Please keep yours as it’s informative (hopefully readers will understand it refers to a deleted comment and not to the answer…). – Skippy le Grand Gourou Jul 11 '23 at 13:30

Royce Williams · Answer 3 · 2023-07-10T15:13:27.553

Since many grep implementations are multibyte-aware, you can also use grep -o to simulate some uses of cut -c.

First two characters:

$ echo Τηεοδ29 | grep -o '^..'
Τη

Last three characters:

$ echo Τηεοδ29 | grep -o '...$'
δ29

Second character:

$ echo Τηεοδ29 | grep -o '^..' | grep -o '.$'
η

Adjust the number of periods, or use {x,y} syntax, to simulate cut ranges.
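
As a sketch of the {x,y} idea for a range in the middle of the line, the characters before the range can be skipped and up to a fixed number kept. This version assumes GNU grep built with PCRE support (the -P option, for \K) and a UTF-8 locale; the function name char_range is hypothetical.

```shell
# Hypothetical helper: print characters FIRST..LAST of each input line,
# roughly like `cut -c FIRST-LAST` would on a multibyte-aware cut.
# Assumes GNU grep with -P (PCRE); \K discards the skipped prefix
# from the reported match.
char_range() {
    first=$1
    last=$2
    skip=$((first - 1))          # characters to skip before the range
    len=$((last - first + 1))    # maximum number of characters to keep
    grep -oP "^.{${skip}}\\K.{1,${len}}"
}
```

In a UTF-8 locale, `echo Τηεοδ29 | char_range 2 4` should print ηεο. Unlike cut, lines that are too short to reach the range produce no output line at all, and lists of ranges such as 5-8,44-49 are not covered by this sketch.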

edited Jul 10 '23 at 15:13

answered Aug 20 '16 at 14:48

no need for such complex solutions to get the second character. `echo Τηεοδ29 | grep -Po '(?<=^.).'` or `echo Τηεοδ29 | grep -Po '^.\K.'` will suffice – phuclv Jul 11 '23 at 04:16

score 1 · Answer 4 · answered Jul 12 '23 at 18:03

Eight+ years later, I can't reproduce the OP's issue (MacOS 13.4 Ventura):

~$ printf 'ABC' | cut -b 2
B
~$ printf 'αβγ' | cut -b 2
�
~$ printf 'αβγ' | cut -c 2
β
~$ printf 'αβγ' | cut -c 3-4
γ
~$ printf 'αβγ' | cut -b 3-4
β
~$ printf 'αβγ' | wc -c
       6
~$ printf 'αβγ' | wc -m
       3

Above seems to be the answer the OP was hoping for? Note the line ending cut -c 3-4 actually returns γ% under zsh, indicating a partial line (more characters requested than could be returned).

The output of man cut doesn't give me a version other than "macOS 13.4, August 3, 2017". It cites IEEE Std 1003.2-1992 (“POSIX.2”), with an additional -w flag as an extension to the specification. "HISTORY: A cut command appeared in AT&T System III UNIX."
