Arising from this discussion:
When I run (in zsh 5.8 and bash 5.1.0)
var="ASCII"
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
the answer is simple: that's five characters, occupying five bytes.
Now, var=Müller yields
Müller has the length 6, and is 7 bytes long
Which suggests the ${#} operator counts codepoints, not bytes. POSIX is a bit unclear here: it says the operator counts "characters", which would be clearer if a character in POSIX C weren't normally an octet.
Anyways: nice! Kind of good, seeing that LANG=en_US.utf8.
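For a quick side-by-side of the two counts, here's a minimal sketch (assuming a UTF-8 locale; `wc -m` counts characters per the locale, `wc -c` counts bytes):

```shell
# Assumes LANG/LC_CTYPE is a UTF-8 locale; the character counts change otherwise.
var="Müller"
echo "${#var}"                  # 6: codepoints, as counted by the shell
printf '%s' "$var" | wc -m      # 6: characters, as counted per the locale
printf '%s' "$var" | wc -c      # 7: bytes (the ü is two bytes in UTF-8)
```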
Now,
var='🧜🏿‍♀️'
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
🧜🏿‍♀️ has the length 5, and is 17 bytes long
Soooo, we decompose "mermaid: dark skin tone" into the Unicode codepoints
- Merperson (U+1F9DC)
- Dark skin tone modifier (U+1F3FF)
- Zero-Width Joiner (U+200D)
- Female Sign (U+2640)
- Variation Selector-16, "present the previous character as emoji" (U+FE0F)
Fine, so we're really counting Unicode codepoints!
var="e\xcc\x81"
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
é has the length 9, and is 9 bytes long
(Of course, my console font decided that the ´ combines with the following space rather than the preceding e. The latter would be correct. But let's save my rage about that for some other time.)
Um, a slight "wat" is in order here.
> printf "e\xcc\x81"|wc -c
3
> printf "%s" "${var}" |wc -c
9
> echo -n ${var} |wc -c
3
> echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
é has the length 9, and is 9 bytes long
> printf "%s" "${var}" |xxd
00000000: 655c 7863 635c 7838 31                   e\xcc\x81
Here's where I give up.
echo $var, echo ${var}, and echo "${var}" all "correctly" emit three bytes. However, echo ${#var} tells me it's 9 characters.
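For what it's worth, put side by side, the difference seems to be when the \x escapes get interpreted (a sketch of my current understanding, not a definitive answer):

```shell
# Double quotes leave \x escapes alone: the variable holds 9 literal characters.
var="e\xcc\x81"
echo "${#var}"                  # 9: e \ x c c \ x 8 1
printf '%s' "$var" | wc -c      # 9 bytes, the same literal text

# $'...' interprets the escapes at assignment time instead:
var=$'e\xcc\x81'
printf '%s' "$var" | wc -c      # 3 bytes: e plus the combining acute (cc 81)
```

That would also explain the echo results above: zsh's builtin echo expands backslash escapes by default (like bash's echo -e), so it expands them at print time, while printf '%s' copies the stored bytes verbatim.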
Where is this documented/standardized, and what are the rules for all this?