Arising from this discussion:
When I run (in zsh 5.8 and bash 5.1.0)
var="ASCII"
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
the answer is simple: that's five characters, occupying five bytes.
Now, var=Müller yields
Müller has the length 6, and is 7 bytes long
Which suggests the ${#} operator counts codepoints, not bytes. POSIX is a bit unclear here: it says the operator counts "characters", which would be clearer if a character in POSIX C weren't normally an octet.
Anyways: nice! Kind of good, seeing that LANG=en_US.utf8.
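For a quick side-by-side of the two counts, here's a minimal sketch (assuming a UTF-8 locale; `wc -m` counts characters per the locale, `wc -c` counts bytes):

```shell
# Assumes LANG/LC_CTYPE is a UTF-8 locale; the character counts change otherwise.
var="Müller"
echo "${#var}"                  # 6: codepoints, as counted by the shell
printf '%s' "$var" | wc -m      # 6: characters, as counted per the locale
printf '%s' "$var" | wc -c      # 7: bytes (the ü is two bytes in UTF-8)
```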
Now,
var='🧜🏿‍♀️'
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
🧜🏿‍♀️ has the length 5, and is 17 bytes long
Soooo, we decompose "mermaid: dark skin tone" into the Unicode codepoints
- Merperson (U+1F9DC)
- Dark skin tone modifier (U+1F3FF)
- Zero-Width Joiner (U+200D)
- Female Sign (U+2640)
- Variation Selector-16, "present the previous character as emoji" (U+FE0F)
Fine, so we're really counting Unicode codepoints!
var="e\xcc\x81"
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
é has the length 9, and is 9 bytes long
(Of course, my console font decided that the ´ combines with the following space rather than the preceding e. The latter would be correct. But let's save my rage about that for some other time.)
Um, a slight "wat" is in order here.
> printf "e\xcc\x81"|wc -c
3
> printf "%s" "${var}" |wc -c
9
> echo -n ${var} |wc -c
3
> echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
é has the length 9, and is 9 bytes long
> printf "%s" "${var}" |xxd
00000000: 655c 7863 635c 7838 31                   e\xcc\x81
Here's where I give up.
echo $var, echo ${var}, and echo "${var}" all "correctly" emit three bytes. However, echo ${#var} tells me it's 9 characters.
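For what it's worth, put side by side, the difference seems to be when the \x escapes get interpreted (a sketch of my current understanding, not a definitive answer):

```shell
# Double quotes leave \x escapes alone: the variable holds 9 literal characters.
var="e\xcc\x81"
echo "${#var}"                  # 9: e \ x c c \ x 8 1
printf '%s' "$var" | wc -c      # 9 bytes, the same literal text

# $'...' interprets the escapes at assignment time instead:
var=$'e\xcc\x81'
printf '%s' "$var" | wc -c      # 3 bytes: e plus the combining acute (cc 81)
```

That would also explain the echo results above: zsh's builtin echo expands backslash escapes by default (like bash's echo -e), so it expands them at print time, while printf '%s' copies the stored bytes verbatim.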
Where is this documented/standardized, and what are the rules for all this?