I'd like to simplify the example, but any further simplification removes the problem … I already stripped my script down to this:

#!/bin/sh
echo "$1" | sed 's/[^[:alnum:]]//g'

This does what I expect it to do when called directly, but not as part of a find -exec:

$ cp "Motörhead "{1,2}
$ ./foo.sh M*1
Motörhead1
$ ./foo.sh M*2
Motörhead2
$ find . -name "M*" -exec ./foo.sh {} \;
Motorhead1
Motörhead2

Everything is fine when the script is called directly, but as part of the -exec command the umlaut gets messed up, at least sometimes. The difference? Motörhead 1 was created by the Finder, while Motörhead 2 was created by the shell. It is as if find has trouble detecting the character encoding of file names created by the Finder.

  • If I replace the second script line with name="Motörhead", the problem is gone
  • Reproducible on APFS and exFAT file systems and on an AFP mount
  • I'm on macOS Catalina in Terminal.app with zsh
  • locale is de_DE.UTF-8
Philippos
  • Can confirm, yes, if a filename is created by the Finder, the `ö` comes out as `o` with the `find` command. – Kusalananda Mar 20 '20 at 19:37
  • Another interesting thing is that if you create `Motörhead 3` in the Finder, this will sort before `Motörhead 2` created in the shell in the output of `ls -l`. Piping the output of `printf '%s\n' M*` to `od -h` shows that the `ö` is the same in all three filenames (`b6c3`). – Kusalananda Mar 20 '20 at 19:47
  • Yes, some layer of the system seems to do ugly things (maybe to provide backwards compatibility to 1984), which works almost all the time … almost – Philippos Mar 20 '20 at 19:51
  • I have read that MacOS X stores characters such as `ö`, as composing characters. That is as two characters: `composing "`, followed by `o`. Probably this does not apply to the Unix layer. Can you do a test to see if this is true? – ctrl-alt-delor Mar 21 '20 at 09:46
  • Excellent point, @ctrl-alt-delor! As written in my comment below, doing `ls M* | tr o x` gives me something like `Mxtẍrhead`, so both `o` are replaced, while the ` ̈` is now attached to the `x`. It seems, they worked around the problem in many cases, but failed to cover all, resulting in strange inconsistencies. – Philippos Mar 21 '20 at 09:52
  • Here comes an old Apple link to both ways of unicode encoding and how to convert in C: http://mirror.informatimago.com/next/developer.apple.com/qa/qa2001/qa1235.html – Philippos Mar 21 '20 at 09:54

2 Answers

I have read that Mac OS X stores characters such as ö as combining sequences, that is, as two characters: o followed by a combining diaeresis (U+0308). Probably this does not apply to the Unix layer.

I can reproduce this on Debian GNU/Linux with echo Åström | sed 's/[^[:alnum:]]//g': go to https://en.wikipedia.org/wiki/Precomposed_character#Comparing_precomposed_and_decomposed_characters and paste the two alternate versions of Åström. The one using combining characters loses its accents.

It is as if sed treats the combining characters as ordinary non-alphanumeric characters.
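To make the difference concrete, here is a minimal sketch that builds both spellings of the name from explicit UTF-8 bytes, so it does not depend on how the text was typed or pasted. The variable names (`nfc`, `nfd`, `stripped`) are only illustrative, and the `tr` step deletes the two bytes of the combining mark by hand; it only imitates what the `sed` expression appears to do and is not a claim about how sed works internally.

```shell
#!/bin/sh
# Precomposed form: ö is the single code point U+00F6 (UTF-8 bytes c3 b6).
nfc=$(printf 'Mot\303\266rhead')
# Decomposed form: plain o followed by U+0308 COMBINING DIAERESIS (bytes cc 88).
nfd=$(printf 'Moto\314\210rhead')

# Both look the same in a terminal, but the bytes differ.
printf '%s' "$nfc" | od -An -tx1
printf '%s' "$nfd" | od -An -tx1

# Deleting only the combining mark's bytes leaves the ASCII base letter,
# which is the "Motorhead" seen in the find output.
stripped=$(printf '%s' "$nfd" | LC_ALL=C tr -d '\314\210')
printf '%s\n' "$stripped"
```

The decomposed name is one byte longer than the precomposed one, which is a quick way to tell the two apart with `wc -c`.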

A workaround

As a workaround, pipe the file names through

iconv -f utf-8-mac -t utf-8
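Applied to the script from the question, this could look roughly like the sketch below. The helper name `to_nfc` is hypothetical. As far as I know the `utf-8-mac` encoding is only offered by Apple's iconv, so the helper first probes for it and passes the input through unchanged where it is missing; that fallback is my assumption, not part of the original workaround.

```shell
#!/bin/sh
# to_nfc (hypothetical helper): convert decomposed names to precomposed UTF-8
# when this iconv knows utf-8-mac, otherwise pass the input through unchanged.
to_nfc() {
    if printf '' | iconv -f utf-8-mac -t utf-8 >/dev/null 2>&1; then
        iconv -f utf-8-mac -t utf-8
    else
        cat
    fi
}

# Same filter as in the question, with the normalization step in front of sed.
printf '%s\n' "$1" | to_nfc | sed 's/[^[:alnum:]]//g'
```

It would be called exactly like the original script, for example from `find . -name "M*" -exec ./foo.sh {} \;`.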

Text examined with od

Done on Debian GNU/Linux, using the Konsole terminal and the bash shell under plasma-desktop, with the text pasted from the Chrome browser.

#↳ echo  Åström composing | od -tax1
0000000   A   L  nl   s   t   r   o   L  bs   m  sp   c   o   m   p   o
         41  cc  8a  73  74  72  6f  cc  88  6d  20  63  6f  6d  70  6f
0000020   s   i   n   g  nl
         73  69  6e  67  0a
0000025

#↳ echo  Åström composed | od -tax1
0000000   C enq   s   t   r   C   6   m  sp   c   o   m   p   o   s   e
         c3  85  73  74  72  c3  b6  6d  20  63  6f  6d  70  6f  73  65
0000020   d  nl
         64  0a
0000022
ctrl-alt-delor
  • How come feeding the filenames into `od` shows the exact same bytes for all `ö` characters? ([see comment](https://unix.stackexchange.com/questions/573982/file-name-character-encoding-gets-confused-when-called-with-find-exec#comment1068550_573982)) – Kusalananda Mar 21 '20 at 10:02
  • Thanks to everyone who contributed to finding the cause of the phenomenon! – Philippos Mar 21 '20 at 10:09

It is likely the fault of sed.

However, I cannot reproduce your results on macOS 10.13.6, where the ö is always translated to a plain o; this happens with /usr/bin/sed as well as GNU sed.

On a newer FreeBSD machine the ö is preserved in all cases provided the locale ends in .UTF-8. I seem to recall there were bugs in FreeBSD's initial attempts to internationalize Sed. macOS typically uses older versions of BSD code and probably has not picked up the recent fixes.

On NetBSD with a more traditional BSD sed, the ö is removed as if it were not an alphanumeric character.

Greg A. Woods
  • How does `sed` know how it is called? I think it just makes it visible. If I use `| tr o x` instead, I get `Mxtẍrhead`. Thank you anyhow for the comments. – Philippos Mar 21 '20 at 09:41
  • It doesn't, at least not in your example. As I said, I cannot reproduce your results on macOS 10.13.6. As the answer by @ctrl-alt-delor suggests, your example is not an accurate description and reproduction of your problem. The bugs I suggest in `sed` (and/or the regex code it uses) are the root problem, while differences between what you did and what you said you did account for the odd behaviour you saw (e.g. I don't think you used `cp`, but `touch` would work in its place in your example). – Greg A. Woods Mar 21 '20 at 17:11