Is this a bug in 'tr'?

Asked Mar 23 '20 at 14:02

Active Mar 23 '20 at 14:23


My terminal is set to UTF-8. If I type this:

echo -n 'ýá' | xxd

I can see this output:

00000000: c3bd c3a1

Which is fine. Now I would like to remove the 'ý' character from the string, so I use:

echo -n 'ýá' | tr -d 'ý' | xxd

But the result is:

00000000: a1

`tr` also removes the following c3 byte, but that byte is part of the 'á' character. Why does it work this way? Is this a bug, or do I need to set something?

edited Mar 23 '20 at 14:23

Jeff Schaller

asked Mar 23 '20 at 14:02

user401623

I think `tr` works at the byte level. Each of the UTF-8 characters you show is represented by two bytes, and one of those bytes (c3) is common to both. I don't think it is a bug, just the way this primitive tool works. (You can test with characters that are represented by single bytes and see what happens when some of them appear twice in the pattern.) – sudodus Mar 23 '20 at 14:16
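
The byte-level behaviour described in this comment can be reproduced without relying on the terminal's encoding. This is a minimal sketch, assuming a byte-oriented `tr` (forced here with `LC_ALL=C`) and using `od` in place of `xxd`. `sample` is a hypothetical helper name, and the octal escapes spell out the UTF-8 bytes of 'ý' (c3 bd) and 'á' (c3 a1):

```shell
# Hypothetical helper: emit 'ý' (c3 bd) followed by 'á' (c3 a1) as raw bytes.
# Octal escapes are used so the example does not depend on this file's encoding.
sample() {
    printf '\303\275\303\241'
}

# Show the original four bytes.
sample | od -An -tx1

# A byte-oriented tr sees 'ý' as the set {c3, bd} and deletes every
# occurrence of either byte, so the c3 that belongs to 'á' is removed too.
# LC_ALL=C is an assumption used here to force byte-wise handling.
sample | LC_ALL=C tr -d '\303\275' | od -An -tx1
```

The second pipeline should leave only the a1 byte, matching the output in the question.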
So `tr` is not prepared for UTF-8? – user3719454 Mar 23 '20 at 14:17

Related, if not a duplicate - [How to make tr aware of non-ascii(unicode) characters?](https://unix.stackexchange.com/a/228570/100397) and [tr analog for unicode characters?](https://unix.stackexchange.com/a/389641/100397) – roaima Mar 23 '20 at 14:34
Yes, now it's clear. – user3719454 Mar 23 '20 at 14:56
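
One possible workaround (an assumption here, not something stated in the thread) is to use a tool that matches the whole byte sequence of the character rather than a set of individual bytes. This is a minimal sketch, assuming `sed` is available and that the string to delete contains no regular-expression metacharacters. `delete_str` is a hypothetical helper name:

```shell
# Hypothetical helper: delete every occurrence of the literal string $1 from stdin.
# sed matches the sequence c3 bd as a unit, so a lone c3 byte is left alone.
# Assumes $1 contains no regex metacharacters and no '/'.
delete_str() {
    sed "s/$1//g"
}

# 'ý' as raw UTF-8 bytes (c3 bd), written with octal escapes.
y=$(printf '\303\275')

# Input is 'ý' followed by 'á' (c3 bd c3 a1); only the 'ý' bytes should be removed.
printf '\303\275\303\241' | delete_str "$y" | od -An -tx1
```

With this approach the remaining bytes should be c3 a1, that is, an intact 'á'.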

0 Answers