Is there an alternative to sed that supports unicode?

Question

For example:

sed 's/\u9991//g' file1

Right now, I have to run hexdump to get the hex bytes and put them into sed, as follows:

$ echo -ne '\u9991' | hexdump -C
00000000  e9 a6 91                                          |...|
00000003

And then:

$ sed 's/\xe9\xa6\x91//g' file1
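
The manual hexdump step above can be scripted. The following is only a minimal sketch and not part of the original question: `to_hex_escapes` is a hypothetical helper name, the input is assumed to be UTF-8, and the final command assumes a sed that understands `\xHH` escapes (GNU sed does; other implementations may not).

```shell
# Hypothetical helper: print a \xHH escape for every byte of the argument.
to_hex_escapes() {
    # od -v prints every byte as a hex pair, tr strips the spacing,
    # and sed prefixes each pair with \x.
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}

pattern=$(to_hex_escapes '馑')
printf '%s\n' "$pattern"

# Assumption: GNU sed, which accepts \xHH inside the pattern.
printf '饥馑荐臻\n' | sed "s/$pattern//g"
```

For the character from the question, `pattern` should hold the same `\xe9\xa6\x91` sequence that was typed by hand above.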

score 35 · Answer 1 · edited Oct 17 '16 at 16:52

Just use that syntax:

sed 's/馑//g' file1

Or in the escaped form:

sed "s/$(echo -ne '\u9991')//g" file1

(Note that older versions of Bash and some shells do not understand echo -e '\u9991', so check first.)
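
One way to do that check is sketched below. It swaps `printf` in for `echo -ne` (a substitution made here, not something the answer prescribes), and `u_escape_works` is a hypothetical name. If the `\u` escape is not expanded to the expected UTF-8 bytes, the sketch falls back to octal escapes, which POSIX `printf` understands.

```shell
# Succeeds only if this shell's printf expands \u9991 to the bytes e9 a6 91
# (written here as the octal escapes 351 246 221).
u_escape_works() {
    [ "$(printf '\u9991' 2>/dev/null)" = "$(printf '\351\246\221')" ]
}

if u_escape_works; then
    jin=$(printf '\u9991')
else
    jin=$(printf '\351\246\221')   # portable fallback, same three bytes
fi

printf '饥馑荐臻\n' | sed "s/$jin//g"
```

Either branch leaves the same three bytes in `jin`, so the sed command is the same in both cases.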

edited Oct 17 '16 at 16:52

Flimm


answered Apr 17 '15 at 08:46

chaos


Does sed count 馑 as one character or 3? That is, does `echo 馑 | sed s/...//` print anything? – user253751 Apr 17 '15 at 11:22
@immibis Since `sed` has the g modifier, it replaces all occurrences, even when they follow each other. Also, sed should count it as one character, see: `echo -ne "馑" | wc -m` gives `1`. If you count the bytes (`wc -c`) it would return `3`. Did I understand your question correctly? – chaos Apr 17 '15 at 11:28
I meant: does `.` mean "one character" or "one byte"? – user253751 Apr 17 '15 at 11:30
@immibis It matches one character, hence `echo 馑 | sed s/...//` gives me `馑` (nothing is replaced) – chaos Apr 17 '15 at 11:33
@chaos: I'm not getting the character back. Maybe the behaviour depends on locale? – choroba Apr 17 '15 at 12:08
@choroba You need UTF-8 encoding to have all Chinese characters displayed correctly. What does your `locale` say? – chaos Apr 17 '15 at 12:23
@chaos: It works under `en_US.UTF-8`, but doesn't under `C`. – choroba Apr 17 '15 at 12:28
@user253751 if you're worried about character counts, try Raku (formerly known as Perl_6): `echo '馑' | raku -ne '.chars.say;'` returns `1`. See: https://docs.raku.org/language/unicode – jubilatious1 Jul 12 '21 at 01:07
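
The locale dependence discussed in these comments can be reproduced with a sketch like the one below. The variable names are made up for illustration, and the lookup of a UTF-8 locale is an assumption about how `locale -a` lists names on a given system.

```shell
# In the C locale sed treats every byte as a character, so "..." matches
# all three bytes of 馑 and nothing is left on the line.
c_result=$(printf '馑\n' | LC_ALL=C sed 's/...//')
printf 'C locale: [%s]\n' "$c_result"

# Look for any installed UTF-8 locale (names vary between systems).
utf8_locale=$(locale -a 2>/dev/null | grep -i 'utf-\{0,1\}8' | head -n 1)
if [ -n "$utf8_locale" ]; then
    # Here 馑 is a single character, so "..." should not match and the
    # line should come back unchanged, as reported in the comments.
    utf8_result=$(printf '馑\n' | LC_ALL=$utf8_locale sed 's/...//')
    printf '%s: [%s]\n' "$utf8_locale" "$utf8_result"
fi
```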

score 17 · Answer 2 · answered Apr 17 '15 at 08:50

Perl can do that:

echo 汉典“馑”字的基本解释 | perl -CS -pe 's/\N{U+9991}/Jin/g'

-CS turns on UTF-8 for standard input, output and error.

answered Apr 17 '15 at 08:50

choroba


Perl can do almost anything..... – wobbily_col Apr 17 '15 at 10:49
@wobbily_col Maybe. But it's written in C. So if Perl can do almost anything, then C, as the foundation of Perl, can do anything. As it should be. – Pryftan Jan 16 '20 at 15:51
@wobbily_col In Raku (aka Perl6): `echo 汉典“馑”字的基本解释 | raku -pe 's:g/\x9991/Jin/'` #OUTPUT `汉典“Jin”字的基本解释`. – jubilatious1 Sep 29 '21 at 16:40

score 7 · Answer 3 · answered Apr 17 '15 at 12:54

A number of versions of sed support Unicode:

Heirloom sed, which is based on "original Unix material".
GNU sed, which is its own codebase.
Plan 9 sed, which has been ported to Unix-like operating systems.

I couldn't find information on BSD sed, which I thought was strange, but I think the odds are good that it supports Unicode too. Unfortunately, there is no standard way to tell sed which encoding to use, so each one does this in its own way.

answered Apr 17 '15 at 12:54

The Spooniest


Do they support UTF-16 with and without BOM? – Bon Ami Apr 17 '15 at 17:12
UTF-16 is pretty unusable in Unix-based OSes. It's also an abomination that should have never seen the light of day. – Brian Bi Apr 17 '15 at 19:11
Whether or not they support UTF-16 depends on the implementation, and I'm afraid I don't have that data. I doubt that Plan 9 sed does (the original OS is UTF-8 everywhere), but I can't be sure, and even if it doesn't, the others might. – The Spooniest Apr 17 '15 at 19:30

score 7 · Answer 4 · answered Oct 02 '19 at 06:17

With recent versions of BASH, just omit the quotes around the sed expression and you can use BASH's escaped strings. Spaces within the sed expression or parts of the sed expression that might be interpreted by BASH as wildcards can be individually quoted.

$ echo "饥馑荐臻" | sed s/$'\u9991'//g
饥荐臻

answered Oct 02 '19 at 06:17

Dave Rove


This should be the new accepted answer, simple and clean! – Allen Wang Nov 06 '19 at 22:28
@AllenWang For reference, the `$'...'` type of quotes comes from ksh93 in 1993 while the `\uxxxx` within them comes from zsh in 2003 (inspired by GNU `printf`). It was added to bash in 4.2 in 2010. So unless you're on macOS, which still comes with 3.2, that answer would also have been valid in 2015 when the question was asked. – Stéphane Chazelas Jun 22 '21 at 05:54

score 5 · Answer 5 · answered Apr 17 '18 at 18:21

This works for me:

$ vim -nEs +'%s/\%u9991//g' +wq file1

It’s a drop more verbose than I’d like; here’s a full explanation:

-n disable vim swap file
-E Ex improved mode
-s silent mode
+'%s/\%u9991//g' execute the substitution command
+wq save and exit

answered Apr 17 '18 at 18:21

Aryeh Leib Taurog


I suppose this modifies `file1` *in-place*, is that correct? – gerrit Jan 10 '19 at 10:32
@gerrit that’s correct, and thanks for pointing it out. – Aryeh Leib Taurog Jan 10 '19 at 19:21

score 1 · Answer 6 · answered Apr 17 '15 at 10:16

Works for me with GNU sed (version 4.2.1):

$ echo -ne $'\u9991' | sed 's/\xe9\xa6\x91//g' | hexdump -C
$ echo -ne $'\u9991' | hexdump -C
00000000  e9 a6 91

(As another replacement for sed you could also use GNU awk, but it doesn't seem necessary.)
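
A minimal sketch of the awk route mentioned above. It uses the literal character rather than an escape, which assumes the script text and the input are both UTF-8, and it sticks to plain `gsub` rather than any GNU-only feature. The variable name is made up for illustration.

```shell
# gsub removes every occurrence of 馑 from the line before it is printed.
awk_result=$(printf '饥馑荐臻\n' | awk '{ gsub(/馑/, ""); print }')
printf '%s\n' "$awk_result"
```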

answered Apr 17 '15 at 10:16

Janis


score 0 · Answer 7 · answered Jun 21 '21 at 21:54

Using Raku (formerly known as Perl_6)

~$ echo 汉典“馑”字的基本解释 | raku -pe 's:g/\x9991/Jin/;'
汉典“Jin”字的基本解释
~$ echo "饥馑荐臻" | raku -pe s:g/'\x9991'//;
饥荐臻

~$ raku -e 'print "e", "e\x301", "\x000e9";'
eéé
~$ raku -e 'say "e\x301" eq "\x000e9";'
True
~$ echo "Stephane" | raku -pe 's/e/e\x301/;'
Stéphane
~$ echo "Stephane" | raku -pe 's/e/\x000e9/;'
Stéphane

[Rakudo 2020.10; code tested on GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)]

https://raku.org/
