43

For example:

sed 's/\u9991//g' file1

Right now, I have to run hexdump to get the character's hex bytes and put them into sed, as follows:

$ echo -ne '\u9991' | hexdump -C
00000000  e9 a6 91                                          |...|
00000003

And then:

$ sed 's/\xe9\xa6\x91//g' file1
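The manual hexdump step can be scripted. The following is a minimal sketch, not part of the original question: to_sed_hex is a hypothetical helper name, and the \xHH escape it emits is a GNU sed extension, so other sed implementations may not accept it.

```shell
# Hypothetical helper: print the bytes of its argument as \xHH escapes,
# the form GNU sed accepts inside a regular expression.
to_sed_hex() {
    # od -An -tx1 prints each byte as two hex digits separated by spaces;
    # tr removes the separators, then sed prefixes every digit pair with \x
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}

# U+9991 written as raw UTF-8 bytes in octal, so no \u support is required
ch=$(printf '\351\246\221')

to_sed_hex "$ch"; echo    # prints \xe9\xa6\x91
```

With the helper defined, the substitution could be written as sed "s/$(to_sed_hex "$ch")//g" file1.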
chaos
A-letubby

7 Answers

35

Just use that syntax:

sed 's/馑//g' file1

Or in the escaped form:

sed "s/$(echo -ne '\u9991')//g" file1

(Note that older versions of Bash and some shells do not understand echo -e '\u9991', so check first.)
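One way to "check first" is to test whether the shell actually expands the escape. This is only a sketch: supports_u_escape is a hypothetical helper name, and it assumes that a shell without \u support leaves the escape as literal text beginning with a backslash.

```shell
# Hypothetical helper: succeed only if this shell's printf turns \u9991
# into something other than the literal escape text.
supports_u_escape() {
    case "$(printf '\u9991' 2>/dev/null)" in
        # Output that is empty or still starts with a backslash means the
        # escape was not expanded into a character
        '\'*|'') return 1 ;;
        *) return 0 ;;
    esac
}

if supports_u_escape; then
    echo "this shell expands \\u escapes"
else
    echo "no \\u support here; use the literal character or byte escapes"
fi
```

The check covers only the shell side; sed still has to run in a locale that matches the file's encoding.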

Flimm
chaos
  • Does sed count 馑 as one character or 3? That is, does `echo 馑 | sed s/...//` print anything? – user253751 Apr 17 '15 at 11:22
  • @immibis Since `sed` has the g modifier, it replaces all occurrences, even when they follow each other. Also, sed should count it as one character: `echo -ne "馑" | wc -m` gives `1`. If you count the bytes (`wc -c`), it returns `3`. Did I understand your question correctly? – chaos Apr 17 '15 at 11:28
  • I meant: does `.` mean "one character" or "one byte"? – user253751 Apr 17 '15 at 11:30
  • @immibis It matches one character, hence `echo 馑 | sed s/...//` gives me `馑` (nothing is replaced) – chaos Apr 17 '15 at 11:33
  • @chaos: I'm not getting the character back. Maybe the behaviour depends on locale? – choroba Apr 17 '15 at 12:08
  • @choroba You need UTF-8 encoding to have all Chinese characters displayed correctly. What does your `locale` say? – chaos Apr 17 '15 at 12:23
  • @chaos: It works under `en_US.UTF-8`, but doesn't under `C`. – choroba Apr 17 '15 at 12:28
  • @user253751 if you're worried about characters counts, try Raku (formerly known as Perl_6): `echo '馑' | raku -ne '.chars.say;'` returns `1`. See: https://docs.raku.org/language/unicode – jubilatious1 Jul 12 '21 at 01:07
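The locale dependence discussed in these comments can be reproduced with a short script. This is a sketch under assumptions: it relies on GNU sed behaviour, and en_US.UTF-8 is only an example locale name that may not be installed (check `locale -a`). If it is missing, sed silently falls back to the C locale.

```shell
# U+9991 as raw UTF-8 bytes (octal escapes, so no \u support is needed)
ch=$(printf '\351\246\221')

# Byte count of the character: three bytes in UTF-8
printf '%s' "$ch" | wc -c

# In the C locale "." matches a single byte, so "..." consumes all three
# bytes and only an empty line is left
printf '%s\n' "$ch" | LC_ALL=C sed 's/...//'

# In a UTF-8 locale "." matches a whole character, so "..." finds nothing
# to match and the line passes through unchanged.
# en_US.UTF-8 is an assumption; substitute a locale from `locale -a`.
printf '%s\n' "$ch" | LC_ALL=en_US.UTF-8 sed 's/...//'
```

Setting LC_ALL per command, as here, keeps the experiment from changing the locale of the rest of the session.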
17

Perl can do that:

echo 汉典“馑”字的基本解释 | perl -CS -pe 's/\N{U+9991}/Jin/g'

-CS turns on UTF-8 for standard input, output and error.

choroba
  • 45,735
  • 7
  • 84
  • 110
7

A number of versions of sed support Unicode:

  • Heirloom sed, which is based on "original Unix material".
  • GNU sed, which is its own codebase.
  • Plan 9 sed, which has been ported to Unix-like operating systems.

I couldn't find information on BSD sed, which I thought was strange, but I think the odds are good that it supports Unicode too. Unfortunately, there is no standard way to tell sed which encoding to use, so each implementation handles this in its own way.

The Spooniest
  • Do they support UTF-16 with and without BOM ? – Bon Ami Apr 17 '15 at 17:12
  • UTF-16 is pretty unusable in Unix-based OSes. It's also an abomination that should have never seen the light of day. – Brian Bi Apr 17 '15 at 19:11
  • Whether or not they support UTF-16 depends on the implementation, and I'm afraid I don't have that data. I doubt that Plan 9 sed does (the original OS is UTF-8 everywhere), but I can't be sure, and even if it doesn't, the others might. – The Spooniest Apr 17 '15 at 19:30
7

With recent versions of Bash, just omit the quotes around the sed expression and use Bash's $'...' escaped strings. Any spaces in the sed expression, and any parts that Bash might interpret as wildcards, can be quoted individually.

$ echo "饥馑荐臻" | sed s/$'\u9991'//g
饥荐臻
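The note about quoting parts individually can be illustrated with a replacement that contains spaces and a *. This is a sketch, assuming Bash 4.2 or later and a UTF-8 locale; the variable name ch is only for illustration.

```shell
# Needs Bash 4.2+ and a UTF-8 locale for \u to expand to the character
ch=$'\u9991'

# s/ and /g stay unquoted; "$ch" and ' * ' are quoted on their own, so the
# spaces and the * reach sed as written instead of being split or globbed.
# With both conditions above met, this prints 饥 * 荐臻
printf '%s\n' '饥馑荐臻' | sed s/"$ch"/' * '/g
```

Bash joins the adjacent quoted and unquoted pieces into a single argument, so sed still receives one expression.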
Dave Rove
  • This should be the new accepted answer, simple and clean! – Allen Wang Nov 06 '19 at 22:28
  • @AllenWang For reference, the `$'...'` type of quotes comes from ksh93 in 1993 while the `\uxxxx` within them comes from zsh in 2003 (inspired by GNU `printf`). Added to bash in 4.2 in 2010. So unless you're on macOS, which still comes with 3.2, that answer would also have been valid in 2015 when the question was asked. – Stéphane Chazelas Jun 22 '21 at 05:54
5

This works for me:

$ vim -nEs +'%s/\%u9991//g' +wq file1

It’s a drop more verbose than I’d like; here’s a full explanation:

  • -n: disable the Vim swap file
  • -E: Ex improved mode
  • -s: silent mode
  • +'%s/\%u9991//g': execute the substitution command
  • +wq: save and exit
1

Works for me with GNU sed (version 4.2.1):

$ echo -ne $'\u9991' | sed 's/\xe9\xa6\x91//g' | hexdump -C
$ echo -ne $'\u9991' | hexdump -C
00000000  e9 a6 91

(As another replacement for sed you could also use GNU awk, but it doesn't seem necessary.)

Janis
0

Using Raku (formerly known as Perl_6)

~$ echo 汉典“馑”字的基本解释 | raku -pe 's:g/\x9991/Jin/;'
汉典“Jin”字的基本解释
~$ echo "饥馑荐臻" | raku -pe s:g/'\x9991'//;
饥荐臻

~$ raku -e 'print "e", "e\x301", "\x000e9";'
eéé
~$ raku -e 'say "e\x301" eq "\x000e9";'
True
~$ echo "Stephane" | raku -pe 's/e/e\x301/;'
Stéphane
~$ echo "Stephane" | raku -pe 's/e/\x000e9/;'
Stéphane

[Rakudo 2020.10; code tested on GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)]

https://raku.org/

jubilatious1