I have the Unicode character ᚠ (code point U+16A0) in a text file (the file is encoded as UTF-8).

When I do grep '\u16A0' test.txt I get no result. How do I grep that character?

Siva
Stupid
    @pLumo Yes, that worked. Thanks. What does the $ character _before_ the regular expression do? – Stupid Jun 06 '19 at 14:39

2 Answers

You can use the ANSI-C quoting ($'...') provided by your shell, which replaces backslash-escaped sequences such as \u16A0 with the characters they denote before the command runs. This works for any command, not just grep, in shells such as Bash and Zsh:

grep $'\u16A0' test.txt

For some more complex examples, you might refer to this related question and its answers.
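
In a shell without $'...' support (for example dash or another plain POSIX sh), a rough equivalent is to generate the character's bytes with printf. The following is only a minimal sketch, assuming the file is UTF-8, where U+16A0 is encoded as the bytes E1 9A A0 (octal 341 232 240); the temporary file and the variable names are illustrative and not part of the question.

```shell
# Create a sample file whose second line contains U+16A0.
# \341\232\240 are the UTF-8 bytes E1 9A A0 written as octal escapes,
# which POSIX printf understands in its format string.
tmp=$(mktemp)
printf 'plain line\nrune: \341\232\240\n' > "$tmp"

# Build the search pattern from the same three bytes.
pattern=$(printf '\341\232\240')

# -F treats the pattern as a fixed string, -c prints the number of matching lines.
count=$(grep -c -F -- "$pattern" "$tmp")
echo "matching lines: $count"

rm -f "$tmp"
```

With the sample file above this prints "matching lines: 1". Using -F sidesteps any regular expression interpretation of the pattern, which is enough here because only a literal character is being searched for.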

Flimm
pLumo
    Note that it's not ANSI C: the C language standard does not specify the functionality of shells, and this quoting was invented by David Korn for the Korn shell. https://unix.stackexchange.com/a/65819/5132 – JdeBP Jun 06 '19 at 18:40
    except this ruins the regex if there is a control character – törzsmókus Jul 07 '21 at 08:59

You could use ugrep as a drop-in replacement for grep to match the Unicode code point U+16A0:

ugrep '\x{16A0}' test.txt

It takes the same options as grep but offers vastly more features, such as:

ugrep searches UTF-8/16/32 input and other formats. Option -Q permits many other file formats to be searched, such as ISO-8859-1 to 16, EBCDIC, code pages 437, 850, 858, 1250 to 1258, MacRoman, and KOI8.

ugrep matches Unicode patterns by default (disabled with option -U). The regular expression pattern syntax is POSIX ERE compliant, extended with PCRE-like syntax. Option -P may also be used for Perl matching with Unicode patterns.

See ugrep on GitHub for details.

Dr. Alex RE