38

In Unicode, some character combinations have more than one representation.

For example, the character ä can be represented as

  • "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
  • "ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

According to the Unicode standard, the two representations are equivalent but in different "normalization forms", see UAX #15: Unicode Normalization Forms.
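The equivalence described above can be checked with any language that exposes Unicode normalization; here is a minimal sketch using Python's standard-library `unicodedata` module:

```python
import unicodedata

composed = "\u00e4"     # "ä" as the single codepoint U+00E4 (NFC form)
decomposed = "a\u0308"  # "a" followed by U+0308 COMBINING DIAERESIS (NFD form)

# The raw strings are different...
print(composed == decomposed)  # False

# ...but after normalizing to a common form they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True

# The UTF-8 byte sequences mentioned above:
print(composed.encode("utf-8").hex())    # c3a4
print(decomposed.encode("utf-8").hex())  # 61cc88
```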

The unix toolbox has all kinds of text transformation tools, sed, tr, iconv, Perl come to mind. How can I do quick and easy NF conversion on the command-line?

glts
  • Looks like there is a "Unicode::Normalize" module for Perl which should do this kind of thing: http://search.cpan.org/~sadahiro/Unicode-Normalize-1.16/Normalize.pm – goldilocks Sep 10 '13 at 19:36
  • @goldilocks if it had a CLI… I mean, I do `perl -MUnicode::Normalize -e 'print NFC(`… er what comes here now… – mirabilos Nov 15 '16 at 14:33

7 Answers

39

You can use the uconv utility from ICU (International Components for Unicode). Normalization is achieved through transliteration (`-x`).

$ uconv -x any-nfd <<<ä | hd
00000000  61 cc 88 0a                                       |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000  c3 a4 0a                                          |...|
00000003

On Debian, Ubuntu and other derivatives, uconv is in the libicu-dev package (on newer releases, in icu-devtools). On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.

niels
Gilles 'SO- stop being evil'
  • This works, thanks. You have to install a 30M dev library alongside it though. What's worse, I haven't been able to find proper documentation for uconv itself: where did you find `any-nfd`? It looks like development of this tool has been abandoned; the last update was in 2005. – glts Sep 14 '13 at 16:07
  • @glts I found `any-nfd` by browsing through the list displayed by `uconv -L`. – Gilles 'SO- stop being evil' Sep 14 '13 at 23:38
  • On Ubuntu I installed icu-devtools (`sudo apt install icu-devtools`) to run `uconv -x any-nfc`, but it **did not solve the simplest problem**: a `bugText.txt` file containing *"Iglésias, Bad-á, Good-á"* converted by `uconv -x any-nfc bugText.txt > goodText.txt` stays the same. – Peter Krauss Nov 16 '18 at 11:40
  • @PeterKrauss I did that very test (Ubuntu 22.04.1): `hd file` before uconv shows the decomposed chars, `hd` after shows that it's been fixed... Worked as intended. – Déjà vu Feb 16 '23 at 03:43
9

Python has the `unicodedata` module in its standard library, which allows converting between Unicode normalization forms through the `unicodedata.normalize()` function:

import unicodedata
 
s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2) 
print(ascii(t1)) 
 
t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t3))

Running with Python 3.x:

$ python3 test.py
True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'

Python isn't well suited for shell one-liners, but it can be done if you don't want to create an external script:

$ python3 -c $'import unicodedata\nprint(unicodedata.normalize("NFC", "ääääää"))'
ääääää

For Python 2.x you have to add an encoding line (`# -*- coding: utf-8 -*-`) and mark strings as Unicode with the `u` prefix:

$ python -c $'# -*- coding: utf-8 -*-\nimport unicodedata\nprint(unicodedata.normalize("NFC", u"ääääää"))'
ääääää
Pablo A
Nykakin
4

Check the bytes with the hexdump tool:

echo -e "ä\c" | hexdump -C

00000000  61 cc 88                                          |a..|
00000003  

Convert with iconv and check again with hexdump:

echo -e "ä\c" | iconv -f UTF-8-MAC -t UTF-8 | hexdump -C

00000000  c3 a4                                             |..|
00000002

printf '\xc3\xa4'
ä
mtt2p
  • This only works on macOS. There is no `utf-8-mac` encoding on Linux, on the BSDs, etc. Also, decomposition using this encoding does not follow the specification (it does follow the macOS filesystem normalization algorithm, though). More info: http://search.cpan.org/~tomita/Encode-UTF8Mac-0.04/lib/Encode/UTF8Mac.pm – antonone Feb 14 '17 at 11:56
  • @antonone to be fair though there was no OS specified in the question. – roaima Sep 15 '17 at 07:47
  • @roaima Yes, that's why I assumed that the answer should work on all systems that are based on Unix/Linux. The answer above works only on macOS. If one's looking for a macOS-specific answer, then it'll work, in part. I just wanted to point that out, because the other day I lost some time wondering why I have no `utf-8-mac` on Linux and whether this is normal. – antonone Sep 15 '17 at 10:55
4

For completeness, with Perl:

$ perl -CSA -MUnicode::Normalize=NFD -e 'print NFD($_) for @ARGV' $'\ue1' | uconv -x name
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}
$ perl -CSA -MUnicode::Normalize=NFC -e 'print NFC($_) for @ARGV' $'a\u301' | uconv -x name
\N{LATIN SMALL LETTER A WITH ACUTE}
Stéphane Chazelas
3

coreutils has a patch adding a proper `unorm` tool; it works fine for me on systems with 4-byte wchars. Follow http://crashcourse.housegordon.org/coreutils-multibyte-support.html#unorm The remaining problem is systems with 2-byte wchars (Cygwin, Windows, plus AIX and Solaris on 32-bit), which need to transform codepoints from the upper planes into surrogate pairs and vice versa; the underlying libunistring/gnulib cannot handle that yet.
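The surrogate-pair transformation mentioned above is plain UTF-16 arithmetic; here is a sketch in Python (illustrating the math only, not the actual libunistring code):

```python
def to_surrogate_pair(cp):
    """Split a codepoint above the BMP (> U+FFFF) into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)     # high (lead) surrogate: top 10 bits
    low = 0xDC00 + (v & 0x3FF)    # low (trail) surrogate: bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into the original codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1F600 round-trips through the pair 0xD83D 0xDE00
hi, lo = to_surrogate_pair(0x1F600)
print(hex(hi), hex(lo))                  # 0xd83d 0xde00
print(hex(from_surrogate_pair(hi, lo)))  # 0x1f600
```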

I do maintain these patches at https://github.com/rurban/coreutils/tree/multibyte

Perl has the `unichars` tool (from the Unicode::Tussle distribution), which also does the various normalization forms on the command line. http://search.cpan.org/dist/Unicode-Tussle/script/unichars

rurban
2

There's a Perl utility called Charlint available from

https://www.w3.org/International/charlint/

which does what you want. You'll also have to download a file from

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

After the first run you'll see Charlint complaining about incompatible entries in that file, so you'll have to delete those lines from UnicodeData.txt.

0

Since uconv doesn't seem to be well documented, and the Python solution posted here isn't actually a one-liner, here's a one-liner using Ruby:

ruby -e '$stdin.each_line {|line| puts line.unicode_normalize(:nfd)}' <infile >outfile

Documentation: https://apidock.com/ruby/v2_5_5/String/unicode_normalize