41

I have a UTF-8 file that contains text in multiple languages. Much of it is people's names. I need to convert it to ASCII, and I need the result to look as decent as possible.

There are many ways to approach converting from a wider encoding to a narrower one. The simplest transformation would be to replace all non-ASCII characters with some placeholder, like '_'. If I know the language the file is written in, there are additional possibilities, like romanization.

What Unix tool or programming language library available on Unix can give me a decent (best-effort) conversion from UTF-8 to ASCII?

Most of the text is in European languages written in the Latin script.
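
For reference, the placeholder approach described above could look roughly like this in Python. This is only a sketch: the helper name `replace_non_ascii` is made up for illustration, and it assumes the file has already been decoded from UTF-8 into a `str`.

```python
import re


# Hypothetical helper (not from any library): replace every character
# outside the ASCII range with a placeholder such as "_".
def replace_non_ascii(text: str, placeholder: str = "_") -> str:
    return re.sub(r"[^\x00-\x7F]", placeholder, text)


if __name__ == "__main__":
    # "caf\u00e9" is "café" written with a precomposed é (one code point).
    print(replace_non_ascii("caf\u00e9"))
```

Note that the substitution is per code point, so an "é" stored as "e" plus a combining accent keeps its "e" and only the accent becomes a placeholder.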

Gilles 'SO- stop being evil'
user7610
  • 1
    do you know where which language starts? There is e.g. a difference on how to handle non-availability of an umlaut (as on the ö). In German you can always write "oe", but e.g. in Dutch the unavailability of an umlaut can better be "described" by a dash followed by the umlauted character (and there the "oe" would be a completely different diphthong) – Anthon Dec 06 '14 at 17:03
  • How do you define “as decent as possible”? The real difficulty is in defining the mappings. Compared to that, the programming task is trivial. The mappings actually used vary a lot and may be language-specific in two ways: they depend on the language of the text and on the assumed language of the reader (especially as regards to romanization). – Jukka K. Korpela Dec 06 '14 at 22:54
  • @JukkaK.Korpela "as decent as possible" is of course defined by those who created the "Unix tool or programming language library available on Unix" that I am asking for. If the best I am gonna get is replacing everything non-ASCII with an underscore, then there is not much else I can do. Except writing my own tool, which I won't. I guess Unix@SO might not be the best place for this question… – user7610 Dec 06 '14 at 23:23
  • @user7610, that requirement does not make sense then. If there are two tools for the purpose, how are you going to decide which is better, if decency is defined in terms of what the authors of the tool think? You should specify the *purpose* and context of the mapping. Even then, the primary question to be resolved is which *principles* are best; only then can you evaluate tools. – Jukka K. Korpela Dec 07 '14 at 07:33
  • 1
    @user7610 Other than `iconv` and `tr`, there is [Unidecode](https://pypi.python.org/pypi/Unidecode/). I am not familiar with it, but it might do what you want, if you can use Python. – yellowantphil Dec 07 '14 at 23:56
  • 1
    @yellowantphil or [node-unidecode](https://github.com/FGRibreau/node-unidecode) in JavaScript/node, UnidecodeSharp in C♯, or [Text::Unidecode](http://search.cpan.org/~sburke/Text-Unidecode-1.23/lib/Text/Unidecode.pm) in Perl, which happens to be the first of that name. I guess there are other versions. – user7610 Dec 08 '14 at 11:27
  • 1
    An alternative `translit` command is suggested at https://askubuntu.com/a/1132121 – Nemo Apr 25 '19 at 16:37

5 Answers

46

This will work for some things:

iconv -f utf-8 -t ascii//TRANSLIT

`echo ĥéĺłœ π | iconv -f utf-8 -t ascii//TRANSLIT` returns `helloe ?`. Any characters that iconv doesn’t know how to convert are replaced with question marks.

iconv is specified by POSIX, but the //TRANSLIT suffix is not part of the standard, so I don’t know if all systems have it. It works for me on Linux. Also, the //IGNORE suffix silently discards characters that cannot be represented in the target character set (see man iconv_open).

An inferior but POSIX-compliant option is to use tr. The command below replaces every non-ASCII character with a question mark. Because it processes UTF-8 text one byte at a time, “É” might be replaced with E? or ?, depending on whether it was encoded using a combining accent or a precomposed character.

echo café äëïöü | tr -d '\200-\277' | tr '\300-\377' '[?*]'

That example returns caf? ?????, using precomposed characters.
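
If a scripting language is an option, the precomposed-versus-combining difference can be evened out by normalizing first. Below is a rough sketch in Python using only the standard `unicodedata` module. The function name `strip_accents` is made up here, and this is not a description of what iconv or tr do internally.

```python
import unicodedata


# Hypothetical helper: decompose characters, drop the combining accents,
# and fall back to a placeholder for anything still outside ASCII.
def strip_accents(text: str, placeholder: str = "?") -> str:
    # NFKD splits a precomposed "é" into "e" + U+0301 (combining acute).
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # skip the combining mark, keep the base letter
        out.append(ch if ord(ch) < 128 else placeholder)
    return "".join(out)


if __name__ == "__main__":
    # Both spellings of "café" give the same result.
    print(strip_accents("caf\u00e9"), strip_accents("cafe\u0301"))
```

Letters without a Unicode decomposition, such as "ł" or "œ", still end up as the placeholder, so this covers less ground than a real transliteration table.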

user664833
yellowantphil
19
konwert utf8-ascii

It will do a best-effort conversion, depending on the conversion tables. If you roughly know the input language, there are language-specific filters that give better results, e.g.

konwert utf8-xmetodo

is the conversion of Esperanto into the x-metodo representation,

konwert UTF8-tex

will try to produce a TeX representation of diacritics. There are also language-specific parameters:

konwert UTF8-ascii/de

will transliterate "ä" into "ae" (customary for German) instead of plain "a"

konwert UTF8-ascii/rosyjski

will use Polish rules for transliterating Russian, instead of the "English-like" ones, etc...
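
konwert's tables are its own. The following only sketches the general idea (a language-specific table first, a generic fallback second) in Python. The `GERMAN` table and the `to_ascii` function are hypothetical names, and the table is far from complete.

```python
import unicodedata

# Hypothetical, incomplete table of customary German replacements.
GERMAN = {
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
    "\u00c4": "Ae",  # Ä
    "\u00d6": "Oe",  # Ö
    "\u00dc": "Ue",  # Ü
    "\u00df": "ss",  # ß
}


def to_ascii(text, table=None):
    # Compose first so "u" + combining diaeresis matches the table key "ü".
    text = unicodedata.normalize("NFC", text)
    if table:
        text = text.translate(str.maketrans(table))
    # Generic fallback: decompose, drop combining marks, "?" for the rest.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "replace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii("M\u00fcller", GERMAN))  # German rules
    print(to_ascii("M\u00fcller"))          # plain accent stripping
```

Keeping the language table separate from the fallback means the same function can be reused with other tables when the input language is known.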

Radovan Garabík
  • 3
    Is this the latest location of the `konwert` website? Is it packaged anywhere? https://github.com/taw/konwert/tree/master/konwert-1.8 – Nemo Apr 25 '19 at 16:36
  • 1
    @Nemo It is available as a [Debian package](https://packages.debian.org/search?keywords=konwert&searchon=names&exact=1&suite=stable&section=all). – user5534993 Jan 27 '21 at 17:02
  • Nice, apparently Arch Linux too (and of course the various Debian downstreams) https://repology.org/project/konwert/versions – Nemo Jan 28 '21 at 09:29
7

Try `uni2ascii -B input.txt > output.txt`

uni2ascii

philcolbourn
4

I ended up using Perl with Text::Unidecode for this. There are ports of the library for other languages.

Examples of a few difficult cases:

perl -e 'use utf8; use Text::Unidecode; print unidecode("عبد الله الثاني بن الحسين")'

produces bd llh lthny bn lHsyn, which is an acceptable result for my purposes.

It can even do Chinese characters, to some degree:

$ perl -e 'use utf8; use Text::Unidecode; print unidecode("工廠")'
Gong Chang
$ perl -e 'use utf8; use Text::Unidecode; print unidecode("工厂")'
Gong Han
user7610
1

I have a file in UTF-8 that contains [people's names] in multiple languages [that I want to convert to something meaningful in ASCII].

You mean you want to be able to convert the following names into some ASCII string the person concerned would not object to?

  • ஸ்றீனிவாஸ ராமானுஜன் ஐயங்கார்
  • عبد الله الثاني بن الحسين

I suspect there is no automated tool that can do this. There can be either no or very many Latinizations of personal names. Software cannot choose the culturally acceptable version. At least not without the software knowing a lot about the culture of the person involved.

See also https://stackoverflow.com/a/1398403/477035

RedGrittyBrick
  • 2
    `perl -e 'use utf8; use Text::Unidecode; print unidecode("عبد الله الثاني بن الحسين")'` produces `bd llh lthny bn lHsyn`, which is a good enough transliteration for my purposes. – user7610 Sep 02 '15 at 15:53
  • 4
    @user7610: Fine but *King Abdulla II of Jordan* might disagree. I would prepare an explanation in case someone important complains to the CEO :-) – RedGrittyBrick Sep 02 '15 at 15:56