Converting a UTF-8 file to ASCII (best-effort)

Question

I have a UTF-8 file that contains text in multiple languages, much of it people's names. I need to convert it to ASCII, and I need the result to look as decent as possible.

There are many ways to approach converting from a wider encoding to a narrower one. The simplest transformation would be to replace all non-ASCII characters with some placeholder, like '_'. If I know the language the file is written in, there are additional possibilities, like romanization.

What Unix tool or programming language library available on Unix can give me a decent (best-effort) conversion from UTF-8 to ASCII?

Most of the text is in European languages written in the Latin script.

Do you know where each language starts? There is e.g. a difference in how to handle the non-availability of an umlaut (as on the ö). In German you can always write "oe", but e.g. in Dutch the unavailability of an umlaut can better be "described" by a dash followed by the umlauted character (and there the "oe" would be a completely different diphthong) — Anthon, Dec 06 '14 at 17:03
How do you define “as decent as possible”? The real difficulty is in defining the mappings. Compared to that, the programming task is trivial. The mappings actually used vary a lot and may be language-specific in two ways: they depend on the language of the text and on the assumed language of the reader (especially as regards to romanization). — Jukka K. Korpela, Dec 06 '14 at 22:54
@JukkaK.Korpela "as decent as possible" is of course defined by those who created the "Unix tool or programming language library available on Unix" that I am asking for. If the best I am gonna get is replacing everything non-ASCII with an underscore, then there is not much else I can do. Except writing my own tool, which I won't. I guess Unix@SO might not be the best place for this question… — user7610, Dec 06 '14 at 23:23
@user7610, that requirement does not make sense then. If there are two tools for the purpose, how are you going to decide which is better, if decency is defined in terms of what the authors of the tool think? You should specify the *purpose* and context of the mapping. Even then, the primary question to be resolved is which *principles* are best; only then can you evaluate tools. — Jukka K. Korpela, Dec 07 '14 at 07:33
@user7610 Other than `iconv` and `tr`, there is [Unidecode](https://pypi.python.org/pypi/Unidecode/). I am not familiar with it, but it might do what you want, if you can use Python. — yellowantphil, Dec 07 '14 at 23:56
@yellowantphil or [node-unidecode](https://github.com/FGRibreau/node-unidecode) in JavaScript/node, UnidecodeSharp in C♯, or [Text::Unidecode](http://search.cpan.org/~sburke/Text-Unidecode-1.23/lib/Text/Unidecode.pm) in Perl, which happens to be first of this name. I guess there are other versions. — user7610, Dec 08 '14 at 11:27
An alternative `translit` command is suggested at https://askubuntu.com/a/1132121 — Nemo, Apr 25 '19 at 16:37

score 46 · Answer 1 · edited Sep 10 '19 at 23:11

This will work for some things:

iconv -f utf-8 -t ascii//TRANSLIT

`echo ĥéĺłœ π | iconv -f utf-8 -t ascii//TRANSLIT` returns `helloe ?`. Any characters that iconv doesn’t know how to convert will be replaced with question marks.

iconv is POSIX, but I don’t know if all systems have the TRANSLIT option. It works for me on Linux. Also, the IGNORE option will silently discard characters that cannot be represented in the target character set (see man iconv_open).
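
For readers who would rather call a library than a pipeline, the behaviours described above (strip what can be stripped, replace the rest with `?`, or silently discard it) have rough analogues in Python's standard library. This is only a sketch and not how iconv works internally: `to_ascii` is a hypothetical helper name, and it only removes accents that Unicode can decompose, so unlike the `//TRANSLIT` output shown above it will not turn `œ` into `oe` or `ł` into `l`.

```python
import unicodedata


def to_ascii(text: str, errors: str = "replace") -> str:
    """Best-effort ASCII: strip decomposable accents, then apply `errors`.

    errors="replace" substitutes "?" for what is left over;
    errors="ignore" drops it, similar in spirit to iconv's IGNORE option.
    """
    # NFKD splits "é" into "e" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so that only the base letters remain.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII (ł, œ, π, ...) is left to the codec.
    return stripped.encode("ascii", errors=errors).decode("ascii")


print(to_ascii("ĥéĺłœ π"))            # accents stripped, the rest become "?"
print(to_ascii("ĥéĺłœ π", "ignore"))  # the rest are dropped instead
```

The design choice here is to decompose first and only then fall back to the codec's error handler, so that as many letters as possible survive before anything is replaced or discarded.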

An inferior but POSIX-compliant option is to use tr. The command below replaces every non-ASCII character with a question mark. It reads UTF-8 text one byte at a time, so “É” might be replaced with E? or ?, depending on whether it was encoded using a combining accent or a precomposed character.

echo café äëïöü | tr -d '\200-\277' | tr '\300-\377' '[?*]'

That example returns caf? ?????, using precomposed characters.
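
The difference between precomposed and combining input is easier to see when the same byte-level logic is replayed outside the shell. The Python sketch below imitates the two `tr` stages on raw bytes; `tr_like` is a made-up helper name, not part of any library, and it assumes the input is valid UTF-8.

```python
def tr_like(data: bytes) -> bytes:
    """Imitate the two tr stages: delete bytes 0x80-0xBF, map 0xC0-0xFF to '?'."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0xBF:
            continue              # UTF-8 continuation byte: deleted
        if b >= 0xC0:
            out.append(ord("?"))  # UTF-8 lead byte: one "?" per character
        else:
            out.append(b)         # plain ASCII passes through unchanged
    return bytes(out)


precomposed = "caf\u00e9".encode("utf-8")  # é as a single code point
combining = "cafe\u0301".encode("utf-8")   # e followed by U+0301
print(tr_like(precomposed).decode("ascii"))  # caf?
print(tr_like(combining).decode("ascii"))    # cafe?
```

Because every multi-byte UTF-8 character has exactly one lead byte, each non-ASCII character collapses to a single `?`, whether it is a letter or a stand-alone combining mark.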

edited Sep 10 '19 at 23:11

user664833

answered Dec 07 '14 at 00:40

yellowantphil

`tr` is not meant to work one byte at a time. GNU tr does, but it's a bug. – Stéphane Chazelas Oct 09 '15 at 15:46
`iconv -f utf-8 -t ascii//TRANSLIT` worked well for me. It changed curly quotes to straight quotes. Thanks. – Colonel Panic Mar 30 '16 at 15:37
Note that iconv will choke on heavily accented characters such as Pinyin. – sventechie Dec 29 '16 at 19:10
Note that `//TRANSLIT` also works for other sets of characters, e.g. `iso-8859-1//TRANSLIT`. – Skippy le Grand Gourou Jun 19 '18 at 14:33
`iconv` gives `iconv: illegal input sequence at position 1234` and truncates the file for me. Would be nice if it just deleted the character and tried to pick up the sequence again. – jozxyqk Jul 18 '18 at 23:21
The `-c` option to `iconv` silently discards characters that cannot be converted instead of terminating – mykel May 26 '21 at 04:16
`//TRANSLIT` replaces all Cyrillic / Greek letters with `?` unless you set up a custom locale with the appropriate conversion table. For some reason, this is not done by default. – Dmitry Grigoryev Jun 30 '21 at 22:33

Radovan Garabík · Accepted Answer · 2015-10-09T21:01:37.283

19

konwert utf8-ascii

It does a best-effort conversion, depending on its conversion tables. If you roughly know the input language, there are language-specific filters that give better results, e.g.

konwert utf8-xmetodo

is the conversion of Esperanto into the x-metodo representation,

konwert UTF8-tex

will try to produce a TeX representation of diacritics. There are also language-specific parameters:

konwert UTF8-ascii/de

will transliterate "ä" into "ae" (customary for German) instead of plain "a"

konwert UTF8-ascii/rosyjski

will use Polish rules for transliterating Russian, instead of the "English-like" ones, etc...
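
The idea of language-specific rules layered over a generic fallback can be sketched in a few lines of Python. This is an illustration only: the `LANGUAGE_RULES` table and the `transliterate` function are hypothetical names with a tiny hand-written German table, and they are not konwert's actual tables or interface.

```python
import unicodedata

# Hypothetical per-language overrides, for illustration only;
# these are NOT konwert's conversion tables.
LANGUAGE_RULES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"},
}


def transliterate(text: str, lang: str = "") -> str:
    """Apply language-specific rules first, then a generic accent-stripping fallback."""
    rules = LANGUAGE_RULES.get(lang, {})
    # Normalize to NFC so "a" + combining diaeresis matches the "ä" key.
    text = unicodedata.normalize("NFC", text)
    text = "".join(rules.get(ch, ch) for ch in text)
    # Generic fallback: strip remaining accents, replace the rest with "?".
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", errors="replace").decode("ascii")


print(transliterate("Jürgen Groß", "de"))  # Juergen Gross
print(transliterate("Jürgen Groß"))        # Jurgen Gro?
```

Without a language hint the fallback can only drop the diaeresis, which is why knowing the input language gives better results.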

edited Oct 09 '15 at 21:01

answered Oct 09 '15 at 15:36

Radovan Garabík

Is this the latest location of the `konwert` website? Is it packaged anywhere? https://github.com/taw/konwert/tree/master/konwert-1.8 – Nemo Apr 25 '19 at 16:36
@Nemo It is available as a [Debian package](https://packages.debian.org/search?keywords=konwert&searchon=names&exact=1&suite=stable&section=all). – user5534993 Jan 27 '21 at 17:02
Nice, apparently Arch Linux too (and of course the various Debian downstreams) https://repology.org/project/konwert/versions – Nemo Jan 28 '21 at 09:29

score 7 · Answer 3 · answered Jul 20 '17 at 11:04

Try `uni2ascii -B input.txt >output.txt`

answered Jul 20 '17 at 11:04

philcolbourn

Worked well for French on MacOS. – Andriy Makukha Feb 15 '20 at 17:45
OK, but what is the command line if someone wants to convert multiple text files in a particular folder? Can they be converted from Unicode to ASCII in place, in the same files? – Just Me Sep 24 '22 at 12:42

user7610 · Answer 4 · 2023-04-18T09:38:29.120

I ended up using Perl with Text::Unidecode for this. There are ports of the library for other languages.

Examples of a few difficult cases:

perl -e 'use utf8; use Text::Unidecode; print unidecode("عبد الله الثاني بن الحسين")'

produces bd llh lthny bn lHsyn, which is an acceptable result for my purposes.

It can even do Chinese characters, to some degree:

$ perl -e 'use utf8; use Text::Unidecode; print unidecode("工廠")'
Gong Chang
$ perl -e 'use utf8; use Text::Unidecode; print unidecode("工厂")'
Gong Han

score 1 · Answer 5 · edited May 23 '17 at 12:39

I have a file in UTF-8 that contains [people's names] in multiple languages [that I want to convert to something meaningful in ASCII].

You mean you want to be able to convert the following names into some ASCII string the person concerned would not object to?

ஸ்றீனிவாஸ ராமானுஜன் ஐயங்கார்
عبد الله الثاني بن الحسين

I suspect there is no automated tool that can do this. There can be either no or very many Latinizations of personal names. Software cannot choose the culturally acceptable version. At least not without the software knowing a lot about the culture of the person involved.

See also https://stackoverflow.com/a/1398403/477035

edited May 23 '17 at 12:39

Community

answered Sep 02 '15 at 15:07

RedGrittyBrick

`perl -e 'use utf8; use Text::Unidecode; print unidecode("عبد الله الثاني بن الحسين")'` produces ``bd llh lthny bn lHsyn` which is a good enough transliteration for my purposes. – user7610 Sep 02 '15 at 15:53
@user7610: Fine, but *King Abdullah II of Jordan* might disagree. I would prepare an explanation in case someone important complains to the CEO :-) – RedGrittyBrick Sep 02 '15 at 15:56
