How can I convert Persian numerals in UTF-8 to European numerals in ASCII?

Question

In Persian numerals, ۰۱۲۳۴۵۶۷۸۹ is equivalent to 0123456789 in European digits.

How can I convert Persian numerals (in UTF-8) to ASCII digits?

For example, I want ۲۱ to become 21.

Interesting, it seems like `echo "۰۱۲۳۴۵۶۷۸۹" | iconv -f UTF-8 -t ascii//TRANSLIT` doesn't handle it... – Kusalananda, Jun 19 '16 at 11:52
@Kusalananda: Is it really that unexpected? As I understand it, `iconv` is just there to map characters between encodings, but these characters (Eastern Arabic numerals) have no equivalent in ASCII. You can only convert them to something similar enough, and that is one-way. – phk, Jun 19 '16 at 12:20
Well, I wasn't quite sure what `iconv` was and wasn't capable of doing. I was hoping that using `//TRANSLIT` would help, but it didn't. – Kusalananda, Jun 19 '16 at 12:25
Do you also need to reverse the order? I know that Arabic numerals are written little-endian right-to-left, and Latin numerals are big-endian left-to-right (looking similar in print or on screen, but reversed in memory). Is Persian the same? – Toby Speight, Jun 20 '16 at 12:13
@TobySpeight: No reversal is needed; Arabic and Persian numbers are written left-to-right, like European digits. Only the letters are written right-to-left. – Baba, Jun 20 '16 at 19:11

score 30 · Answer 1 · edited Apr 13 '17 at 12:36


Since it's a fixed set of numbers, you can do it by hand:

$ echo ۲۱ | LC_ALL=en_US.UTF-8 sed -e 'y/۰۱۲۳۴۵۶۷۸۹/0123456789/'
21

(or use tr, although GNU tr does not yet handle multibyte characters)

Setting a UTF-8 locale such as en_US.UTF-8 (or, better, the locale your character set belongs to) is required for sed to recognize the characters.
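For comparison, the fixed one-to-one mapping that `sed`'s `y` command applies can be sketched in Python with a translation table. This is only an illustration of the same idea; the names `PERSIAN_DIGITS`, `TABLE` and `to_ascii_digits` are made up here. Python strings are sequences of code points, so no locale setting is involved.

```python
# Map each Persian digit to the ASCII digit in the same position,
# much like sed's y/source/dest/ command.
PERSIAN_DIGITS = "۰۱۲۳۴۵۶۷۸۹"
TABLE = str.maketrans(PERSIAN_DIGITS, "0123456789")


def to_ascii_digits(text: str) -> str:
    """Replace Persian digits in text; leave every other character alone."""
    return text.translate(TABLE)


if __name__ == "__main__":
    print(to_ascii_digits("۲۱"))  # prints 21
```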

With perl:

$ echo "۲۱" |
  perl -CS -MUnicode::UCD=num -MUnicode::Normalize -lne 'print num(NFKD($_))'
21
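A rough Python counterpart to the numeric approach relies on the built-in `int()`, which accepts Unicode decimal digits and not just ASCII ones. Because the result is an integer, leading zeros are not preserved. The helper name `persian_to_int` is hypothetical.

```python
def persian_to_int(text: str) -> int:
    """Parse a string of Unicode decimal digits as an integer.

    int() accepts characters in the Unicode "Nd" (decimal digit)
    category, which includes the Persian digits U+06F0..U+06F9.
    """
    return int(text)


if __name__ == "__main__":
    print(persian_to_int("۲۱"))   # prints 21
    print(persian_to_int("۰۰۷"))  # prints 7 (leading zeros are dropped)
```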


answered Jun 19 '16 at 11:58

cuonglm


Setting `LC_ALL` is needed so that `sed` treats each Unicode character as a single character, right? – phk Jun 19 '16 at 12:03
@phk: Yes, see the update. – cuonglm Jun 19 '16 at 12:04
Why must everything be a sed script? Didn't we invent `tr` for this exact purpose? – Kevin Jun 19 '16 at 15:26

@Kevin See the other answer involving `tr` how it does not work everywhere. Also keep in mind that some tools are optimized for dealing with bytes while others are for dealing with characters, with Unicode (especially UTF-8) this makes a huge difference. – phk Jun 19 '16 at 15:28
This doesn’t work for me on OS X 10.10.5/GNU bash 4.3. Weirdly enough I need to *remove* the explicit setting of `LC_ALL`. `LC_ALL` is also not set in my environment (but `LANG` is set to `en_GB.UTF-8`). With the above code, I get the error “sed: 1: "y/۰۱۲۳۴۵۶۷۸۹/ ...": transform strings are not the same length”. – Konrad Rudolph Jun 20 '16 at 15:32
@KonradRudolph: Check if your locale has `en_US.utf8`. What command did you run? How about setting `LC_ALL=en_GB.UTF-8`? – cuonglm Jun 20 '16 at 15:42
@cuonglm Indeed, when I replace `utf8` with `UTF-8` it works. In fact, I’ve never heard of the spelling without the dash. Might this be a typo in your answer or do some systems use that name? (EDIT, found this: http://superuser.com/a/999151/2269) – Konrad Rudolph Jun 20 '16 at 15:44

score 16 · Answer 2 · edited May 23 '17 at 12:39


For Python there is the unidecode library which handles such conversions in general: https://pypi.python.org/pypi/Unidecode.

In Python 2:

>>> from unidecode import unidecode
>>> unidecode(u"۰۱۲۳۴۵۶۷۸۹")
'0123456789'

In Python 3:

>>> from unidecode import unidecode
>>> unidecode("۰۱۲۳۴۵۶۷۸۹")
'0123456789'

The SO thread at https://stackoverflow.com/q/8087381/2261442 might be related.

Edit: As Wander Nauta pointed out in the comments, and as mentioned on the Unidecode page, there is also a shell version of unidecode (under /usr/local/bin/ if installed with pip):

$ echo '۰۱۲۳۴۵۶۷۸۹' | unidecode
0123456789
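Where installing a third-party package is not an option, a narrower sketch using only the standard library's `unicodedata` module can convert digits while leaving all other text untouched. Unlike unidecode, it does not transliterate letters. The function name `digits_to_ascii` is an assumption made for this example.

```python
import unicodedata


def digits_to_ascii(text: str) -> str:
    """Convert every Unicode decimal digit to ASCII; keep other characters."""
    out = []
    for ch in text:
        # decimal() returns the digit's value, or the default (None here)
        # when the character is not a decimal digit.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


if __name__ == "__main__":
    print(digits_to_ascii("۰۱۲۳۴۵۶۷۸۹"))  # prints 0123456789
```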


answered Jun 19 '16 at 11:39

phk



The unidecode library also ships a utility called (unsurprisingly) `unidecode` which does the same as your Python 3 snippet. Just `echo '۰۱۲۳۴۵۶۷۸۹' | unidecode` should work. – Wander Nauta Jun 20 '16 at 11:43
@Wander - the Debian package of python-unidecode doesn't ship the utility program, so the long form may be necessary on such platforms (I didn't find one in the source tarball from upstream, so perhaps the program is something added by your distribution?) – Toby Speight Jun 20 '16 at 15:25
@TobySpeight If you install it using `pip` it's there. – phk Jun 20 '16 at 15:30
@TobySpeight The utility is in the upstream tarball as `unidecode/util.py` - strange that Debian doesn't include it. (Edit: Ah, mystery solved. The Debian package is out of date and older than the utility.) – Wander Nauta Jun 20 '16 at 15:31

Vombat · Answer 3 · 2016-06-30T06:04:49.540

8

A pure bash version:

#!/bin/bash

number="$1"

number=${number//۱/1}
number=${number//۲/2}
number=${number//۳/3}
number=${number//۴/4}
number=${number//۵/5}
number=${number//۶/6}
number=${number//۷/7}
number=${number//۸/8}
number=${number//۹/9}
number=${number//۰/0}

echo "Result is $number"

Tested on my Gentoo machine, and it works.

./convert ۱۳۲
Result is 132

Done as a loop, given the list of characters (from 0 to 9) to convert:

#!/bin/bash
conv() ( LC_ALL=en_US.UTF-8
         local n="$2"
         for ((i=0;i<${#1};i++)); do
              n=${n//"${1:i:1}"/"$i"}
         done
         printf '%s\n' "$n"
       )

conv "۰۱۲۳۴۵۶۷۸۹" "$1"

And used as:

$ convert ۱۳۲
132
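The loop version relies on each character's position in the list being its numeric value. A minimal Python sketch of the same technique, with `conv` mirroring the shell function's name and argument order (the type hints and docstring are additions for illustration):

```python
def conv(digits: str, text: str) -> str:
    """Replace the i-th character of digits with the ASCII digit i."""
    for i, ch in enumerate(digits):
        text = text.replace(ch, str(i))
    return text


if __name__ == "__main__":
    print(conv("۰۱۲۳۴۵۶۷۸۹", "۱۳۲"))  # prints 132
```

Passing a different ten-character list as the first argument converts another digit set in the same way.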

Another (rather overkill) way using grep:

#!/bin/bash

nums=$(echo "$1" | grep -o .)
result=""   # build the converted number as a plain string

for i in $nums
do
    case $i in
        ۱)
            result+=1
            ;;
        ۲)
            result+=2
            ;;
        ۳)
            result+=3
            ;;
        ۴)
            result+=4
            ;;
        ۵)
            result+=5
            ;;
        ۶)
            result+=6
            ;;
        ۷)
            result+=7
            ;;
        ۸)
            result+=8
            ;;
        ۹)
            result+=9
            ;;
        ۰)
            result+=0
            ;;
    esac
done
echo "Result is $result"

edited Jun 30 '16 at 06:04

answered Jun 20 '16 at 06:50

Vombat



Pure Bash, except for the `grep`. In fact, I don't understand that line, nor why you do not set `result=0`. Are you being overly cautious in case `$1` contains things other than Farsi digits? – Kusalananda Jun 20 '16 at 06:56
@Kusalananda that line reads the Farsi digits into nums. Makes it loop-able. – Vombat Jun 20 '16 at 07:01

Ten simple substitutions would have been quicker... `number=${number//۱/1}` etc., and would avoid the `echo` and `grep`. – Kusalananda Jun 20 '16 at 07:06

@Kusalananda Nice. Changed it. Now it is pure Bash! ;-) – Vombat Jun 20 '16 at 07:18
@coffeMug: ۱۳۲ is 132, not 123 :D – Baba Jun 20 '16 at 10:26
@Babyy Damn copy paste! And you have sharp eyes. ;-) – Vombat Jun 20 '16 at 10:29

score 7 · Accepted Answer · 2016-06-30T01:34:31.863


We can take advantage of the fact that the Unicode code points of the Persian numerals are consecutive and ordered from 0 to 9:

$ printf '%b' '\U06F'{0..9}
۰۱۲۳۴۵۶۷۸۹

That means that the last hex digit IS the decimal value:

$ echo $(( $(printf '%d' "'۲") & 0xF ))
2
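The same observation can be checked in Python, where `ord()` returns a character's code point. This is a small sketch for illustration; `low_nibble` is a made-up helper name.

```python
def low_nibble(ch: str) -> int:
    """Return the last hex digit of a character's Unicode code point."""
    return ord(ch) & 0xF


if __name__ == "__main__":
    for ch in "۰۱۲۳۴۵۶۷۸۹":
        # Prints lines such as "U+06F2 -> 2" for each Persian digit.
        print(f"U+{ord(ch):04X} -> {low_nibble(ch)}")
```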

That makes this simple loop a conversion tool:

#!/bin/bash
(   ### Use a UTF-8 locale so that each numeral is handled as one character.
    ### A locale such as LC_ALL=fa_IR.UTF-8 may suit you better.
    LC_ALL=en_US.UTF-8
    a="$1"
    while (( ${#a} > 0 )); do
        # extract the last hex digit from the UNICODE code point
        # of the first character in the string "$a":
        printf '%d' $(( $(printf '%d' "'$a") & 15 ))
        a=${a#?}    ## Remove one character from $a
    done
)
echo

Using it as:

$ sefr.sh ۰۱۲۳۴۵۶۷۸۹
0123456789

$ sefr.sh ۲۰۱
201

$ sefr.sh ۲۱
21

Note that this code could also convert Arabic and Latin numerals (even if mixed):

$ sefr.sh ۴4٤۵5٥۶6٦۷7٧۸8٨۹9٩
444555666777888999

$ sefr.sh ٤٧0٠٦7١٣3٥۶٦۷
4700671335667
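Mixed input works because all three digit blocks start at a code point whose last hex digit is 0: ASCII digits at U+0030, Arabic-Indic digits at U+0660 and Persian digits at U+06F0. The shell script masks every character without checking it, so a Latin letter such as a (U+0061) would come out as 1. A hedged Python sketch that adds a range check follows; `DIGIT_BLOCK_STARTS` and `normalize_digits` are names invented for this example.

```python
# Each block of ten digits starts at a code point ending in hex 0,
# so (code point & 0xF) is the digit's value.
DIGIT_BLOCK_STARTS = (0x0030, 0x0660, 0x06F0)  # ASCII, Arabic-Indic, Persian


def normalize_digits(text: str) -> str:
    """Convert digits from the three blocks to ASCII; keep everything else."""
    out = []
    for ch in text:
        cp = ord(ch)
        if any(start <= cp <= start + 9 for start in DIGIT_BLOCK_STARTS):
            out.append(str(cp & 0xF))
        else:
            out.append(ch)  # not a digit in one of the three blocks
    return "".join(out)


if __name__ == "__main__":
    print(normalize_digits("۴4٤"))  # prints 444
```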

edited Jun 30 '16 at 01:34

answered Jun 28 '16 at 03:09

Thank you very much, this is a very nice solution. I have a question: in the command printf '%d' '"۰', why is the double quotation mark used? – Baba Jun 28 '16 at 07:59

@Babyy It is not a double quotation; it is a way to give printf an argument that starts with a single quote: `'۰`. It could also have been written as `'"۰'`. The reason is that printf will give the Unicode code point if the argument starts with a single quote `'` or a double quote `"`. Search a little [before this link](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html#tag_20_94_14) for the text "If the leading character is a single-quote or double-quote" – Jun 29 '16 at 03:35
@Babyy The code has been extended to convert Persian, Arabic, and Latin (even if mixed). – Jun 29 '16 at 07:03

Kusalananda · Answer 5 · 2016-06-19T12:12:06.930

3

Since iconv can't seem to grok this, the next port of call would be to use the tr utility:

$ echo "۲۱" | tr '۰۱۲۳۴۵۶۷۸۹' '0123456789'
21

tr translates one set of characters to another, so we simply tell it to translate the set of Farsi digits to the set of Latin digits.

EDIT: As user @cuonglm points out, this requires a non-GNU tr, for example the tr on a Mac, and it also requires that $LC_CTYPE is set to en_US.UTF-8.
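The reason a byte-oriented tool struggles here can be seen by looking at the UTF-8 encoding: each Persian digit occupies two bytes, while an ASCII digit takes one, so a byte-by-byte translation cannot map one onto the other. A small Python sketch to illustrate (`utf8_bytes` is a made-up helper):

```python
def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")


if __name__ == "__main__":
    print(utf8_bytes("۲"))  # prints b'\xdb\xb2' (two bytes)
    print(utf8_bytes("2"))  # prints b'2' (one byte)
```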

edited Jun 19 '16 at 12:12

answered Jun 19 '16 at 12:00

Kusalananda



Note that it won't work with GNU tr, which does not support multi-byte characters. – cuonglm Jun 19 '16 at 12:01

Oh my. Silly GNU. ;-) – Kusalananda Jun 19 '16 at 12:02
And also you need to set your locale to the one which supports unicode, like `en_US.utf8`. – cuonglm Jun 19 '16 at 12:07

score 1 · Answer 6 · answered Oct 20 '20 at 10:13

numconv is available in the repositories of some Linux distributions (Debian and Ubuntu, at least). Install it, then run:

$ echo '۱۲۳۴۵۶۷۸۹۰' | numconv
1234567890

(Edit: Note that leading zeros are removed, and that this is purely for numeric conversion, and will not work with streams that contain non-numeric characters as well.)
