In Persian numerals, ۰۱۲۳۴۵۶۷۸۹ is equivalent to 0123456789 in European digits.
How can I convert Persian numbers (in UTF-8) to ASCII?
For example, I want ۲۱ to become 21.
Since it's a fixed set of numbers, you can do it by hand:
$ echo ۲۱ | LC_ALL=en_US.UTF-8 sed -e 'y/۰۱۲۳۴۵۶۷۸۹/0123456789/'
21
(The same works with tr, but not yet with GNU tr.)
Setting the locale to en_US.UTF-8 (or, better, to the locale your character set belongs to) is required for sed to recognize these characters.
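The character-for-character mapping that sed's y/// performs can also be sketched in Python 3 with str.maketrans and str.translate. This is only an illustration; the names PERSIAN_DIGITS, TABLE and persian_to_ascii are made up here and are not part of the answer above:

```python
# Sketch: map each Persian digit to the ASCII digit at the same position.
PERSIAN_DIGITS = "۰۱۲۳۴۵۶۷۸۹"
ASCII_DIGITS = "0123456789"
TABLE = str.maketrans(PERSIAN_DIGITS, ASCII_DIGITS)


def persian_to_ascii(text):
    """Replace Persian digits in text; all other characters are kept as-is."""
    return text.translate(TABLE)


if __name__ == "__main__":
    print(persian_to_ascii("۲۱"))
```

Because Python 3 strings are sequences of code points rather than bytes, no locale setting is needed for the translation itself.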
With perl:
$ echo "۲۱" |
perl -CS -MUnicode::UCD=num -MUnicode::Normalize -lne 'print num(NFKD($_))'
21
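A rough Python 3 counterpart (not an exact equivalent of Perl's num) is to look up each character's decimal value in the Unicode database with unicodedata.decimal. The helper name digits_to_ascii is hypothetical:

```python
import unicodedata


def digits_to_ascii(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        # decimal() returns the default (None here) for non-digit characters.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


if __name__ == "__main__":
    print(digits_to_ascii("۲۱"))
```

Unlike a fixed translation table, this sketch also handles digits from other scripts, since it relies on the Unicode character properties rather than on a hand-written list.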
For Python there is the unidecode library which handles such conversions in general: https://pypi.python.org/pypi/Unidecode.
In Python 2:
>>> from unidecode import unidecode
>>> unidecode(u"۰۱۲۳۴۵۶۷۸۹")
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> unidecode("۰۱۲۳۴۵۶۷۸۹")
'0123456789'
The SO thread at https://stackoverflow.com/q/8087381/2261442 might be related.
/edit:
As Wander Nauta pointed out in the comments, and as mentioned on the Unidecode page, there is also a shell version of unidecode (under /usr/local/bin/ if installed via pip):
$ echo '۰۱۲۳۴۵۶۷۸۹' | unidecode
0123456789
A pure bash version:
#!/bin/bash
number="$1"
number=${number//۱/1}
number=${number//۲/2}
number=${number//۳/3}
number=${number//۴/4}
number=${number//۵/5}
number=${number//۶/6}
number=${number//۷/7}
number=${number//۸/8}
number=${number//۹/9}
number=${number//۰/0}
echo "Result is $number"
I have tested this on my Gentoo machine and it works:
./convert ۱۳۲
Result is 132
Done as a loop, given the list of characters (from 0 to 9) to convert:
#!/bin/bash
conv() ( LC_ALL=en_US.UTF-8
local n="$2"
for ((i=0;i<${#1};i++)); do
n=${n//"${1:i:1}"/"$i"}
done
printf '%s\n' "$n"
)
conv "۰۱۲۳۴۵۶۷۸۹" "$1"
And used as:
$ convert ۱۳۲
132
Another (rather overkill) way using grep:
#!/bin/bash
nums=$(echo "$1" | grep -o .)
result=""
for i in $nums
do
case $i in
۱)
result+=1
;;
۲)
result+=2
;;
۳)
result+=3
;;
۴)
result+=4
;;
۵)
result+=5
;;
۶)
result+=6
;;
۷)
result+=7
;;
۸)
result+=8
;;
۹)
result+=9
;;
۰)
result+=0
;;
esac
done
echo "Result is $result"
We can take advantage of the fact that the Unicode code points of the Persian numerals are consecutive and ordered from 0 to 9:
$ printf '%b' '\U06F'{0..9}
۰۱۲۳۴۵۶۷۸۹
That means that the last hex digit IS the decimal value:
$ echo $(( $(printf '%d' "'۲") & 0xF ))
2
That makes this simple loop a conversion tool:
#!/bin/bash
( ### Use a UTF-8 locale to make the script more reliable.
### Maybe something like LC_ALL=fa_IR.UTF-8 for you?
LC_ALL=en_US.UTF-8
a="$1"
while (( ${#a} > 0 )); do
# extract the last hex digit from the UNICODE code point
# of the first character in the string "$a":
printf '%d' $(( $(printf '%d' "'$a") & 15 ))
a=${a#?} ## Remove one character from $a
done
)
echo
Using it as:
$ sefr.sh ۰۱۲۳۴۵۶۷۸۹
0123456789
$ sefr.sh ۲۰۱
201
$ sefr.sh ۲۱
21
Note that this code could also convert Arabic and Latin numerals (even if mixed):
$ sefr.sh ۴4٤۵5٥۶6٦۷7٧۸8٨۹9٩
444555666777888999
$ sefr.sh ٤٧0٠٦7١٣3٥۶٦۷
4700671335667
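The same low-nibble trick can be sketched in Python 3 with ord. One deliberate difference, which is an assumption of this sketch and not something the shell script does: the mask is applied only to characters inside the three digit ranges (ASCII U+0030, Arabic-Indic U+0660, Persian U+06F0), and everything else is passed through unchanged. The function name low_nibble_digits is made up for illustration:

```python
# First code point of each supported digit block (each block holds 0-9).
DIGIT_BASES = (0x0030, 0x0660, 0x06F0)


def low_nibble_digits(text):
    """Convert ASCII, Arabic-Indic and Persian digits using code point & 0xF."""
    out = []
    for ch in text:
        cp = ord(ch)
        if any(base <= cp <= base + 9 for base in DIGIT_BASES):
            # All three blocks start at a multiple of 16, so the low
            # hex digit of the code point is the decimal value.
            out.append(str(cp & 0xF))
        else:
            out.append(ch)  # not a known digit: keep unchanged
    return "".join(out)


if __name__ == "__main__":
    print(low_nibble_digits("۲۰۱"))
```

Without the range check, any character would be reduced to its low four bits (for example "a", U+0061, would become 1), which is why the guard is added here.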
Since iconv can't seem to grok this, the next port of call would be to use the tr utility:
$ echo "۲۱" | tr '۰۱۲۳۴۵۶۷۸۹' '0123456789'
21
tr translates one set of characters to another, so we simply tell it to translate the set of Farsi digits to the set of Latin digits.
EDIT: As user @cuonglm points out, this requires a non-GNU tr (for example, the tr on a Mac), and it also requires that $LC_CTYPE is set to en_US.UTF-8.
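A likely explanation for the GNU tr restriction is that it translates byte by byte, while each Persian digit takes two bytes in UTF-8, so the two sets no longer line up one-to-one. The following Python 3 sketch only demonstrates the byte lengths; it does not reproduce tr itself, and the helper name utf8_bytes is hypothetical:

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of ch as a list of byte values."""
    return list(ch.encode("utf-8"))


if __name__ == "__main__":
    for persian, latin in zip("۰۱۲۳۴۵۶۷۸۹", "0123456789"):
        # Each Persian digit is two bytes; each ASCII digit is one.
        print([hex(b) for b in utf8_bytes(persian)], "->",
              [hex(b) for b in utf8_bytes(latin)])
```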
numconv is available in the repositories of some Linux distros (Debian and Ubuntu, at least). Once it is installed:
$ echo '۱۲۳۴۵۶۷۸۹۰' | numconv
1234567890
(Edit: Note that leading zeros are removed, and that this is purely for numeric conversion, and will not work with streams that contain non-numeric characters as well.)
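The difference between numeric conversion and character translation can be illustrated by analogy with Python 3's built-in int, which accepts Unicode decimal digits. This says nothing about how numconv is implemented; it only shows the same two side effects (leading zeros disappear, and non-numeric input is rejected). The function name to_number is made up:

```python
def to_number(text):
    """Parse a string of decimal digits as an integer (a numeric conversion)."""
    return int(text)


if __name__ == "__main__":
    print(to_number("۰۲۱"))  # the leading zero is not preserved
    try:
        to_number("۲۱ apples")
    except ValueError as err:
        # Mixed text cannot be parsed as a number.
        print("rejected:", err)
```

If leading zeros or surrounding text must survive, a character-level translation is the safer choice.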