Character count of language X in mixed text file?

Question

I have mixed-language text files, and would like to count the total number of printable characters belonging to one of the languages. It helps that the languages occupy different Unicode ranges.

My specific use-case involves Hebrew, Polytonic Greek, and English -- but I imagine a solution to this problem could be generalized for other contexts, too.

I would like to count the Hebrew characters only -- that's Unicode [\u0590-\u05ff]. Here's a brief sample input file (which, by my manual count, contains 62 Hebrew characters):

[ Ps117 ]‬
h1: ‫  הללו את יהוה כל גוים שבחוהו כל האמים ‬
r1: Praise the LORD, all nations! Extol him, all peoples!
g1: Αλληλουια. Αἰνεῖτε τὸν κύριον, πάντα τὰ ἔθνη, ἐπαινέσατε αὐτόν, πάντες οἱ λαοί,
b1: Alleluia. Praise the Lord all you nations: praise him all you peoples.

h2: ‫  כי גבר עלינו חסדו ואמת יהוה לעולם הללו יה ‬
r2: For great is his steadfast love toward us; and the faithfulness of the LORD endures for ever. Praise the LORD!
g2: ὅτι ἐκραταιώθη τὸ ἔλεος αὐτοῦ ἐφ' ἡμᾶς, καὶ ἡ ἀλήθεια τοῦ κυρίου μένει εἰς τὸν αἰῶνα.
b2: For his mercy has been abundant toward us: and the truth of the Lord endures for ever.

I'm on Ubuntu 16.04.2 LTS, if that helps. I imagine perl would be a likely option here, or some shell script ... but I don't know these things, which is why I'm asking!

(For the curious, the lines in my input are: h = Hebrew; r = Revised Standard Version; g = Greek Septuagint; b = Brenton translation of Septuagint; in each case followed by a verse number.)

So what about spaces? Also, it would be pretty straightforward to only count characters on lines starting with `h1: `, `h2: ` etc. — Stephen Rauch, Jun 20 '17 at 14:01
I'd use a Perl one-liner to remove all Unicode chars except those in your range (see e.g. [here](https://unix.stackexchange.com/questions/228558/how-to-make-tr-aware-of-non-asciiunicode-characters) for how to use Perl as a `tr` substitute, and `man perlre`), then count the remaining chars. — dirkt, Jun 20 '17 at 14:05
@StephenRauch - Yes, whitespace would be a bit of a pain. Fortunately, all I'm after is the "printable" Hebrew characters. The `h1: ` prefix is simply a quirk of this input file; hopefully any solution will rely on recognizing the unicode range, not my random file convention. ;) — Dɑvïd, Jun 20 '17 at 14:44
"Count" as in figure out how many distinct characters are, or their relative distribution; or just how many glyphs in this character range the file contains (basically the length of the text after you have removed all characters outside the desired range)? — tripleee, Jun 20 '17 at 17:35
@tripleee - Your third option (appropriately, given your username ;) = "`how many glyphs in this character range the file contains`". I've now tweaked the question to (hopefully!) make that clear. — Dɑvïd, Jun 20 '17 at 18:23

David Six · Answer 1 · 2017-06-20T16:34:59.173

4

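There is potentially an issue with determining the length of Unicode strings: the same visible text can be stored as a different number of code points depending on its normalization form. See this page from Twitter's developer docs for more details on normalization.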
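To make the normalization point concrete, here is a minimal Python sketch. The helper name `count_code_points` and the two sample strings are only illustrative. It shows one Hebrew letter-plus-dagesh written two ways, and that the code point counts differ until both are normalized. Note also that the precomposed form lives at U+FB31, outside the [\u0590-\u05ff] range, so a range-based count would miss it unless the text is normalized first.

```python
import unicodedata

# The same visible glyph (bet with dagesh), stored two ways.
decomposed = u"\u05d1\u05bc"   # HEBREW LETTER BET + HEBREW POINT DAGESH (2 code points)
precomposed = u"\ufb31"        # HEBREW LETTER BET WITH DAGESH (1 code point)


def count_code_points(text, form=None):
    """Return the number of code points, optionally after normalizing to `form`."""
    if form is not None:
        text = unicodedata.normalize(form, text)
    return len(text)


print(count_code_points(decomposed), count_code_points(precomposed))
print(count_code_points(decomposed, "NFC"), count_code_points(precomposed, "NFC"))
```

Normalizing everything to one form (NFC or NFD) before counting should make the result independent of how the file happened to be encoded.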

The character count will depend on the locale you have configured. You can run `locale` to verify that you have a UTF-8 locale configured. Once this is done, the code from @stephen-rauch should work.
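As a rough illustration of why the locale matters: a tool that treats the file as plain bytes sees two bytes for each Hebrew letter in UTF-8, while a tool that decodes UTF-8 sees one character. The sketch below (the function name `char_and_byte_counts` is just for illustration) shows the two numbers side by side in Python; as far as I know this is the same difference you would see between `wc -m` in a UTF-8 locale and in the C locale.

```python
def char_and_byte_counts(text):
    """Return (number of code points, number of UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))


# Four Hebrew letters: shin, lamed, vav, final mem.
sample = u"\u05e9\u05dc\u05d5\u05dd"
print(char_and_byte_counts(sample))
```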

Depending on which regex library you use, you might also be able to use named scripts like \p{Hebrew} and \P{Greek}. Here is an example of using \P{Hebrew} to remove all non-Hebrew characters: Link
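Python's built-in `re` module does not understand `\p{...}` script classes (the third-party `regex` module does, if you can install it). As a stdlib-only approximation, you can test whether each character's Unicode name starts with the script name. This is a sketch, not an exact equivalent of the script property; the function name `count_by_name_prefix` and the sample string are only illustrative.

```python
import unicodedata


def count_by_name_prefix(text, prefix="HEBREW "):
    """Count characters whose Unicode name starts with `prefix`.

    This approximates a script test: names such as 'HEBREW LETTER ALEF'
    or 'HEBREW POINT DAGESH' match, while spaces, digits and Latin or
    Greek letters do not.
    """
    return sum(1 for ch in text if unicodedata.name(ch, "").startswith(prefix))


sample = u"h1: הללו את יהוה / Praise the LORD / Αλληλουια"
print(count_by_name_prefix(sample))             # Hebrew characters
print(count_by_name_prefix(sample, "GREEK "))   # Greek characters
```

Changing the prefix is all it takes to count another script, which is one way to generalize beyond Hebrew.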

Edited: Initial results were due to a misconfigured locale.

edited Jun 20 '17 at 16:34

answered Jun 20 '17 at 14:10

David Six

@Dɑvïd It looks like the output of `wc` will depend on your locale; I will update my answer to reflect this. – David Six Jun 20 '17 at 15:30

score 4 · Answer 2 · answered Jun 20 '17 at 19:15

These seem to come close for me (tested on Ubuntu 16.04):

$ perl -0777 -MEncode -ne 'print decode("UTF-8",$_) =~ tr/\x{0590}-\x{05ff}//,"\n"' input
62
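$ perl -0777 -MEncode -ne 'print decode("UTF-8",$_) =~ tr/\p{Hebrew}//,"\n"' input
63

The first count is the one to trust. `tr` does not interpret `\p{...}` property classes (see `perldoc perlop`), so the second command counts the literal characters `p`, `{`, `H`, `e`, `b`, `r`, `w` and `}` wherever they occur; the English lines of the sample contain 63 of those, which is where that number comes from. To count by script property, a regex match is needed instead, along these lines (not run against the sample here):

$ perl -0777 -MEncode -ne 'print scalar(() = decode("UTF-8",$_) =~ /\p{Hebrew}/g),"\n"' input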

answered Jun 20 '17 at 19:15

steeldriver

The right answer (if I've counted correctly) is 62 -- I added it to the question. I wonder what `\p{Hebrew}` picks up that the range itself doesn't? Anyway -- thanks! – Dɑvïd Jun 20 '17 at 19:28

score 3 · Answer 3 · answered Jun 20 '17 at 15:47

Using Python you can do something like this:

Code:

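# coding: utf-8
import io
import re

# Match runs of characters in the Hebrew block, U+0590..U+05FF.
#find_hebrew = re.compile(ur'[\u0590-\u05ff]+')  # python 2
find_hebrew = re.compile(r'[\u0590-\u05ff]+')   # python 3

count = 0
# io.open decodes as UTF-8 regardless of locale and works on Python 2 and 3;
# the old 'rU' mode is rejected by Python 3.11 and later.
with io.open('text_file', encoding='utf-8') as f:
    for line in f:
        # Add up the length of every Hebrew run found on this line.
        for n in find_hebrew.findall(line):
            count += len(n)
print(count)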

Result:

answered Jun 20 '17 at 15:47

Stephen Rauch


I've done a small tweak (made possible to pass input filename as argument) and added shebang and commented use notes, and [saved as a gist](https://gist.github.com/dajare/a610ee6ed10784cce972fc977cd0f095). If my tweaks could use tweaking, do tell! ;) Thanks again. – Dɑvïd Jun 21 '17 at 13:50
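
For readers who cannot open the gist, a tweak of that kind might look roughly like the sketch below. This is not the gist's actual content; the function names `count_hebrew_text` and `count_hebrew_file` are made up for illustration, and the range is the same [\u0590-\u05ff] used in the answer.

```python
import io
import re

# Same Hebrew block as in the answer above: U+0590..U+05FF.
HEBREW = re.compile(u"[\u0590-\u05ff]")


def count_hebrew_text(text):
    """Count the characters of `text` that fall in the Hebrew block."""
    return len(HEBREW.findall(text))


def count_hebrew_file(path):
    """Count Hebrew-block characters in a UTF-8 encoded file at `path`."""
    with io.open(path, encoding="utf-8") as f:
        return sum(count_hebrew_text(line) for line in f)
```

To take the filename from the command line, the script could end with something like `print(count_hebrew_file(sys.argv[1]))` after an `import sys`.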
