4

I want a Latin word list for research/reference purposes (like /usr/share/dict/words).

There would appear to be no such word list (apt-file search /usr/share/dict | sort | uniq | grep latin turns up nothing), but there is a DICT English-Latin dictionary: dict-freedict-eng-lat.

Is there an easy way to get a word list from this?

I tried some quick manual parsing of the .dz using sed, but the format looks complicated enough that it needs real parsing. I also tried the dictunformat command, but it produces a c5 database, which looks to be a binary format, and I can't find tools to interact with such files.

Att Righ
  • 1,176
  • 11
  • 29
  • Does it have to be a "DICT formatted dictionary"? You can get wordlists e.g. from spelling dictionaries (`ibritish`, `ibritish-huge`, ...) using e.g. aspell (`aspell -l en dump master | aspell -l en expand`). – dirkt Mar 16 '17 at 12:23
  • That's useful to know and would apply in many cases. There doesn't seem to be a latin aspell dictionary in debian (`apt-cache search aspell | grep latin`) – Att Righ Mar 16 '17 at 12:49

3 Answers

4
zcat /usr/share/dictd/freedict-eng-lat.dict.dz | perl -e 'my %dict; $start=0; $/="\n"; while (<>) { next if $_ =~ m/(\/|\x90)/; chomp; $_ =~ s/[0-9\. ]*//g; $start = 1 if $_ eq "abecedarium"; next if $start==0; @words=split(/\;/,$_); foreach my $word (@words) { $dict{$word}=1; } } $,="\n"; print sort keys %dict;'

This uncompresses the file, skips the English lines containing /pronunciation/ markers, skips lines containing the odd DLE control character, and skips all header lines until the first real word, "abecedarium". It then removes digits, dots and spaces, splits the ";"-separated forms, and adds every word to a hash so entries are unique. At the end, it prints all words separated by $,, which is set to a newline (\n).
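
The same filtering steps can be sketched with awk. The pipeline below runs on a tiny made-up sample so it is self-contained; for the real dictionary, replace the printf with zcat /usr/share/dictd/freedict-eng-lat.dict.dz:

```shell
# awk rendering of the same steps: skip everything before "abecedarium",
# drop lines containing "/", strip digits/dots/spaces, split on ";",
# and de-duplicate with sort -u. The sample lines are illustrative only.
printf '%s\n' \
  'some header line' \
  'abecedarium 1.' \
  'water /wota/' \
  'aqua; unda' |
awk '/abecedarium/ { on = 1 }
     on && !/\// { gsub(/[0-9. ]/, ""); n = split($0, w, ";")
                   for (i = 1; i <= n; i++) if (w[i] != "") print w[i] }' |
sort -u
```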

sample output:

ager
agere
agna
agnellina
agnina
claudiuf
  • 266
  • 1
  • 2
  • Thanks for doing the parsing I was too lazy to do :). I wonder if there's a more magic way with less code, but in the absence of other answers this works. – Att Righ Mar 16 '17 at 23:38
  • of course there is. `zcat /usr/share/dictd/freedict-eng-lat.dict.dz | grep -e '^ [a-zA-Z0-9]' | grep -o -P '[[:alpha:]]+' | sort -du` – claudiuf Mar 17 '17 at 01:19
  • it is just less efficient imho – claudiuf Mar 17 '17 at 01:20
  • cool +1 (just one warning: `grep -o -P '[[:alpha:]]+'` would fail on words like ævitas) – JJoao Mar 17 '17 at 10:15
2

If I remember correctly, .dz is a variant of gzip that allows gunzipping just the necessary chunks. Try:

zcat dict-freedict-eng-lat.dz

Most of the ".dz" are generated from more comprehensible formats (in the case the format is TEI) using freedict-tools.

UPDATE: (I like hacky solutions, but) here is the "not so hacky" way:

1) If you want Latin, get the sources of freedict Lat-Eng (the inverse dictionary):

wget "https://sourceforge.net/projects/freedict/files/Latin%20-%20English/0.1.1/freedict-lat-eng-0.1.1.src.tar.bz2"

2) unpack it:

tar -xvjf freedict-lat-eng-0.1.1.src.tar.bz2

and enjoy the pleasure of dealing with the sources...

3) get the Latin entries (the orth XML tag) from the XML-TEI source (lat-eng/lat-eng.tei):

xidel -e "//orth" lat-eng/lat-eng.tei

One last suggestion: use the Latin-German dictionary, which is more complete:

wget "https://sourceforge.net/projects/freedict/files/Latin%20-%20German/0.4/freedict-lat-deu-0.4.src.tar.bz2"
tar ...
xidel -e //orth lat-deu/lat-deu.tei | sort -u | wc    # 9730 entries
JJoao
  • 11,887
  • 1
  • 22
  • 44
  • Yep that's correct. It gives you something human readable. It's one of the first things I tried (together with a `head` and a `sed 0~2 p`) but there are multiline entries that confuse this (I think the index file might deal with these somehow) – Att Righ Mar 16 '17 at 23:41
  • @AttRigh, sorry I misunderstood the problem. I updated with a complementary suggestion. – JJoao Mar 17 '17 at 09:30
  • Awesome. I didn't realise that freedict have a [nice collection of xml dictionaries](https://github.com/freedict/fd-dictionaries/search?utf8=✓&q=extension:tei&type=Code) in this xml [TEI format](http://www.tei-c.org/index.xml). These don't seem to be packaged by debian but can be fetched directly (as you do). – Att Righ Mar 17 '17 at 12:52
  • @AttRigh, Installing one at the time is always a better Idea... (see also `dict-freedict-all` -- it will show you the list of near 100 debian packages of freedicts) – JJoao Mar 17 '17 at 14:55
  • 1
    Cool cool. I meant to say that the xml files aren't contained in the debian packages as far as I can tell. – Att Righ Mar 17 '17 at 14:56
  • Correct. I would like to see those XML-TEI files in some `/usr/share/dict....` – JJoao Mar 17 '17 at 15:01
  • where does xidel come from? – slashdottir Sep 15 '17 at 19:38
  • @slashdottir, for example `http://www.videlibri.de/xidel.html#downloads` – JJoao Sep 15 '17 at 22:02
-1

I wrote an article on how to dump and convert Aspell dictionaries to a wordlist, and later to a searchable MySQL/MariaDB database:

https://www.joe0.com/2018/02/13/how-to-dump-and-convert-aspell-dictionary-to-wordlist-or-searchable-mysql-mariadb-database/

jjj
  • 99
  • 1