Look for English word in a file via terminal

Question

How can I find and print the English words contained in a file from the Linux command line?

score 11 · Accepted Answer · answered Oct 02 '12 at 04:36

GNU grep has the following options:

grep --only-matching --ignore-case --fixed-strings --file /usr/share/dict/british-english-insane /path/to/file.txt

This outputs the matched strings, one per line. Here /usr/share/dict/british-english-insane is a wordlist provided by the Debian package wbritish-insane.
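Because --fixed-strings matches substrings, a dictionary word that occurs inside a longer word is reported too. The following is a minimal sketch of a whole-word variant, assuming GNU grep. The function name `find_dict_words` and the throwaway demo files are hypothetical and not part of the original answer.

```shell
#!/bin/sh
# find_dict_words WORDLIST FILE   (hypothetical helper name)
# Print each distinct whole-word match of a WORDLIST entry found in FILE.
find_dict_words() {
    grep --only-matching --ignore-case --word-regexp --fixed-strings \
         --file="$1" -- "$2" | LC_ALL=C sort -u
}

# Demo on throwaway data, so no real dictionary is needed.
tmpdir=$(mktemp -d)
printf '%s\n' cat the license > "$tmpdir/words.txt"
printf '%s\n' 'The cat sat on the mat.' 'Catalog of licenses' > "$tmpdir/sample.txt"
find_dict_words "$tmpdir/words.txt" "$tmpdir/sample.txt"
rm -r "$tmpdir"
```

With --word-regexp, "cat" no longer matches inside "Catalog" and "license" no longer matches inside "licenses". The trailing sort -u collapses repeated hits to one line per distinct spelling.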

Gilles Quénot · Answer 2 · 2012-10-02T01:46:58.867

4

Heh, funny!

file=/usr/share/licenses/common/GPL3/license.txt
dict=/usr/share/dict/cracklib-small

while IFS= read -r word; do
    # Report the word if it occurs in the file as a whole word, ignoring case.
    grep >/dev/null -i "\<$word\>" "$file" &&
        printf 'Word "%s" found in GPLv3...\n' "$word"
done < "$dict"

Output:

Word "a" found in GPLv3...
Word "ability" found in GPLv3...
Word "about" found in GPLv3...
(...)

The cracklib-small file comes with the cracklib package: http://sourceforge.net/projects/cracklib

edited Oct 02 '12 at 01:46

answered Oct 02 '12 at 01:21

Gilles Quénot


Corrected a mistaken input file – Gilles Quénot Oct 02 '12 at 01:48
If you use `-q` instead of `>/dev/null`, it will imply `-m1` and it will not read the **entire** file every iteration of the loop. Personally, I would store the file in memory if it was small enough. – jordanm Oct 02 '12 at 05:02

This is *extremely* (big-time!) slow... cracklib-small has only 52,875 words and it has been running for over half an hour (compare with *my* and *donothingsuccessfully's* times; see my answer, where the dictionary has 390,000 words). You are calling *grep* too many times; try calling it *once*. – Peter.O Oct 02 '12 at 12:36
My first thought was not to make something robust and quick, but to write a simple script _that works_ and is easy for beginners to understand. – Gilles Quénot Oct 03 '12 at 11:06
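A sketch of the loop with the `-q` suggestion from the comments applied, so that grep stops reading at the first match. The function name `list_found_words` is hypothetical. `-F` and `-w` are swapped in for the `\<...\>` regex so that dictionary entries are treated as literal whole words. It still starts one grep per dictionary word, so it stays slow for large lists.

```shell
#!/bin/sh
# list_found_words DICT FILE   (hypothetical helper name)
# Print every DICT word that occurs in FILE as a whole word, ignoring case.
list_found_words() {
    while IFS= read -r word; do
        [ -n "$word" ] || continue            # skip empty dictionary lines
        # -q: exit at the first match without printing anything
        grep -qiwF -e "$word" -- "$2" && printf '%s\n' "$word"
    done < "$1"
}

# Demo on throwaway data.
tmpdir=$(mktemp -d)
printf '%s\n' free software zebra > "$tmpdir/dict.txt"
printf '%s\n' 'Free Software Foundation' > "$tmpdir/text.txt"
list_found_words "$tmpdir/dict.txt" "$tmpdir/text.txt"
rm -r "$tmpdir"
```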

Stéphane Chazelas · Answer 3 · 2012-10-02T14:21:49.610

3

grep-based solutions will generally be quite slow, especially with large word lists.

You can take advantage of the fact that word lists are already sorted (although on my system at least the british-english one seems to have been sorted in the POSIX/C locale, even though it is UTF-8 encoded):

tr -cs "[:alpha:]'" '[\n*]' < /etc/passwd |
  LC_ALL=C sort -u |
  LC_ALL=C comm -12 - /usr/share/dict/british-english-insane

You may also want to convert everything to lowercase or uppercase beforehand if you want to look for words in a case-insensitive manner.
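A minimal sketch of that case-insensitive variant, under two assumptions: the wordlist is already sorted in the C locale, and its entries are lowercase (title-case entries would not match after folding). The function name `words_in_common` is hypothetical.

```shell
#!/bin/sh
# words_in_common WORDLIST FILE   (hypothetical helper name)
# Print the lowercased words of FILE that also appear in WORDLIST.
# Assumes WORDLIST is lowercase and sorted in the C locale.
words_in_common() {
    tr '[:upper:]' '[:lower:]' < "$2" |       # fold the input to lowercase
        tr -cs "[:alpha:]'" '[\n*]' |         # one word per line
        LC_ALL=C sort -u |
        LC_ALL=C comm -12 - "$1"              # keep lines present in both
}

# Demo on throwaway data.
tmpdir=$(mktemp -d)
printf '%s\n' apple root shell > "$tmpdir/words.txt"
printf '%s\n' 'Root uses the Shell; ROOT again' > "$tmpdir/text.txt"
words_in_common "$tmpdir/words.txt" "$tmpdir/text.txt"
rm -r "$tmpdir"
```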

edited Oct 02 '12 at 14:21

answered Oct 02 '12 at 12:21

Stéphane Chazelas


This is good; `tr -c` is much simpler than excluding values manually... Perhaps it could use some case-insensitive awareness, as the dictionary is lower-case or title-case only. +1 – Peter.O Oct 02 '12 at 13:41

Peter.O · Answer 4 · 2012-10-02T13:46:37.727

file=/usr/lib/python2.6/LICENSE.txt
dict=/usr/share/dict/british-english-huge   # or any suitable list

sort "$dict" \
     <(sed "s/[].,\"?!;:#$%&()*+<>=@\^_{}|~[]\+/\n/g   # keep ' for now
            s|[-/[[:digit:][:blank:][:cntrl:]]\+|\n|g
            s/\<'\+/\n/; s/'\>\+/\n/                   # remove '
           " <(<"$file" tr '[:upper:]' '[:lower:]') ) |
uniq -c | awk '$1 > +1 {print $2}'

This found 382 words (case-insensitively), with the following timing:

real   0m1.723s
user   0m1.872s
sys    0m0.048s
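The idea behind this pipeline appears to be: merge the dictionary with the words extracted from the file, sort the result, and keep whatever shows up more than once. Below is a simplified sketch of that idea, not the author's exact script. It deduplicates each input first and uses `uniq -d`, and it splits words with `tr` rather than the `sed` expressions above. The function name `dup_words` is hypothetical.

```shell
#!/bin/sh
# dup_words WORDLIST FILE   (hypothetical helper name)
# Print the words that occur both in WORDLIST and, lowercased, in FILE.
dup_words() {
    {
        LC_ALL=C sort -u "$1"                  # each dictionary word once
        tr '[:upper:]' '[:lower:]' < "$2" |
            tr -cs "[:alpha:]'" '[\n*]' |
            LC_ALL=C sort -u                   # each file word once
    } | LC_ALL=C sort | uniq -d                # duplicates are in both inputs
}

# Demo on throwaway data.
tmpdir=$(mktemp -d)
printf '%s\n' apple root shell > "$tmpdir/words.txt"
printf '%s\n' 'Root shell root' > "$tmpdir/text.txt"
dup_words "$tmpdir/words.txt" "$tmpdir/text.txt"
rm -r "$tmpdir"
```

Deduplicating both inputs first means a line can only repeat if it occurs in both, so a word that appears twice in the dictionary alone is not reported.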
