How to find and print out English words contained in a file via the Linux command line?

Anthon
Agustinus Verdy

4 Answers

GNU grep has the following options:

grep --only-matching --ignore-case --fixed-strings --file /usr/share/dict/british-english-insane /path/to/file.txt

This prints each matched string on its own line. Here, /usr/share/dict/british-english-insane is a word list provided by the Debian package wbritish-insane.
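
With --fixed-strings alone, a dictionary entry also matches inside longer words (for example, "cat" inside "concatenate"). The following is a minimal sketch, assuming GNU grep, that adds --word-regexp (-w) to keep whole-word matches only and removes duplicates with sort -u. The word list and input file are hypothetical samples created in a temporary directory, not the system word list named above.

```shell
#!/bin/sh
# Hypothetical sample data; real use would point at a system word list.
tmpdir=$(mktemp -d)
printf '%s\n' cat dog tree > "$tmpdir/words.txt"
printf '%s\n' 'The Cat will concatenate files.' 'A dog barked.' > "$tmpdir/input.txt"

# -o: print only the matched text    -i: ignore case
# -w: match whole words only         -F: patterns are fixed strings
# -f: read the patterns from a file
found=$(grep -o -i -w -F -f "$tmpdir/words.txt" "$tmpdir/input.txt" | LC_ALL=C sort -u)
printf '%s\n' "$found"

rm -rf "$tmpdir"
```

The matches are printed as they appear in the input, so the case of the text (not of the dictionary entry) is preserved.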

Heh, funny!

file=/usr/share/licenses/common/GPL3/license.txt
dict=/usr/share/dict/cracklib-small

# Check each dictionary word against the file, one grep call per word.
while read -r word; do
    grep >/dev/null -i "\<$word\>" "$file" &&
        printf 'Word "%s" found in GPLv3...\n' "$word"
done < "$dict"

Output:

Word "a" found in GPLv3...
Word "ability" found in GPLv3...
Word "about" found in GPLv3...
(...)

The cracklib-small file comes with the cracklib package: http://sourceforge.net/projects/cracklib

Gilles Quénot
  • Corrected a mistaken input file – Gilles Quénot Oct 02 '12 at 01:48
  • If you use `-q` instead of `>/dev/null`, it will imply `-m1` and it will not read the **entire** file every iteration of the loop. Personally, I would store the file in memory if it was small enough. – jordanm Oct 02 '12 at 05:02
  • This is *extremely* (big-time!) slow... cracklib-small has only 52,875 words and it has been running for over half an hour (compare with *my* and *donothingsuccessfully's* times, see my answer; that dictionary has 390,000 words)... you are calling *grep* too many times; try calling it *once*. – Peter.O Oct 02 '12 at 12:36
  • My first thought was not to make something robust and quick, but to make a simple script _that works_ and is easy for beginners to understand. – Gilles Quénot Oct 03 '12 at 11:06
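
The -q suggestion from the comments can be sketched as follows. This is only an illustration under stated assumptions: the dictionary and text files are small hypothetical samples in a temporary directory, and -w with -F is used here in place of the \<...\> anchors so that dictionary entries containing regex metacharacters are taken literally. It still runs grep once per word, so it stays slow on large word lists.

```shell
#!/bin/sh
# Hypothetical sample data standing in for the dictionary and the license text.
tmpdir=$(mktemp -d)
printf '%s\n' a ability zebra > "$tmpdir/dict.txt"
printf '%s\n' 'You have the ability to share a copy.' > "$tmpdir/text.txt"

found=$(
    while IFS= read -r word; do
        # -q: print nothing and stop reading at the first match
        # -w: whole-word match, -F: literal string, -i: ignore case
        if grep -q -i -w -F -e "$word" "$tmpdir/text.txt"; then
            printf 'Word "%s" found\n' "$word"
        fi
    done < "$tmpdir/dict.txt"
)
printf '%s\n' "$found"

rm -rf "$tmpdir"
```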

grep-based solutions will generally be quite slow, especially with large word lists.

You can take advantage of the fact that word lists are already sorted (on my system, however, at least the british-english one seems to have been sorted in the POSIX/C locale even though it is UTF-8 encoded):

tr -cs "[:alpha:]'" '[\n*]' < /etc/passwd |
  LC_ALL=C sort -u |
  LC_ALL=C comm -12 - /usr/share/dict/british-english-insane

You may also want to convert everything to lowercase or uppercase beforehand if you want to look for words in a case-insensitive manner.
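
One way the case-insensitive variant might look is sketched below. It assumes that lowercasing both sides is acceptable, and it re-sorts the lowercased dictionary because changing case can change the sort order. The dictionary and text files are hypothetical samples created in a temporary directory.

```shell
#!/bin/sh
# Hypothetical sample data; a real run would use a system word list instead.
tmpdir=$(mktemp -d)
printf '%s\n' Apple banana "don't" Zebra > "$tmpdir/dict.txt"
printf '%s\n' "I DON'T like Banana, or 42 apples." > "$tmpdir/text.txt"

# Lowercase the dictionary and sort it again in the C locale,
# since comm needs both inputs sorted the same way.
tr '[:upper:]' '[:lower:]' < "$tmpdir/dict.txt" |
    LC_ALL=C sort -u > "$tmpdir/dict.lc"

# Lowercase the text, split it into one word per line, then keep
# only the lines common to both sorted inputs (comm -12).
found=$(
    tr '[:upper:]' '[:lower:]' < "$tmpdir/text.txt" |
        tr -cs "[:alpha:]'" '[\n*]' |
        LC_ALL=C sort -u |
        LC_ALL=C comm -12 - "$tmpdir/dict.lc"
)
printf '%s\n' "$found"

rm -rf "$tmpdir"
```

Because comm compares whole lines, "apples" in the text does not match the dictionary entry "apple".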

Stéphane Chazelas
  • This is good; `tr -c` is much simpler than excluding values manually... Perhaps it could use some case (insensitive) awareness, as the dictionary is lower-case or title-case only. +1 – Peter.O Oct 02 '12 at 13:41

file=/usr/lib/python2.6/LICENSE.txt
dict=/usr/share/dict/british-english-huge   # or any suitable list

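# Merge the dictionary with the file's words; any line seen more than once
# is a word present in both. The file's words are de-duplicated first
# (sort -u) so that a word repeated only in the file is not counted as a match.
sort "$dict" \
     <(sed "s/[].,\"?!;:#$%&()*+<>=@\^_{}|~[]\+/\n/g   # keep ' for now
            s|[-/[[:digit:][:blank:][:cntrl:]]\+|\n|g
            s/\<'\+/\n/; s/'\>\+/\n/                   # remove '
           " <(<"$file" tr '[:upper:]' '[:lower:]') | sort -u) |
uniq -c | awk '$1 > +1 {print $2}'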

This found 382 words (case-insensitive), with the following timing:

real   0m1.723s
user   0m1.872s
sys    0m0.048s
Peter.O