How can I find and print the English words contained in a file from the Linux command line?
4 Answers
GNU grep has the following options:
grep --only-matching --ignore-case --fixed-strings --file /usr/share/dict/british-english-insane /path/to/file.txt
This prints each matched string on its own line.
Here /usr/share/dict/british-english-insane is a wordlist provided by the Debian package wbritish-insane.
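One thing to watch for: with --fixed-strings alone, a wordlist entry can also match in the middle of a longer word. Here is a minimal sketch on throwaway sample files (the file contents and variable names are made up for illustration) that adds GNU grep's --word-regexp, which is not part of the command above, so that only whole words are reported:

```shell
# Hypothetical sample data; real use would point at an actual wordlist and text file.
wordlist=$(mktemp)
text=$(mktemp)
printf 'cat\nmat\nthe\n' > "$wordlist"
printf 'The cat sat on the mat.\n' > "$text"

# --word-regexp (not in the original command) restricts matches to whole words.
# The matches are then lowercased and deduplicated.
found=$(grep --only-matching --ignore-case --word-regexp --fixed-strings \
             --file "$wordlist" "$text" |
        tr '[:upper:]' '[:lower:]' | sort -u)
printf '%s\n' "$found"

rm -f "$wordlist" "$text"
```

Lowercasing and sort -u are optional; they only collapse repeated hits such as "The" and "the" into a single entry.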
Heh, funny!
file=/usr/share/licenses/common/GPL3/license.txt
dict=/usr/share/dict/cracklib-small
while IFS= read -r word; do
    # Report the word if it occurs in the file as a whole word (case-insensitive)
    grep >/dev/null -i "\<$word\>" "$file" &&
        printf 'Word "%s" found in GPLv3...\n' "$word"
done < "$dict"
Output:
Word "a" found in GPLv3...
Word "ability" found in GPLv3...
Word "about" found in GPLv3...
(...)
The cracklib-small file comes with the cracklib package: http://sourceforge.net/projects/cracklib
- Corrected a mistaken input file – Gilles Quénot Oct 02 '12 at 01:48
- If you use `-q` instead of `>/dev/null`, it will imply `-m1` and it will not read the **entire** file every iteration of the loop. Personally, I would store the file in memory if it was small enough. – jordanm Oct 02 '12 at 05:02
- This is *extremely* (big-time!) slow... cracklib-small has only 52,875 words and it has been running for over half an hour (compare with *my* and *donothingsuccessfully's* times; see my answer, where the dictionary has 390,000 words)... you are calling *grep* too many times; try calling it *once*. – Peter.O Oct 02 '12 at 12:36
- My first thought was not to make something robust & quick, but to write a simple script _that works_ and is easy for beginners to understand. – Gilles Quénot Oct 03 '12 at 11:06
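For reference, jordanm's `-q` suggestion might look like the sketch below. It still starts one grep per dictionary word, so it does not address Peter.O's objection; it only lets each grep stop at the first hit. The sample files and the use of `-w` and `-F` in place of the `\<...\>` pattern are assumptions made for illustration:

```shell
# Hypothetical stand-ins for the licence text and the dictionary.
file=$(mktemp)
dict=$(mktemp)
printf 'This program is free software.\n' > "$file"
printf 'free\nprogram\nzebra\n' > "$dict"

found=$(
    while IFS= read -r word; do
        # -q: print nothing and exit at the first match
        # -i: ignore case, -w: whole words only, -F: treat the word literally
        grep -qiwF -- "$word" "$file" && printf '%s\n' "$word"
    done < "$dict"
)
printf '%s\n' "$found"

rm -f "$file" "$dict"
```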
grep-based solutions will generally be quite slow, especially with large word lists.
You can take advantage of the fact that word lists are already sorted (although on my system, at least the british-english one seems to have been sorted in the POSIX/C locale even though it is UTF-8 encoded):
tr -cs "[:alpha:]'" '[\n*]' < /etc/passwd |
LC_ALL=C sort -u |
LC_ALL=C comm -12 - /usr/share/dict/british-english-insane
You may also want to convert everything to lowercase or uppercase beforehand if you want to look for words in a case-insensitive manner.
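A case-insensitive variant could be sketched as follows, assuming it is acceptable to lowercase a copy of the dictionary and re-sort it. The sample text, the three-word dictionary, and the variable names are hypothetical:

```shell
# Hypothetical sample input and a tiny stand-in dictionary.
text=$(mktemp)
dict=$(mktemp)
lower_dict=$(mktemp)
printf 'Hello World, hello qwxz!\n' > "$text"
printf 'Hello\nworld\nzebra\n' > "$dict"

# Lowercase the dictionary and re-sort it in the C locale, since comm
# needs both inputs sorted in the same order.
tr '[:upper:]' '[:lower:]' < "$dict" | LC_ALL=C sort -u > "$lower_dict"

# Lowercase the text, split it into one word per line, then intersect.
matches=$(tr '[:upper:]' '[:lower:]' < "$text" |
          tr -cs "[:alpha:]'" '[\n*]' |
          LC_ALL=C sort -u |
          LC_ALL=C comm -12 - "$lower_dict")
printf '%s\n' "$matches"

rm -f "$text" "$dict" "$lower_dict"
```

Re-sorting the lowercased dictionary matters: lowercasing can change the relative order of entries, and comm silently gives wrong results on unsorted input.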
- This is good; `tr -c` is much simpler than excluding values manually... Perhaps it could use some case (insensitive) awareness, as the dictionary is lower-case or title-case only. +1 – Peter.O Oct 02 '12 at 13:41
file=/usr/lib/python2.6/LICENSE.txt
dict=/usr/share/dict/british-english-huge # or any suitable list
sort "$dict" \
<(sed "s/[].,\"?!;:#$%&()*+<>=@\^_{}|~[]\+/\n/g # keep ' for now
s|[-/[[:digit:][:blank:][:cntrl:]]\+|\n|g
s/\<'\+/\n/; s/'\>\+/\n/ # remove '
" <(<"$file" tr '[:upper:]' '[:lower:]') ) |
uniq -c | awk '$1 > +1 {print $2}'
This found 382 words (case-insensitive), with the following timing:
real 0m1.723s
user 0m1.872s
sys 0m0.048s
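The counting idea (merge the two lists, then keep whatever appears more than once) can be shown on a small scale. This sketch swaps in `uniq -d` for the `uniq -c | awk` filter and assumes both lists are deduplicated first, so that a word repeated in the text but absent from the dictionary is not reported. The sample data is made up:

```shell
# Hypothetical word lists: words extracted from a text, and a tiny dictionary.
words=$(mktemp)
dict=$(mktemp)
printf 'free free software qwxz qwxz\n' | tr -s ' ' '\n' | sort -u > "$words"
printf 'free\nsoftware\nzebra\n' | sort -u > "$dict"

# With both inputs deduplicated, a line occurring twice in the merged,
# sorted stream must be present in both lists; uniq -d prints it once.
common=$(sort "$words" "$dict" | uniq -d)
printf '%s\n' "$common"

rm -f "$words" "$dict"
```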