I am looking for a command to count the number of words in a file. For instance, if a file is like this,
today is a
good day
then it should print 5, since there are 5 words there.
The command wc (short for "word count") can do it:
$ wc -w <file>
$ cat sample.txt
today is a
good day
$ wc -w sample.txt
5 sample.txt
# just the number (thanks to Stephane Chazelas' comment)
$ wc -w < sample.txt
5
I came up with this for JUST the number:
wc -w [file] | cut -d' ' -f1
5
I also like the wc -w < [file] approach
Finally, for storing just the word count in a variable, you could use the following:
myVar=($(wc -w /path/to/file))
This lets you skip the filename elegantly: the unquoted command substitution is word-split into an array, and $myVar expands to its first element, which is the count.
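If you'd rather avoid the array trick, redirecting the file into wc works too, since wc prints no filename when it reads standard input (a small sketch using a throwaway sample.txt):

```shell
# Create the sample file from the question.
printf 'today is a\ngood day\n' > sample.txt

# Reading from stdin, wc has no filename to print, so only the count appears.
count=$(wc -w < sample.txt)
echo "$count"   # the word count, here 5 (some wc implementations pad it with spaces)
```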
$ function wordfrequency() { awk 'BEGIN { FS="[^a-zA-Z]+" } { for (i=1; i<=NF; i++) { word = tolower($i); words[word]++ } } END { for (w in words) printf("%3d %s\n", words[w], w) }' | sort -rn; }
$ cat your_file.txt | wordfrequency
This lists the frequency of each word occurring in the provided file. I know it's not what you asked for, but it's better! If you want to see the occurrences of your word, you can just do this:
$ cat your_file.txt | wordfrequency | grep yourword
I even added this function to my .dotfiles
Source: AWK-ward Ruby
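A quick check of the function (redefined here so the snippet stands alone; note the semicolon between the two awk statements inside the loop body, which is required when the script is written on fewer lines):

```shell
# Word-frequency counter: split on runs of non-letters, lowercase each word,
# tally into an array, then sort by count, highest first.
wordfrequency() {
  awk 'BEGIN { FS = "[^a-zA-Z]+" }
       { for (i = 1; i <= NF; i++) { word = tolower($i); words[word]++ } }
       END { for (w in words) printf("%3d %s\n", words[w], w) }' | sort -rn
}

printf 'Good day today is a good day\n' | wordfrequency
# "good" and "day" each appear twice; "today", "is", and "a" once.
```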
The wc program counts "words", but those are not necessarily the "words" that many people see when they examine a file. The vi program, for instance, uses a different measure of "words", delimiting them based on their character classes, while wc simply counts things separated by whitespace. The two measures can be radically different. Consider this example:
first,second
vi sees three words (first and second as well as the comma separating them), while wc sees one (there is no whitespace on that line). There are many ways to count words, some are less useful than others.
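You can check wc's behaviour on that line directly (vi's count of three is only visible interactively, via its word-motion commands):

```shell
# One line containing no whitespace: wc -w sees a single "word".
printf 'first,second\n' | wc -w   # counts 1 word
```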
While Perl would be better suited to writing a counter for the vi-style words, here is a quick example using sed, tr and wc (moderately portable; ^M stands for a literal carriage return):
#!/bin/sh
in_words="[[:alnum:]_]"
in_punct="[][{}\\|:\"';<>,./?\`~!@#$%^&*()+=-]"
sed -e "s/\($in_words\)\($in_punct\)/\1^M\2/g" \
    -e "s/\($in_punct\)\($in_words\)/\1^M\2/g" \
    -e "s/[[:space:]]/^M/g" \
    "$@" |
tr '\r' '\n' |
sed -e '/^$/d' |
wc -l
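For comparison, here is a shorter approximation of the same vi-style split (my own sketch, not from the original answer), relying on grep's -o option to print each match on its own line:

```shell
# Each match is either a run of word characters (letters, digits, underscore)
# or a run of anything else that is not whitespace, i.e. punctuation.
# One match per output line, so counting lines counts vi-style words.
printf 'first,second\n' |
  grep -oE '[[:alnum:]_]+|[^[:alnum:]_[:space:]]+' |
  wc -l   # 3: "first", ",", "second"
```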
Comparing counts:
wc gives 28. For reference, POSIX vi says:
In the POSIX locale, vi shall recognize five kinds of words:

1. A maximal sequence of letters, digits, and underscores, delimited at both ends by:
   - Characters other than letters, digits, or underscores
   - The beginning or end of a line
   - The beginning or end of the edit buffer
2. A maximal sequence of characters other than letters, digits, underscores, or <blank> characters, delimited at both ends by:
   - A letter, digit, underscore, or <blank> character
   - The beginning or end of a line
   - The beginning or end of the edit buffer
3. One or more sequential blank lines
4. The first character in the edit buffer
5. The last non-<newline> in the edit buffer
A better solution is to use Perl:

perl -nle '$word += scalar(split(" ", $_)); END{print $word}' filename

Note the literal string " " as split's pattern rather than the regex /\s+/: the special " " form skips leading whitespace, so lines that begin with spaces are not overcounted by an empty leading field.
@Bernhard
You can check the source code of the wc command in coreutils. I tested on my machine, using the file subst.c from the bash 4.2 source tree.
time wc -w subst.c
real 0m0.025s
user 0m0.016s
sys 0m0.000s
And
time perl -nle '$word += scalar(split(" ", $_)); END{print $word}' subst.c
real 0m0.021s
user 0m0.016s
sys 0m0.004s
The bigger the file is, the better Perl fares relative to wc.