I am looking for a command to count the number of words in a file. For instance, if a file is like this,
today is a
good day
then it should print 5, since there are 5 words there.
The command wc (short for "word count") can do it:
$ wc -w <file>
$ cat sample.txt
today is a
good day
$ wc -w sample.txt
5 sample.txt
# just the number (thanks to Stephane Chazelas' comment)
$ wc -w < sample.txt
5
I came up with this for JUST the number:
wc -w [file] | cut -d' ' -f1
5
I also like the wc -w < [file] approach
Finally, for storing just the word count in a variable, you could use the following:
myVar=($(wc -w /path/to/file))
This lets you skip the filename elegantly: the unquoted command substitution is word-split into an array, and $myVar expands to its first element, which is the count.
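If you'd rather avoid the array trick, redirecting the file into wc works too, since wc prints no filename when it reads standard input (a small sketch using a throwaway sample.txt):

```shell
# Create the sample file from the question.
printf 'today is a\ngood day\n' > sample.txt

# Reading from stdin, wc has no filename to print, so only the count appears.
count=$(wc -w < sample.txt)
echo "$count"   # the word count, here 5 (some wc implementations pad it with spaces)
```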
$ function wordfrequency() { awk 'BEGIN { FS="[^a-zA-Z]+" } { for (i=1; i<=NF; i++) { word = tolower($i); words[word]++ } } END { for (w in words) printf("%3d %s\n", words[w], w) }' | sort -rn; }
$ cat your_file.txt | wordfrequency
This lists the frequency of each word occurring in the provided file. I know it's not what you asked for, but it's better! If you want to see the occurrences of your word, you can just do this:
$ cat your_file.txt | wordfrequency | grep yourword
I even added this function to my .dotfiles
Source: AWK-ward Ruby
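A quick check of the function (redefined here so the snippet stands alone; note the semicolon between the two awk statements inside the loop body, which is required when the script is written on fewer lines):

```shell
# Word-frequency counter: split on runs of non-letters, lowercase each word,
# tally into an array, then sort by count, highest first.
wordfrequency() {
  awk 'BEGIN { FS = "[^a-zA-Z]+" }
       { for (i = 1; i <= NF; i++) { word = tolower($i); words[word]++ } }
       END { for (w in words) printf("%3d %s\n", words[w], w) }' | sort -rn
}

printf 'Good day today is a good day\n' | wordfrequency
# "good" and "day" each appear twice; "today", "is", and "a" once.
```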
The wc program counts "words", but those are not necessarily the "words" that many people see when they examine a file. The vi program, for instance, uses a different measure of "words", delimiting them based on their character classes, while wc simply counts things separated by whitespace. The two measures can be radically different. Consider this example:
first,second
vi sees three words (first and second as well as the comma separating them), while wc sees one (there is no whitespace on that line). There are many ways to count words, some are less useful than others.
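You can check wc's behaviour on that line directly (vi's count of three is only visible interactively, via its word-motion commands):

```shell
# One line containing no whitespace: wc -w sees a single "word".
printf 'first,second\n' | wc -w   # counts 1 word
```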
While Perl would be better suited to writing a counter for the vi-style words, here is a quick example using sed, tr and wc (moderately portable; ^M stands for a literal carriage return):
#!/bin/sh
in_words="[[:alnum:]_]"
in_punct="[][{}\\|:\"';<>,./?\`~!@#$%^&*()+=-]"
sed -e "s/\($in_words\)\($in_punct\)/\1^M\2/g" \
    -e "s/\($in_punct\)\($in_words\)/\1^M\2/g" \
    -e "s/[[:space:]]/^M/g" \
    "$@" |
tr '\r' '\n' |
sed -e '/^$/d' |
wc -l
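For comparison, here is a shorter approximation of the same vi-style split (my own sketch, not from the original answer), relying on grep's -o option to print each match on its own line:

```shell
# Each match is either a run of word characters (letters, digits, underscore)
# or a run of anything else that is not whitespace, i.e. punctuation.
# One match per output line, so counting lines counts vi-style words.
printf 'first,second\n' |
  grep -oE '[[:alnum:]_]+|[^[:alnum:]_[:space:]]+' |
  wc -l   # 3: "first", ",", "second"
```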
Comparing counts:
wc gives 28. For reference, POSIX vi says:
In the POSIX locale, vi shall recognize five kinds of words:

1. A maximal sequence of letters, digits, and underscores, delimited at both ends by:
   - Characters other than letters, digits, or underscores
   - The beginning or end of a line
   - The beginning or end of the edit buffer
2. A maximal sequence of characters other than letters, digits, underscores, or <blank> characters, delimited at both ends by:
   - A letter, digit, underscore, or <blank> character
   - The beginning or end of a line
   - The beginning or end of the edit buffer
3. One or more sequential blank lines
4. The first character in the edit buffer
5. The last non-<newline> in the edit buffer
A better solution is to use Perl:

perl -nle '$word += scalar(split(" ", $_)); END{print $word}' filename

Note the literal string " " as split's pattern rather than the regex /\s+/: the special " " form skips leading whitespace, so lines that begin with spaces are not overcounted by an empty leading field.
@Bernhard
You can check the source code of the wc command in coreutils. I tested on my machine, using the file subst.c from the bash 4.2 source tree.
time wc -w subst.c
real 0m0.025s
user 0m0.016s
sys 0m0.000s
And
time perl -nle '$word += scalar(split(" ", $_)); END{print $word}' subst.c
real 0m0.021s
user 0m0.016s
sys 0m0.004s
The bigger the file is, the better Perl fares relative to wc.