Get count of occurrences of each word in document

Question

How can I count the occurrences of every word in a file?

I want a histogram of the words in a text pipe or document. Newlines and empty lines will exist in the document. I have stripped every character except for [a-zA-Z].

> cat doc.txt 
word second third 

word really
> cat doc.txt | ... # then count occurrences of each word \
                    # and print in descending order separated by delimiter
word 2
really 1
second 1
third 1

It needs to be reasonably efficient: the file is 1 GB of text, so anything with exponential running time won't work.

@glennjackman perl is perfectly fine. The fish tag is there to make sure answers work in fish and aren't bash/zsh-specific, as those aren't that useful for me – user14492, Aug 12 '20 at 15:26
@Kusalananda like I said, only `[a-zA-Z]`, so a hyphen cannot exist; only lowercase and uppercase letters :) What about GNU/Linux? I chose the macOS tag because I want people to assume the macOS flavor of tools and expect only the macOS default tools to exist (installing a separate tool is overkill imo). – user14492, Aug 12 '20 at 20:22
@user14492 Ah, I missed the bit where you deleted all non-letters. By `GNU/Linux` I meant to ask whether it was to be counted as one or two words, but by deleting the non-letter `/` it's clear that it's a single word. – Kusalananda, Aug 12 '20 at 21:21

pLumo · Answer 1 · 2020-08-12T14:44:30.380

6

Try this:

grep -o '\w*' doc.txt | sort | uniq -c | sort -nr

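-o        print each match on its own line instead of the whole matching line
\w*       match a run of word characters (letters, digits, underscore)
sort      sort the matches so that identical words end up adjacent, which uniq requires
uniq -c   print each unique line prefixed with its number of occurrences
sort -nr  sort numerically by that count, in descending order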

Output:

  2 word
  1 third
  1 second
  1 really
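
A possible variation on the pipeline above, as a rough sketch rather than a drop-in replacement: split the input with `tr` instead of `grep -o`, then swap the columns so the result is in the `word count` form the question asks for. The function name `word_histogram` is made up for illustration, and the sketch assumes words are runs of ASCII letters only, as stated in the question.

```shell
# Hypothetical helper (name chosen for this example); reads text on stdin
# and prints "word count" lines, highest count first.
word_histogram() {
  # Replace every run of non-letters with a single newline: one word per line.
  tr -cs 'A-Za-z' '\n' |
    sort |                                  # group identical words together
    uniq -c |                               # prefix each word with its count
    awk '$2 != "" {print $2, $1}' |         # drop a possible empty "word", print word first
    sort -k2,2nr -k1,1                      # count descending, ties alphabetical
}
```

Usage would be `word_histogram < doc.txt`. The wrapper is sh syntax, but the pipeline inside consists only of external commands, so it should also work when typed directly at a fish prompt.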

Alternative:

Use awk to get the exact output format requested:

$ grep -o '\w*' doc.txt \
| awk '{seen[$0]++} END{for(s in seen){print s,seen[s]}}' \
| sort -k2,2nr

word 2
really 1
second 1
third 1
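
Since the question says the text already contains nothing but letters and whitespace, the `grep` step might be skipped altogether by letting awk split each line into fields. The following is only a sketch under that assumption; `word_counts_awk` is a hypothetical name, and memory use grows with the number of distinct words rather than with the file size.

```shell
# Hypothetical helper (name chosen for this example); takes file names as
# arguments, or reads stdin when none are given.
word_counts_awk() {
  awk '
    # Default field splitting on whitespace; blank lines have NF == 0 and are skipped.
    { for (i = 1; i <= NF; i++) count[$i]++ }
    # for-in order is unspecified, so the final ordering is left to sort.
    END { for (w in count) print w, count[w] }
  ' "$@" | sort -k2,2nr -k1,1               # count descending, ties alphabetical
}
```

For example, `word_counts_awk doc.txt` or `cat doc.txt | word_counts_awk`. Only the list of distinct words goes through `sort` here, not every occurrence, which may matter for a 1 GB input.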

edited Aug 12 '20 at 14:44

answered Aug 12 '20 at 14:33

pLumo


With GNU awk, you don't need the external sort: `END {PROCINFO["sorted_in"] = "@val_num_desc"; for (s in seen) print s, seen[s]}` – glenn jackman Aug 12 '20 at 16:35

glenn jackman · Answer 2 · 2020-08-12T16:42:15.280

0

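perl -lnE '
  # count every run of letters on the current line
  $count{$_}++ for /[[:alpha:]]+/g;
  END {
    # print "word count" pairs, highest count first, ties in alphabetical order
    say "@$_" for
      sort {$b->[1] <=> $a->[1] || $a->[0] cmp $b->[0]}
      map {[$_, $count{$_}]}
      keys %count
  }
' doc.txt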

This will consume lots more memory than pLumo's initial solution.

edited Aug 12 '20 at 16:42

answered Aug 12 '20 at 15:11

glenn jackman

