
I have an Apache logfile, access.log. How do I count the number of occurrences of each line in that file? For example, the result of `cut -f 7 -d ' ' | cut -d '?' -f 1 | tr '[:upper:]' '[:lower:]'` is

a.php
b.php
a.php
c.php
d.php
b.php
a.php

the result that I want is:

3 a.php
2 b.php
1 d.php # order doesn't matter
1 c.php 
Kokizzu

5 Answers

| sort | uniq -c

As stated in the comments.

Piping the output into sort organises it into alphabetical/numerical order.

This is a requirement because uniq only matches on repeated consecutive lines, i.e.

a
b
a

If you use uniq on this text file, it will return the following:

a
b
a

This is because the two `a`s are separated by the `b`; they are not consecutive lines. However, if you first sort the data into alphabetical order, like

a
a
b

Then uniq will remove the repeating lines. The -c option of uniq counts the number of duplicates and provides output in the form:

2 a
1 b
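A self-contained way to see this behaviour (the sample data is inlined with `printf` so the snippet can be run anywhere):

```shell
# sort groups identical lines together; uniq -c then counts each group
printf '%s\n' a b a | sort | uniq -c
# prints the count before each distinct line: 2 for a, 1 for b
```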


Jonathon Reinhart
visudo
  • Welcome to Unix & Linux :) Don't hesitate to add more details to your answer and explain why and how this works ;) – John WH Smith Nov 26 '14 at 12:18
  • `printf '%s\n' ①.php ②.php | sort | uniq -c` gives me `2 ①.php` – Stéphane Chazelas Nov 26 '14 at 12:50
  • @StéphaneChazelas That's because the printf prints `php\nphp` –  Nov 26 '14 at 13:52
  • @Jidder, no, that's because `①.php` sorts the same as `②.php` in my locale, because no sorting order is defined for the `①` and `②` characters in my locale. If you want _unique_ values for any byte values (remember file paths are not necessarily text), then you need to fix the locale to C: `| LC_ALL=C sort | LC_ALL=C uniq -c`. – Stéphane Chazelas Nov 26 '14 at 14:00
  • In order to have the resulting counts sorted, you should consider adding `sort -nr`, as @eduard-florinescu answers below. – Lluís Suñol Mar 26 '18 at 11:41
[your command] | sort | uniq -c | sort -nr

The accepted answer is almost complete; you might want to add an extra `sort -nr` at the end to sort the results with the lines that occur most often first.

uniq options:

-c, --count
       prefix lines by the number of occurrences

sort options:

-n, --numeric-sort
       compare according to string numerical value
-r, --reverse
       reverse the result of comparisons

In the particular case where the lines you are sorting are numbers, you need to use `sort -gr` instead of `sort -nr`; see the comments.
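Applied to the sample paths from the question (inlined here with `printf` so the snippet is self-contained), the full pipeline looks like:

```shell
# Count each path and list the most frequent first
printf '%s\n' a.php b.php a.php c.php d.php b.php a.php \
  | sort | uniq -c | sort -nr
# 3 a.php and 2 b.php come first; the two single-occurrence
# lines come last (their relative order is a tie)
```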

Eduard Florinescu
  • Thanks so much for letting me know about the `-n` option. – Sigur Nov 30 '16 at 17:00
  • Great answer, here's what I use to get a word count out of a file with sentences: `tr ' ' '\n' < $FILE | sort | uniq -c | sort -nr > wordcount.txt`. The first command replaces spaces with newlines, allowing the rest of the command to work as expected. – Bar Jul 20 '17 at 00:08
  • Using the options above I get " 1" before " 23344". Using `sort -gr` instead solves this. `-g`: compare according to general numerical value (instead of `-n`: compare according to string numerical value). – Peter Jaric Feb 14 '19 at 12:24
  • @PeterJaric Great catch and very useful to know about `-gr`, but I think the output of `uniq -c` will be such that `sort -nr` works as intended – Eduard Florinescu Feb 14 '19 at 13:09
  • Actually, when the data are numbers, `-gr` works better. Try these two examples, differing only in the g and n flags: `echo "1 11 1 2" | tr ' ' '\n' | sort | uniq -c | sort -nr` and `echo "1 11 1 2" | tr ' ' '\n' | sort | uniq -c | sort -gr`. The first one sorts incorrectly, but not the second one. – Peter Jaric Feb 15 '19 at 10:31
  • You are right; this is a corner case, but I will mention it in the answer – Eduard Florinescu Feb 15 '19 at 13:06
  • `sort -g` and `sort -n` give me the same output for the given example on coreutils 9.1 (also tested with LC_ALL=C). – TheHardew May 25 '22 at 18:55

You can use an associative array in awk and then, optionally, sort:

$ awk ' { tot[$0]++ } END { for (i in tot) print tot[i],i } ' access.log | sort

output:

1 c.php
1 d.php
2 b.php
3 a.php
slm
  • How would you count the number of occurrences as the pipe is sending data? – user123456 Oct 09 '16 at 18:00
  • This approach is very valuable if the input list is very large, because it does not require reading the entire list into memory and then sorting it. – neirbowj Nov 03 '19 at 18:01
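On the streaming question above: because awk updates its `tot` array on every input line, it can also emit a running count as each line arrives, e.g. fed by `tail -f access.log` instead of the `printf` used below (the `tail -f` variant is an untested sketch):

```shell
# Print the running count of each line as it streams in
printf '%s\n' a.php b.php a.php | awk '{ tot[$0]++; print tot[$0], $0 }'
# → 1 a.php
#   1 b.php
#   2 a.php
```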

You can use the clickhouse-local tool to work with a file as with a SQL table, with a single column in this case:

clickhouse-local --query \
"select data, count() from file('access.log', TSV, 'data String') group by data order by count(*) desc limit 10"

My brief experiment shows it's about 50 times faster than

cat access.log | sort | uniq -c | sort -nr | head -n 10
AdminBee

There is only one occurrence of d.php, so you'll get nice output like this.

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      1 d.php
wolf@linux:~$

What happens when there are four occurrences of d.php?

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      4 d.php
wolf@linux:~$ 

If you want to sort the output by the number of occurrences, you might want to send the stdout to `sort` again.

wolf@linux:~$ cat file | sort | uniq -c | sort
      1 c.php
      2 b.php
      3 a.php
      4 d.php
wolf@linux:~$ 

Use `-r` to reverse the order:

wolf@linux:~$ cat file | sort | uniq -c | sort -r
      4 d.php
      3 a.php
      2 b.php
      1 c.php
wolf@linux:~$ 
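Note that plain `sort`/`sort -r` only happens to work here because GNU `uniq -c` right-aligns the counts; sorting numerically with `-n`/`-nr` is the more robust choice:

```shell
# Numeric descending sort of the counts
printf '%s\n' a.php b.php a.php c.php d.php d.php d.php d.php b.php a.php \
  | sort | uniq -c | sort -nr
```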

Hope this example helps.

Wolf