
I have an Apache logfile, access.log. How do I count the number of occurrences of each line in that file? For example, the result of `cut -f 7 -d ' ' | cut -d '?' -f 1 | tr '[:upper:]' '[:lower:]'` is

a.php
b.php
a.php
c.php
d.php
b.php
a.php

the result that I want is:

3 a.php
2 b.php
1 d.php # order doesn't matter
1 c.php 
Kokizzu

5 Answers

| sort | uniq -c

As stated in the comments.

Piping the output into sort organises it into alphabetical/numerical order.

This is a requirement because uniq only matches on repeated consecutive lines, i.e.

a
b
a

If you use uniq on this text file, it will return the following:

a
b
a

This is because the two `a`s are separated by the `b`; they are not consecutive lines. However, if you first sort the data into alphabetical order, like

a
a
b

Then uniq will remove the repeating lines. The -c option of uniq counts the number of duplicates and provides output in the form:

2 a
1 b
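A self-contained way to see this behaviour (the sample data is inlined with `printf` so the snippet can be run anywhere):

```shell
# sort groups identical lines together; uniq -c then counts each group
printf '%s\n' a b a | sort | uniq -c
# prints the count before each distinct line: 2 for a, 1 for b
```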


Jonathon Reinhart
visudo
  • Welcome to Unix & Linux :) Don't hesitate to add more details to your answer and explain why and how this works ;) – John WH Smith Nov 26 '14 at 12:18
  • `printf '%s\n' ①.php ②.php | sort | uniq -c` gives me `2 ①.php` – Stéphane Chazelas Nov 26 '14 at 12:50
  • @StéphaneChazelas That's because the printf prints `php\nphp` –  Nov 26 '14 at 13:52
  • @Jidder, no, that's because `①.php` sorts the same as `②.php` in my locale, because no sorting order is defined for the `①` and `②` characters in my locale. If you want _unique_ values for any byte values (remember file paths are not necessarily text), then you need to fix the locale to C: `| LC_ALL=C sort | LC_ALL=C uniq -c`. – Stéphane Chazelas Nov 26 '14 at 14:00
  • In order to have the resulting counts sorted, you should consider adding `sort -nr`, as @eduard-florinescu answers below. – Lluís Suñol Mar 26 '18 at 11:41
[your command] | sort | uniq -c | sort -nr

The accepted answer is almost complete; you might want to add an extra `sort -nr` at the end to sort the results with the lines that occur most often first.

uniq options:

-c, --count
       prefix lines by the number of occurrences

sort options:

-n, --numeric-sort
       compare according to string numerical value
-r, --reverse
       reverse the result of comparisons

In the particular case where the lines you are sorting are numbers, you need to use `sort -gr` instead of `sort -nr`; see the comments.
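Applied to the sample paths from the question (inlined here with `printf` so the snippet is self-contained), the full pipeline looks like:

```shell
# Count each path and list the most frequent first
printf '%s\n' a.php b.php a.php c.php d.php b.php a.php \
  | sort | uniq -c | sort -nr
# 3 a.php and 2 b.php come first; the two single-occurrence
# lines come last (their relative order is a tie)
```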

Eduard Florinescu
  • Thanks so much for letting me know about the `-n` option. – Sigur Nov 30 '16 at 17:00
  • Great answer, here's what I use to get a word count out of a file with sentences: `tr ' ' '\n' < $FILE | sort | uniq -c | sort -nr > wordcount.txt`. The first command replaces spaces with newlines, allowing the rest of the command to work as expected. – Bar Jul 20 '17 at 00:08
  • Using the options above I get " 1" before " 23344". Using `sort -gr` instead solves this. `-g`: compare according to general numerical value (instead of `-n`: compare according to string numerical value). – Peter Jaric Feb 14 '19 at 12:24
  • @PeterJaric Great catch and very useful to know about `-gr`, but I think the output of `uniq -c` will be such that `sort -nr` works as intended – Eduard Florinescu Feb 14 '19 at 13:09
  • Actually, when the data are numbers, `-gr` works better. Try these two examples, differing only in the g and n flags: `echo "1 11 1 2" | tr ' ' '\n' | sort | uniq -c | sort -nr` and `echo "1 11 1 2" | tr ' ' '\n' | sort | uniq -c | sort -gr`. The first one sorts incorrectly, but not the second one. – Peter Jaric Feb 15 '19 at 10:31
  • You are right; this is a corner case, but I will mention it in the answer – Eduard Florinescu Feb 15 '19 at 13:06
  • `sort -g` and `sort -n` give me the same output for the given example on coreutils 9.1 (also tested with LC_ALL=C). – TheHardew May 25 '22 at 18:55

You can use an associative array in awk and then, optionally, sort:

$ awk ' { tot[$0]++ } END { for (i in tot) print tot[i],i } ' access.log | sort

output:

1 c.php
1 d.php
2 b.php
3 a.php
slm
  • How would you count the number of occurrences as the pipe is sending data? – user123456 Oct 09 '16 at 18:00
  • This approach is very valuable if the input list is very large, because it does not require reading the entire list into memory and then sorting it. – neirbowj Nov 03 '19 at 18:01
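On the streaming question above: because awk updates its `tot` array on every input line, it can also emit a running count as each line arrives, e.g. fed by `tail -f access.log` instead of the `printf` used below (the `tail -f` variant is an untested sketch):

```shell
# Print the running count of each line as it streams in
printf '%s\n' a.php b.php a.php | awk '{ tot[$0]++; print tot[$0], $0 }'
# → 1 a.php
#   1 b.php
#   2 a.php
```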

You can use the clickhouse-local tool to work with a file as with a SQL table, with a single column in this case:

clickhouse-local --query \
"select data, count() from file('access.log', TSV, 'data String') group by data order by count(*) desc limit 10"

My brief experiment shows it's about 50 times faster than

cat access.log | sort | uniq -c | sort -nr | head -n 10
AdminBee

There is only one occurrence of d.php, so you'll get nice output like this.

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      1 d.php
wolf@linux:~$

What happens when there are four occurrences of d.php?

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      4 d.php
wolf@linux:~$ 

If you want to sort the output by the number of occurrences, you might want to send the stdout to `sort` again.

wolf@linux:~$ cat file | sort | uniq -c | sort
      1 c.php
      2 b.php
      3 a.php
      4 d.php
wolf@linux:~$ 

Use `-r` to reverse the order:

wolf@linux:~$ cat file | sort | uniq -c | sort -r
      4 d.php
      3 a.php
      2 b.php
      1 c.php
wolf@linux:~$ 
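Note that plain `sort`/`sort -r` only happens to work here because GNU `uniq -c` right-aligns the counts; sorting numerically with `-n`/`-nr` is the more robust choice:

```shell
# Numeric descending sort of the counts
printf '%s\n' a.php b.php a.php c.php d.php d.php d.php d.php b.php a.php \
  | sort | uniq -c | sort -nr
```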

Hope this example helps.

Wolf