
I'd like to know the equivalent of

cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c

presented at https://stackoverflow.com/questions/4174113/how-to-gather-characters-usage-statistics-in-text-file-using-unix-commands (which produces character usage statistics for text files), but for binary files, counting plain bytes instead of characters. I.e. the output should be of the form

18383 57
12543 44
11555 127
 8393 0

It doesn't matter if the command takes as long as the referenced one for characters.

If I apply the command for characters to binary files, the output contains statistics for arbitrarily long sequences of unprintable characters (I'm not looking for an explanation of that).

Kalle Richter

5 Answers


With GNU od:

od -vtu1 -An -w1 my.file | sort -n | uniq -c
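As a quick sanity check (a sketch; `printf`, GNU `od` and coreutils assumed), a three-byte sample yields one count/value pair per distinct byte:

```shell
# sample input: two 'a' bytes (value 97) and one 'b' byte (value 98)
printf 'aab' | od -vtu1 -An -w1 | sort -n | uniq -c
# prints a count of 2 for byte value 97, and a count of 1 for byte value 98
```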

Or more efficiently with perl (also outputs a count (0) for bytes that don't occur):

perl -ne 'BEGIN{$/ = \4096};
          $c[$_]++ for unpack("C*");
          END{for ($i=0;$i<256;$i++) {
              printf "%3d: %d\n", $i, $c[$i]}}' my.file
Stéphane Chazelas
  • In order to get the numbers in the first column ordered correctly I had to add `| sort -n` (or `| sort -n -r` for descending order); sorting was not part of the question. Sorting might be done better... – Kalle Richter Sep 23 '14 at 17:28
  • Seems a little overkill to have to sort the entire file, but worked OK for me. – Michael Anderson May 25 '15 at 05:39
  • Good point @Karl, though not requested, using `sort -n` here makes a lot more sense. Answer updated. – Stéphane Chazelas Jun 16 '15 at 13:44

For large files, using sort will be slow. I wrote a short C program to solve the equivalent problem (see this gist for a Makefile with tests):

#include <stdio.h>

#define BUFFERLEN 4096

int main(){
    // This program reads standard input, calculates the frequency of each
    // byte value, and prints the frequencies on exit.
    //
    // Example:
    //
    //     $ echo "Hello world" | ./a.out
    //
    // Copyright (c) 2015 Björn Dahlgren
    // Open source: MIT License

    long long tot = 0; // long long is at least 64 bits, i.e. counts up to ~16 exabytes
    long long n[256]; // One byte == 8 bits => 256 unique bytes

    const int bufferlen = BUFFERLEN;
    char buffer[BUFFERLEN];
    int i;
    size_t nread;

    for (i=0; i<256; ++i)
        n[i] = 0;

    do {
        nread = fread(buffer, 1, bufferlen, stdin);
        for (i = 0; i < nread; ++i)
            ++n[(unsigned char)buffer[i]];
        tot += nread;
    } while (nread == bufferlen);
    // here you may want to inspect ferror() or feof()

    for (i=0; i<256; ++i){
        printf("%d ", i);
        printf("%f\n", n[i]/(float)tot);
    }
    return 0;
}

usage:

gcc main.c
./a.out < my.file
David Foerster
Bjoern Dahlgren
  • Do you have a test? There are no comments in the code. It's generally not a good idea to publish untested or uncommented code, no matter whether it's common practice. The possibility to review revisions is also limited on this platform; consider an explicit code hosting platform. – Kalle Richter Jun 16 '15 at 09:22
  • @KarlRichter tests were a good idea to add. I found the old version choked on '\0' characters. This version should work (passes a few basic tests at least). – Bjoern Dahlgren Jun 16 '15 at 12:14
  • 1
    `fgets` gets a line, not a buffer-full. You're scanning the 4096-byte full buffer for each line read from stdin. You need `fread` here, not `fgets`. – Stéphane Chazelas Jun 16 '15 at 13:43
  • @StéphaneChazelas great - didn't know of fread (seldom do I/O from C). updated example to use fread instead. – Bjoern Dahlgren Jun 17 '15 at 11:00
  • I've added an `if` block around the printf statements, that makes the output more readable if some bytes don't occur in the input file: https://gist.github.com/martinvonwittich/2f0a9e5dcf63cb316765260694b73172 – Martin von Wittich Jun 04 '16 at 11:08

As mean, sigma and CV (coefficient of variation) are often important when judging statistical data about the content of binary files, I've created a command-line program that graphs all this data as an ASCII circle of byte deviations from sigma.
http://wp.me/p2FmmK-96
It can be used with grep, xargs and other tools to extract statistics.

circulosmeos

The recode program can do this quickly even for large files, producing frequency statistics either for bytes or for the characters of various character sets. E.g. to count byte frequencies:

$ echo hello there > /tmp/q
$ recode latin1/..count-characters < /tmp/q
1  000A LF   1  0020 SP   3  0065 e    2  0068 h    2  006C l    1  006F o
1  0072 r    1  0074 t

Caution: pass your file to recode as standard input, otherwise it will silently replace the file with the character frequencies!

Use recode utf-8/..count-characters < file to treat the input file as UTF-8. Many other character sets are available, and it will fail if the file contains any illegal characters.

nealmcb

This is similar to Stéphane's od answer, but it shows the hex value of the byte alongside the character. It is also sorted by frequency / number of occurrences.

xxd -c1 my.file|cut -c10-|sort|uniq -c|sort -nr

I don't think this is efficient since many processes are started but it's good for single files, particularly small files.
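For illustration, a sample run on a three-byte stdin input (a sketch; `xxd` is assumed to be installed, and `cut -c10-` is what strips the offset column from `xxd -c1` output):

```shell
# two 'a' bytes (0x61) and one 'b' byte (0x62)
printf 'aab' | xxd -c1 | cut -c10- | sort | uniq -c | sort -nr
# the 0x61 ("a") line comes first with a count of 2, then 0x62 ("b") with 1
```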

brendan