13

I have two files, of 124665 and 124858 bytes respectively, and want to check whether file1 is a prefix of file2.

Gilles 'SO- stop being evil'
tvorog

3 Answers

12

If your system has the cmp command from GNU diffutils, then one option is

cmp -n 124665 file1 file2

to compare at most the first 124665 bytes of the two files and report if they differ - or, more generally

cmp -n "$(wc -c < file1)" file1 file2
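As a minimal sketch, the same idea can be wrapped in a shell function. The name is_prefix is made up for this example, and it assumes a cmp that supports -n (as GNU diffutils does); -s only suppresses the output so that the exit status alone carries the answer.

```shell
# is_prefix FILE1 FILE2 (hypothetical helper name)
# Succeeds (exit 0) if FILE1 is a byte-for-byte prefix of FILE2.
# Assumes cmp supports the non-standard -n option (e.g. GNU diffutils).
is_prefix() {
    # Size of FILE1 in bytes; give up with status 2 if it cannot be read.
    size=$(wc -c < "$1") || return 2
    # $((size)) strips the leading blanks some wc implementations print.
    # Compare at most that many bytes, silently.
    cmp -s -n "$((size))" -- "$1" "$2"
}
```

It could then be used as `if is_prefix file1 file2; then echo yes; fi`. If file1 is longer than file2, cmp runs out of data in file2 first and reports a difference, so the function fails as expected.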
Stéphane Chazelas
steeldriver
  • @StephaneChazelas I'm second guessing myself here but would it have been better to suggest `$(stat -c %s file1)` for the size in bytes? Does `wc` actually open and process the whole file to get the byte count? – steeldriver Jun 07 '14 at 19:51
  • No, most `wc` implementations optimise that case and do an `fstat()` (and/or an `lseek(SEEK_END)`), so it will be as efficient as it gets. On the other hand, `stat -c` is GNU-specific. – Stéphane Chazelas Jun 07 '14 at 19:52
  • Although if you're going to require the GNU-specific `cmp`, you might reasonably assume GNU-specific `stat`. – Barmar Jun 11 '14 at 19:04
11

Supposing you have the size of file1 in the variable FILE1_SZ and your head implementation supports the (non-standard) -c option:

if head -c "$FILE1_SZ" file2 | cmp -s - file1; then
    echo "file1 is a prefix of file2"
else
    echo "file1 is not a prefix of file2"
fi
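For completeness, here is a rough sketch of how FILE1_SZ might be obtained and the whole test packaged as a function. The name is_prefix_head is invented for this example, and it still relies on head accepting -c.

```shell
# is_prefix_head FILE1 FILE2 (hypothetical helper name)
# Succeeds (exit 0) if FILE1 is a byte-for-byte prefix of FILE2.
# Assumes head supports the non-standard -c option.
is_prefix_head() {
    # Byte count of FILE1; return 2 if the file cannot be read.
    FILE1_SZ=$(wc -c < "$1") || return 2
    # Feed the first FILE1_SZ bytes of FILE2 to cmp on stdin and
    # compare them with FILE1. The pipeline's status is cmp's status.
    head -c "$((FILE1_SZ))" < "$2" | cmp -s - "$1"
}
```

When file2 is shorter than file1, head simply passes all of file2 through, cmp hits end of input early, and the function reports failure.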
Stéphane Chazelas
Joseph R.
  • @StéphaneChazelas Can you please explain why `cmp` would be better than `diff` here? – Joseph R. Jun 07 '14 at 19:40
  • Because `cmp` does a simple byte-by-byte comparison and returns as soon as it finds a difference, while `diff` is a text utility that uses a complex algorithm to show you all the differences between the two files, which you don't care about here. – Stéphane Chazelas Jun 07 '14 at 20:02
3

GNU cmp can solve the problem in an easier way:

cmp file1 file2

There are four possible outputs (barring some sort of error).

  • No output: the files are identical.

  • cmp: EOF on file1: file1 is a prefix of file2.

  • cmp: EOF on file2: file2 is a prefix of file1.

  • file1 file2 differ: byte NNN, line MMM: Neither is a prefix of the other.

Unfortunately this is a little awkward to use in a script, since the exit status does not distinguish these cases: cmp exits with status 1 whenever the files are not identical, including when one is a prefix of the other. Moreover, the EOF messages go to stderr, while the file1 file2 differ message goes to stdout.

I presume that other versions of cmp do something similar, but I have not checked.
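One way around the awkwardness, sketched here under the assumption that cmp writes the EOF diagnostic to stderr and the differ message to stdout as described above, is to check only whether stderr is empty and then let the file sizes say which file is the shorter one. That avoids matching on file names or on the exact message text. The function name classify and the four result words are made up for this example.

```shell
# classify FILE1 FILE2 (hypothetical helper name)
# Prints one of: identical, first-is-prefix, second-is-prefix, differ.
# Returns 2 if cmp itself failed (for example, an unreadable file).
classify() {
    # Capture stderr only; stdout (the "differ" message) is discarded.
    err=$(cmp -- "$1" "$2" 2>&1 >/dev/null)
    status=$?
    case $status in
        0) echo identical ;;
        1)
            if [ -z "$err" ]; then
                # Nothing on stderr: a real mismatch was found.
                echo differ
            else
                # An EOF diagnostic: the shorter file is the prefix.
                s1=$(wc -c < "$1")
                s2=$(wc -c < "$2")
                if [ "$((s1))" -lt "$((s2))" ]; then
                    echo first-is-prefix
                else
                    echo second-is-prefix
                fi
            fi ;;
        *) echo "cmp failed: $err" >&2; return 2 ;;
    esac
}
```

A caller could then write `[ "$(classify file1 file2)" = first-is-prefix ]`. Note that this reads the sizes with a second pass of wc after cmp has run, so it is only a sketch for files that are not changing underneath it.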

Nate Eldredge
  • `cmp` is not a GNU-only command, nor did it originate there; it was already in the first version of Unix in the early 70s. The `-n` option is GNU-specific, though. – Stéphane Chazelas Jun 07 '14 at 19:26
  • You could do `cmp file1 file2 2>&1 | grep 'EOF on file1'` – David Z Jun 08 '14 at 01:39
  • @StéphaneChazelas: That is true. I didn't mean to imply that `cmp` was unique to GNU, just that GNU `cmp` was the only version I tried. I added a sentence to clarify. – Nate Eldredge Jun 08 '14 at 04:19
  • @DavidZ: Yes, you could, but it gets a little less robust. Imagine that you are trying to do this with two files supplied by the user, and one of them is named `file1` and the other is named `file12`. (Or worse yet, what if the second file is named `EOF on file1`?) Solving this robustly using `cmp` is probably much more trouble than writing the obvious 5-line program in C... – Nate Eldredge Jun 08 '14 at 04:23
  • There may be contexts where a C program isn't practical, though. And it's not that hard to make it fairly robust, because the output of `cmp` is so tightly constrained. Using the `-x` option on `grep` to match the entire line will take care of all but the most exotic cases (e.g. newlines in the filename). – David Z Jun 08 '14 at 04:29