I have two files with sizes 124665 and 124858 in bytes and want to check whether file1 is a prefix of file2 or not.
3 Answers
If your system has the cmp command from GNU diffutils, then one option is
cmp -n 124665 file1 file2
to compare at most the first 124665 bytes of the two files and report if they differ - or, more generally
cmp -n "$(wc -c < file1)" file1 file2
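For use in a script, the same comparison can be wrapped in a function that relies only on the exit status. This is a minimal sketch, assuming a cmp that accepts -n (which is not POSIX); the name is_prefix_n is made up for illustration.

```shell
# is_prefix_n A B: exit 0 if A is a byte-for-byte prefix of B.
# Hypothetical helper; assumes cmp supports the non-POSIX -n option.
is_prefix_n() {
    # $(( )) strips the padding some wc implementations print before the count
    size=$(( $(wc -c < "$1") ))
    # -s: print nothing, report only via exit status (0 same, 1 differ, >1 error)
    cmp -s -n "$size" "$1" "$2"
}
```

It could then be called as: is_prefix_n file1 file2 && echo "file1 is a prefix of file2". If file1 is longer than file2, cmp reaches the end of file2 first and the function returns non-zero, which is the desired answer.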
@StephaneChazelas I'm second guessing myself here but would it have been better to suggest `$(stat -c %s file1)` for the size in bytes? Does `wc` actually open and process the whole file to get the byte count? – steeldriver Jun 07 '14 at 19:51
No, most `wc` implementations will optimise that case and do an `fstat()` (and/or an `lseek(SEEK_END)`), so it will be as efficient as it gets. On the other hand, that `stat -c` is GNU-specific. – Stéphane Chazelas Jun 07 '14 at 19:52
Although if you're going to require the GNU-specific `cmp`, you might reasonably assume GNU-specific `stat`. – Barmar Jun 11 '14 at 19:04
Supposing you have the size of file1 in the variable FILE1_SZ and your head implementation supports the (non-standard) -c option:
if head -c "$FILE1_SZ" file2 | cmp -s - file1; then
    echo "file1 is a prefix of file2"
else
    echo "file1 is not a prefix of file2"
fi
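One way FILE1_SZ might be populated, together with the pipeline above, is sketched here. It assumes a head that accepts -c; is_prefix_head is a hypothetical name, not something from the answer.

```shell
# is_prefix_head A B: exit 0 if A is a byte-for-byte prefix of B.
# Hypothetical helper; assumes head supports the non-standard -c option.
is_prefix_head() {
    # Bail out early (status 2) if either file cannot be read
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # wc -c on redirected input prints only the byte count;
    # $(( )) strips any padding around it
    FILE1_SZ=$(( $(wc -c < "$1") ))
    # Compare the first FILE1_SZ bytes of B (on stdin, "-") with all of A
    head -c "$FILE1_SZ" "$2" | cmp -s - "$1"
}
```

When B is shorter than A, head simply emits all of B, cmp runs out of input on stdin, and the function returns non-zero.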
@StéphaneChazelas Can you please explain why `cmp` would be better than `diff` here? – Joseph R. Jun 07 '14 at 19:40
Because `cmp` does a simple byte-by-byte comparison and returns as soon as it finds a difference, while `diff` is a text utility that is going to use a complex algorithm to show you all the differences between the two files, which you don't care about. – Stéphane Chazelas Jun 07 '14 at 20:02
GNU cmp can solve the problem in an easier way:
cmp file1 file2
There are four possible outputs (barring some sort of error).
No output: the files are identical.
cmp: EOF on file1: file1 is a prefix of file2.
cmp: EOF on file2: file2 is a prefix of file1.
file1 file2 differ: byte NNN, line MMM: Neither is a prefix of the other.
Unfortunately this is a little awkward to use in a script, since these cases don't seem to be distinguished in the exit code. Moreover, the EOF on file1 messages go to stderr, while the file1 file2 differ message goes to stdout.
I presume that other versions of cmp do something similar, but I have not checked.
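If the four cases do need to be told apart in a script, one possible approach is to combine the exit status with the text on stderr. The sketch below feeds the first file on stdin so that cmp reports it as "-", which sidesteps awkward file names. It assumes the EOF message starts with "cmp: EOF on " followed by the name, as in the output above (some versions append extra text, hence the prefix match); classify_cmp and its output strings are hypothetical.

```shell
# classify_cmp A B: print how A and B relate (hypothetical helper).
classify_cmp() {
    a=$1 b=$2
    if [ ! -r "$a" ] || [ ! -r "$b" ]; then
        echo "error: cannot read input" >&2
        return 2
    fi
    # Keep the second name from starting with "-", so it is neither taken
    # for an option nor confused with the stdin marker "-".
    case $b in -*) b=./$b ;; esac
    # A is read from stdin, so cmp names it "-". Capture stderr only;
    # the "differ" line on stdout is discarded. LC_ALL=C pins the wording.
    err=$(LC_ALL=C cmp - "$b" < "$a" 2>&1 >/dev/null)
    rc=$?
    case $rc in
        0) echo "identical" ;;
        1) case $err in
               "cmp: EOF on -"*) echo "first is a prefix of second" ;;
               "cmp: EOF on "*)  echo "second is a prefix of first" ;;
               *)                echo "neither is a prefix" ;;
           esac ;;
        *) echo "error: $err" >&2; return 2 ;;
    esac
}
```

Because the decision still depends on the wording of a diagnostic, this is more fragile than comparing a fixed number of bytes and checking only the exit status.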
`cmp` is not a GNU-only command, nor did it originate there; it was already in the first version of Unix in the early 70s. The `-n` option is GNU-specific, though. – Stéphane Chazelas Jun 07 '14 at 19:26
@StéphaneChazelas: That is true. I didn't mean to imply that `cmp` was unique to GNU, just that GNU `cmp` was the only version I tried. I added a sentence to clarify. – Nate Eldredge Jun 08 '14 at 04:19
@DavidZ: Yes, you could, but it gets a little less robust. Imagine that you are trying to do this with two files supplied by the user, and one of them is named `file1` and the other is named `file12`. (Or worse yet, what if the second file is named `EOF on file1`?) Solving this robustly using `cmp` is probably much more trouble than writing the obvious 5-line program in C... – Nate Eldredge Jun 08 '14 at 04:23
There may be contexts where a C program isn't practical, though. And it's not that hard to make it fairly robust, because the output of `cmp` is so tightly constrained. Using the `-x` option on `grep` to match the entire line will take care of all but the most exotic cases (e.g. newlines in the filename). – David Z Jun 08 '14 at 04:29