25

Why does md5sum prepend "\" to the checksum when computing the checksum of a file with "\" in its name?

$ md5sum /tmp/test\\test
\d41d8cd98f00b204e9800998ecf8427e  /tmp/test\\test

The other checksum utilities show the same behaviour.

Jeff Schaller
jsaji
  • Just for reference, the other `*sum` utilities (of the same family as `md5sum`, e.g. `sha1sum` etc.) in GNU coreutils do the same. – Kusalananda Feb 16 '18 at 14:45
  • I don't see this behaviour, what's the version of the utility: `md5sum --version` – Kiwy Feb 16 '18 at 14:46
  • @Kusalananda This may be coreutils version specific; on CentOS 7 `cksum` doesn't; eg `% cksum test\\test 3915528286 4 test\test` – Stephen Harris Feb 16 '18 at 14:47
  • @StephenHarris That _probably_ is because `cksum` is a POSIX utility and its spec. does not allow it. – Kusalananda Feb 16 '18 at 14:49

2 Answers

35

This is documented, for Coreutils’ md5sum:

If file contains a backslash or newline, the line is started with a backslash, and each problematic character in the file name is escaped with a backslash, making the output unambiguous even in the presence of arbitrary file names.

(file is the filename, not the file’s contents).

b2sum, sha1sum, and the various SHA-2 tools behave in the same way as md5sum. sum and cksum don’t; sum is only provided for backwards-compatibility (and its ancestors don’t produce quoted output), and cksum is specified by POSIX and doesn’t allow this type of output.

This behaviour was introduced in November 2015 and released in version 8.25 (January 2016), with the following NEWS entry:

md5sum now ensures a single line per file for status on standard output, by using a '\' at the start of the line, and replacing any newlines with '\n'. This also affects sha1sum, sha224sum, sha256sum, sha384sum and sha512sum.

The backslash at the start of the line serves as a flag: escapes in filenames are only processed if the line starts with a backslash. (Unescaping can’t be the default behaviour: it would break sums generated with older versions of Coreutils containing \\ or \n in the stored filenames.)
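
As a rough illustration of the flag rule described above, here is a minimal Python sketch of how a checksum line could be read. It is not the Coreutils implementation: the names `unescape_name` and `parse_sum_line` are made up for this example, it assumes the two-space separator shown in the question, and it recognises only the `\\` and `\n` escapes named in the documentation.

```python
from __future__ import annotations


def unescape_name(name: str) -> str:
    """Turn the two-character escapes \\\\ and \\n back into backslash and newline."""
    out = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        nxt = name[i + 1 : i + 2]  # empty string if the backslash is the last character
        if nxt == "n":
            out.append("\n")
        elif nxt == "\\":
            out.append("\\")
        else:
            raise ValueError(f"unexpected escape at offset {i}")
        i += 2
    return "".join(out)


def parse_sum_line(line: str) -> tuple[str, str]:
    """Split one 'digest  name' line into (digest, file name).

    Assumption for this sketch: the separator is exactly two spaces.
    """
    line = line.rstrip("\n")
    escaped = line.startswith("\\")  # the leading backslash is the flag
    if escaped:
        line = line[1:]
    digest, sep, name = line.partition("  ")
    if not sep:
        raise ValueError("expected 'digest  name'")
    if escaped:
        # Only flagged lines are unescaped, so unflagged lines are read literally.
        name = unescape_name(name)
    return digest, name
```

With this sketch, the line from the question parses back to the name /tmp/test\test, while a line without the leading backslash keeps any \n in its name as two literal characters.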

Stephen Kitt
  • 32
    It's a shame something completely unintuitive like this isn't documented in the `man` pages, though. (And yes, I'm aware GNU wants everyone to read their highly convoluted `info` pages instead.) – roaima Feb 16 '18 at 15:14
  • This doesn't address why, though. Does it? I can see backwhacking the characters in the filename, but why the sum? – msouth Feb 16 '18 at 21:30
  • 3
    @msouth the backslash at the start of the line serves as a flag indicating that backslashes in the filename are escapes; otherwise you wouldn’t know whether to process `\n` etc. as literals or escapes. – Stephen Kitt Feb 16 '18 at 21:36
  • @StephenKitt thanks. Why not put the backslash at the beginning of the filename then? If you are processing these things mechanically that seems a lot less likely to cause issues. (I realize you probably didn't write this :). I'm just curious if this is some kind of "here be dragons" convention that I'm unaware of.) – msouth Feb 16 '18 at 21:39
  • 3
    @msouth if it’s at the start of the filename, you’ve got no way of knowing whether it’s the flag, or a filename genuinely starting with a backslash... – Stephen Kitt Feb 16 '18 at 21:46
  • 1
    @StephenKitt I don't think the leading \ is there for disambiguation. There is no ambiguity if the output is documented as _always_ escaping backslashes and newlines. It's there so that de-escaping doesn't have to be done if not necessary. You can of course debate whether this is worth it (personally I think it isn't but I am not a `coreutils` contributor). – TypeIA Feb 16 '18 at 21:52
  • @TypeIA you can’t process old sum files if you unescape by default. That’s why you need a flag of some sort. – Stephen Kitt Feb 16 '18 at 22:01
  • @StephenKitt That is a good point. Although, if your parser handles output from non-escaping implementations, your parser can't resolve the ambiguity with newlines. So it seems like a bit of a no-win situation (obviously, since it's... ambiguous). Still I suppose that is a more likely reason than efficiency. I searched the relevant mailing lists and revision control histories and could not find a definitive rationale given by the original author (possibly Jim Meyering). – TypeIA Feb 16 '18 at 22:07
  • @TypeIA Filenames containing newline are rare, probably close to nonexistent except when someone is deliberately trying to confuse a script. So a parser that doesn't handle newlines probably never actually runs into problems. – Barmar Feb 16 '18 at 22:18
  • @Barmar Agreed, but "probably never" is IMO not good enough for a core system utility; certainly not good enough for the meticulous programmer(s) who added this (and even [test cases for it](https://github.com/coreutils/coreutils/commit/516e60ed10957d163c8f967e5115ccd20f26dcbf)) to GNU. – TypeIA Feb 16 '18 at 22:21
  • @TypeIA [this](http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=646902b30dee04b9454fdcaa8a30fd89fc0514ca) is the relevant change, by Pádraig Brady; I can’t find any related mailing-list discussion, I imagine the change was committed directly. – Stephen Kitt Feb 16 '18 at 22:22
  • @TypeIA and regarding ambiguity, old sums are indeed ambiguous; but at least now the tools always produce unambiguous sums (and all that can be done then is process input using the new rules, but not much is lost that way). – Stephen Kitt Feb 16 '18 at 22:26
  • @TypeIA Indeed, that's why it was changed. But we don't want to break scripts that were working in safe environments. – Barmar Feb 16 '18 at 23:04
  • 2
    The documentation's phrase "each problematic character in the file name is escaped with a backslash" is wrong; replacing a newline with `\n` is not the same as escaping a newline with a backslash! – ruakh Feb 17 '18 at 21:39
  • @StephenKitt thanks. I think, if their goal was to make it easier for a parser, that they should put the flag backslash at the end of the sum, then. Then you could pull in the first X characters unconditionally, and branch only if you need to do something special with the filename. Kind of a minor quibble at this point though. Thanks for your answers. – msouth Feb 17 '18 at 23:12
17

Stephen Kitt's answer covers the what; I will try to cover why this change was implemented. First, someone observed that a filename containing newlines¹ could result in ambiguous output. For example, consider this output:

d41d8cd98f00b204e9800998ecf8427e  foo
25af89c92254a806b2e93fffd8ac1814  bar

Does this mean there were two files foo and bar, or only one file whose filename is "foo\n25af89c92254a806b2e93fffd8ac1814  bar"? Granted, the latter possibility is highly unlikely, but it is possible. To resolve the ambiguity, the developers chose to escape newlines using a backslash (\), writing each one as \n. The output then becomes distinguishable. However, this creates a further ambiguity:

764efa883dda1e11db47671c4a3bbd9e  foo\nbar

Does this file's name contain a newline, or a backslash followed by an n? To resolve this we need to escape backslashes too, so that the latter case becomes:

764efa883dda1e11db47671c4a3bbd9e  foo\\nbar

Finally, they elected to prefix each output line that contains such escapes with a \ to make it easy for a parser to detect whether escaping has been done. Presumably this was done to allow parsers to handle output both from escaping versions of md5sum and from non-escaping versions (non-GNU). The flag also means that "costly" un-escaping does not need to be done when it is not necessary. You can see an example of this parsing in action in md5sum.c itself (line 382 in the linked version).
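
To make the three steps above concrete, here is a minimal Python sketch of the output side. It is my own illustration, not a translation of md5sum.c: `format_sum_line` is a hypothetical helper, it takes the file contents as bytes instead of opening a file, and it escapes only backslash and newline.

```python
import hashlib


def format_sum_line(name: str, data: bytes) -> str:
    """Build one md5sum-style output line for a file name and its contents."""
    digest = hashlib.md5(data).hexdigest()
    if "\\" in name or "\n" in name:
        # Escape backslashes first, so the backslash introduced for a newline
        # is not doubled by the second replacement.
        escaped = name.replace("\\", "\\\\").replace("\n", "\\n")
        # The leading backslash flags that the name on this line is escaped.
        return "\\" + digest + "  " + escaped
    # Ordinary names are written unchanged, with no flag.
    return digest + "  " + name
```

For the empty file from the question, `format_sum_line("/tmp/test\\test", b"")` reproduces the flagged line shown there, and a name containing a newline always comes out as a single line.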


1 By newline I mean the character \n which is sometimes also specifically referred to as a linefeed or LF; see md5sum.c.

TypeIA
  • 1
    Of course the _sane_ behaviour would be to completely **ban** every file containing a newline. Just refuse to process them. – pipe Feb 17 '18 at 14:03
  • 1
    @pipe it's _insane_ behavior. POSIX does allow such file names, and the utilities intentionally refusing to work with legitimate files are bad and must be killed with fire. – Ruslan Feb 17 '18 at 14:09
  • 2
    @Ruslan The point is to protest against POSIX for allowing such [antisocial](https://www.dwheeler.com/essays/fixing-unix-linux-filenames.html) names. Allowing such characters has likely caused a large amount of security issues and code bloat just to handle such special cases. – pipe Feb 17 '18 at 14:17
  • @pipe while LF in a file name is indeed antisocial, other things mentioned in your link are much more debatable — like spaces, non-latin letters etc. – Ruslan Feb 17 '18 at 14:54
  • Classic over-engineering by engineers. Lesson (yet again): do not allow engineers to drive requirements. They will find the most obscure and convoluted case and elevate it to the dominant case and confuse everyone. –  Feb 17 '18 at 15:03