6

Because that's what some of them are doing.

> echo echo Hallo, Baby! | iconv -f utf-8 -t utf-16le > /tmp/hallo
> chmod 755 /tmp/hallo
> dash /tmp/hallo
Hallo, Baby!
> bash /tmp/hallo
/tmp/hallo: /tmp/hallo: cannot execute binary file
> (echo '#'; echo echo Hallo, Baby! | iconv -f utf-8 -t utf-16le) > /tmp/hallo
> bash /tmp/hallo
Hallo, Baby!
> mksh /tmp/hallo
Hallo, Baby!
> cat -v /tmp/hallo
#
e^@c^@h^@o^@ ^@H^@a^@l^@l^@o^@,^@ ^@B^@a^@b^@y^@!^@
^@

Is this some compatibility nuisance actually required by the standard? Because it looks quite dangerous and unexpected.

  • 1
    The standard doesn't allow NULs in scripts; see [here](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sh.html#tag_20_117_07), and [here](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403). –  Nov 26 '19 at 08:04
  • 1
    I also didn't understand why this is *quite dangerous* –  Nov 26 '19 at 10:14
  • Note that of those shells you mention, only `bash` has ever been certified as POSIX compliant (some version, in some specific environment, with specific compilation flags though). So if `bash` doesn't do _something_ (at least when invoked as `sh`), it's likely that _something_ is not a POSIX requirement. – Stéphane Chazelas Nov 26 '19 at 14:47
  • 2
    In using the phrase "NUL bytes" this question is conflating two different things. `NUL` is a character name. The text files in question have _characters_ whose _multiple-byte encodings_ contain _bytes_ with the value zero, but those are **not** `NUL` characters in UTF16; and the text files in question contain **no** `NUL` _characters_ at all. Better questions would be whether this behaviour is conformant in the "POSIX" locale, what locales in practice allow text files to be encoded as UTF16, and why `cat -v` is not showing the zero bytes after the 0x23 and 0x0A bytes in the first line. – JdeBP Nov 26 '19 at 15:39
  • 1
    @JdeBP no, it has nothing to do with locales, utf-16 or unicode; I've used `iconv` just for convenience; the shell will ignore any number of NUL **bytes**: `(printf ec; dd if=/dev/zero count=1024; echo ho doh) | dash`. The misleading [unicode] tag wasn't by me. –  Nov 27 '19 at 00:41
  • @oguzismail there's an *infinite* number of ways of writing `LD_PRELOAD` or any command, variable name or arguments; that may defeat "naive" auditing. –  Nov 27 '19 at 00:49
  • @JdeBP and `cat -v` is not showing the zero/nul bytes after 0x23 and 0x0a because they're not there ;-) –  Nov 27 '19 at 02:54
  • There are no zero bytes after 0x23 and 0x0A bytes. @JdeBP –  Nov 27 '19 at 03:08
  • Nah, I'm not following. Shells have to remove NUL bytes in order for paths/command names to work. –  Nov 27 '19 at 05:03
  • 1
    @oguzismail if you're searching for `evil_shit` in some scripts, you'll have to search for `e\0*v\0*i\0*l\0*_\0*s\0*h\0*i\0*t\0*` instead. And that's just one way of looking at it. –  Nov 27 '19 at 06:20
  • Now it makes sense, yeah. –  Nov 27 '19 at 06:25
  • @localuser It's almost exactly like ignoring UTF-8 overlong forms and just decoding them. – Kaz Nov 27 '19 at 07:44

2 Answers

11

As per POSIX,

The input file shall be a text file, except that line lengths shall be unlimited¹

NUL characters² in the input make it non-text, so as far as POSIX is concerned the behaviour is unspecified and sh implementations can do whatever they want (and a POSIX-compliant script must not contain NULs).

There are some shells that scan the first few bytes of the input for NUL bytes and refuse to run the script, on the assumption that you tried to execute a non-script file by mistake.

That's useful because the exec*p() functions, the env command, sh, find -exec..., etc. are required to call a shell to interpret a command when the system returns ENOEXEC upon execve(). So if you try to execute a command built for the wrong architecture, it's better to get a cannot execute binary file error from your shell than to have it try to make sense of the file as a shell script.
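
For example (using a hypothetical /tmp/noshebang path), a script with no #! line still runs when invoked by name, because execve() fails with ENOEXEC and the invoking shell (or execlp()/execvp()) falls back to interpreting the file itself:

$ printf 'echo hello\n' > /tmp/noshebang   # plain sh commands, no #! line
$ chmod +x /tmp/noshebang
$ /tmp/noshebang
hello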

Refusing to run such a non-text file is allowed by POSIX:

If the executable file is not a text file, the shell may bypass this command execution.

Which in the next revision of the standard will be changed to:

The shell may apply a heuristic check to determine if the file to be executed could be a script and may bypass this command execution if it determines that the file cannot be a script. In this case, it shall write an error message, and shall return an exit status of 126.
Note: A common heuristic for rejecting files that cannot be a script is locating a NUL byte prior to a <newline> byte within a fixed-length prefix of the file. Since sh is required to accept input files with unlimited line lengths, the heuristic check cannot be based on line length.
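
A rough shell approximation of that heuristic (purely illustrative; the function name and the 256-byte prefix length are made up, and no shell is claimed to work exactly this way) is to report a file as non-script if a NUL byte shows up before the first newline within a fixed-length prefix:

# hypothetical check: does the first line of the first 256 bytes contain a NUL byte?
looks_like_binary() {
  head -c 256 -- "$1" | head -n 1 | od -An -tx1 | grep -q ' 00'
}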

That behaviour can get in the way of self-extracting shell archives, though, which consist of a shell header followed by binary data¹.
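
A minimal sketch of such a self-extracting script (the __ARCHIVE__ marker name is made up, and the gzipped tar payload would be appended after the marker line when building the archive):

#!/bin/sh
# everything up to (and including) exit 0 is plain shell text;
# the binary payload only starts after the __ARCHIVE__ marker line
skip=$(awk '/^__ARCHIVE__$/ { print NR + 1; exit }' "$0")
tail -n +"$skip" "$0" | tar xzf -   # assumes a tar that understands z (gzip)
exit 0
__ARCHIVE__
(binary payload appended here)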

The zsh shell supports NUL in its input, though note that NULs can't be passed in the arguments of execve(), so you can only use them in the arguments or names of builtin commands or functions:

$ printf '\0() echo zero; \0\necho \0\n' | zsh | hd
00000000  7a 65 72 6f 0a 00 0a                              |zero...|
00000007

(here defining and calling a function with NUL as its name, and passing a NUL character as an argument to the builtin echo command).

Some shells will strip them, which is also a sensible thing to do. NULs are sometimes used as padding. They are ignored by terminals, for instance (they were sometimes sent to terminals to give them time to process complex control sequences like carriage return (literally)). Holes in sparse files read back as if filled with NULs, etc.

Note that non-text is not limited to NUL bytes. It also includes sequences of bytes that don't form valid characters in the locale. For instance, the 0xc1 byte value cannot occur in UTF-8 encoded text. So in locales using UTF-8 as the character encoding, a file that contains such a byte is not a valid text file and therefore not a valid sh script³.

In practice, yash is the only shell I know that will complain about such invalid input.
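
For instance, one way to produce such a file and compare shells (the path is just an example; the exact diagnostics, if any, depend on the shell and on the locale using UTF-8):

$ printf 'echo ok \301\n' > /tmp/notutf8   # 0xc1 can never occur in UTF-8
$ yash /tmp/notutf8    # expected to complain about the invalid byte
$ dash /tmp/notutf8    # most other shells run it regardless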


¹ In the next revision of the standard, it is going to change to

The input file may be of any type, but the initial portion of the file intended to be parsed according to the shell grammar (XREF to XSH 2.10.2 Shell Grammar Rules) shall consist of characters and shall not contain the NUL character. The shell shall not enforce any line length limits.

explicitly requiring shells to support input that starts with a syntactically valid section without NUL bytes, even if the rest contains NULs, to account for self-extracting archives.

² Characters are meant to be decoded as per the locale's character encoding (see the output of locale charmap), and on POSIX systems, the NUL character (whose encoding is always byte 0) is the only character whose encoding contains the byte 0. In other words, UTF-16 is not among the character encodings that can be used in a POSIX locale.

³ There is however the question of the locale changing within the script (like when the LANG/LC_CTYPE/LC_ALL/LOCPATH variables are assigned) and at which point the change takes effect for the shell interpreting the input.

Stéphane Chazelas
  • 522,931
  • 91
  • 1,010
  • 1,501
  • Hmm, I wonder why it's 'NUL byte prior to a newline'? Unless it's exactly for self-extracting files, which probably have a newline first and the binary data only after that. – ilkkachu Nov 26 '19 at 12:11
  • 1
    @ilkkachu, yes see edit. – Stéphane Chazelas Nov 26 '19 at 13:34
  • See my answer, the behavior is not related to POSIX but rather to an implementation detail. – schily Nov 26 '19 at 14:20
  • @schily, the behaviour is _allowed_ by POSIX, which is the point I'm making here. Your answer is also useful to explain why some implementation chose (or not) to behave one way or another on those invalid inputs. – Stéphane Chazelas Nov 26 '19 at 14:25
  • Text files contain characters. You need to go one further and look at [the definition of "character"](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87). The question asked about NUL _bytes_. You've answered about NUL _characters_. If according to the standard multi-byte characters (other than NUL) _may_ contain bytes with the value zero in their encodings, which would certainly be needed for UTF16 to be a conformant encoding, then text files _may_ do this, because the prohibition on NUL _characters_ is not a prohibition on _bytes_ with the value zero. – JdeBP Nov 26 '19 at 15:27
  • 1
    @JdeBP, on POSIX systems no system locale can use a charset where characters other than NUL have byte 0 in their encoding, UTF-16 cannot be used as a POSIX locale charset. So here, byte 0 or NUL character is the same thing, though I agree that the mention of UTF16 is bringing some confusion in this Q&A. – Stéphane Chazelas Nov 26 '19 at 16:50
  • The tab, cr, nl delays may give a hint to its origin, but if a fill character was used, was it always `\0`? –  Nov 27 '19 at 00:57
  • @localuser, it may be, but it's not very relevant, I just gave it as an example where NULs are ignored. At the core is the fact that Unix is written in C (C was invented *for* UNIX), where _strings_ are 0-delimited arrays of bytes. So most of the Unix API and most shells can't cope with strings that contain NUL characters. – Stéphane Chazelas Nov 27 '19 at 05:45
  • You assert that without specific reference to the standard, which does not appear to back you up. Show that explicit restriction on multiple-byte character encodings. – JdeBP Nov 27 '19 at 09:07
  • 1
    @JdeBP, https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html requires NUL to be encoded as a single byte 0, and all characters from the portable charset to be encoded as a positive `char` value. POSIX bytes are 8 bit (https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/stdint.h.html). That precludes UTF-16. There may not be explicit text that says some other multibyte characters can't contain 0, but such a charset would be impractical as for instance, you couldn't pass strings encoded in that charset to `execve()` or most of the Unix API. – Stéphane Chazelas Nov 27 '19 at 09:21
-1

The reason for this behavior is a bit complex...

First, modern shells include a check for potentially binary files (files that contain null bytes), but this check only examines the first line of the file. This is why the '#' in the first line changes the behavior. The historical Bourne Shell does not have that binary check and does not even need the '#' to behave the way you mentioned.

Second, the specific method the Bourne Shell uses to support multi-byte characters via mbtowc() simply skips all null bytes: mbtowc() returns a character length of 0 for a null byte, and this causes the loop to retry with the next character.

The Bourne Shell introduced this kind of code around 1988, and other shells may have copied the behavior.

schily
  • 18,806
  • 5
  • 38
  • 60
  • That can't apply to `dash` though as `dash` is not multi-byte aware. – Stéphane Chazelas Nov 26 '19 at 14:26
  • You are correct, but this is the reason why Bourne Shell and ksh88 work this way. – schily Nov 26 '19 at 14:41
  • What we did intend with the new wording in POSIX is to permit binary content in a shell script in order to be able to implement self extracting scripts that contain e.g. a compressed TAR archive at the end. – schily Nov 26 '19 at 14:44
  • ksh93 may check more than the first line, like when the first line contains an unterminated statement (like a line containing `(`). Note that some more _modern_ shells like `fish`, `es`, `zsh` don't do that check (zsh can work with NULs) – Stéphane Chazelas Nov 26 '19 at 14:44
  • FWIW, it seems true that this exact behaviour was introduced by the original *Bourne* shell (the pre-Bourne shell won't allow NUL bytes within words; I have just tried both with the `apout` PDP-11 user-land simulator). It's not at all clear that it was intentional, though ;-) –  Nov 27 '19 at 01:52