15

The following shell command was expected to print only the odd-numbered lines of its input stream:

echo -e "aaa\nbbb\nccc\nddd\n" | (while true; do head -n 1; head -n 1 >/dev/null; done)

But instead it just prints the first line: aaa.

The same doesn't happen when it is used with the -c (--bytes) option:

echo 12345678901234567890 | (while true; do head -c 5; head -c 5 >/dev/null; done)

This command outputs 1234512345 as expected. But this works only in the coreutils implementation of the head utility. The busybox implementation still eats extra characters, so the output is just 12345.

I guess the implementation works this way for performance reasons. You can't know where a line ends, so you don't know how many characters you need to read. The only way not to consume extra characters from the input stream is to read it byte by byte, but reading one byte at a time may be slow. So I guess head reads the input stream into a large enough buffer and then counts lines in that buffer.
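For comparison, a sketch of the byte-at-a-time alternative using the shell's read builtin (which most shells implement with one-byte reads on pipes precisely so it never consumes past the newline):

```shell
# read consumes exactly one line from the pipe, so the next consumer
# starts right after the newline.
printf 'aaa\nbbb\nccc\nddd\n' | (
    while IFS= read -r line; do
        printf '%s\n' "$line"     # print an odd-numbered line
        IFS= read -r _ || break   # discard the following even-numbered line
    done
)
```

This prints aaa and ccc, i.e. the behaviour the head version was expected to have.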

The same can't be said for the case when the --bytes option is used. In this case you know how many bytes you need to read, so you may read exactly that number of bytes and no more. The coreutils implementation takes advantage of this, but the busybox one does not: it still reads more bytes than required into a buffer, probably to simplify the implementation.
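A sketch of the one way to guarantee that no more than a given number of bytes is consumed from a pipe, using the standard dd utility with bs=1 (one-byte reads, so exactly count bytes are taken):

```shell
# dd with bs=1 issues one-byte read()s, so exactly count=5 bytes are
# consumed from the pipe; everything after them is left for the next reader.
printf '1234567890' | {
    dd bs=1 count=5 2>/dev/null   # consumes exactly "12345"
    echo                          # newline for readability
    cat                           # the rest of the pipe is intact: "67890"
}
```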

So the question. Is it correct for the head utility to consume more characters from the input stream than it was asked? Is there some kind of standard for Unix utilities? And if there is, does it specify this behavior?

PS

You have to press Ctrl+C to stop the commands above: the Unix utilities do not fail when reading at EOF, head simply outputs nothing, so the loop never terminates. If you don't want to press it, you may use a more complex command:

echo 12345678901234567890 | (while true; do head -c 5; head -c 5 | [ `wc -c` -eq 0 ] && break; done)

which I didn't use for simplicity.

Gilles 'SO- stop being evil'
anton_rh
  • Near-duplicate of https://unix.stackexchange.com/questions/48777/command-to-display-first-few-and-last-few-lines-of-a-file and https://unix.stackexchange.com/questions/84011/reusing-pipe-data-for-different-commands . Also, if this title had been on movies.SX my answer would be _Zardoz_ :) – dave_thompson_085 Dec 08 '17 at 01:16

3 Answers

30

Is it correct for the head utility to consume more characters from the input stream than it was asked?

Yes, it’s allowed (see below).

Is there some kind of standard for Unix utilities?

Yes, POSIX volume 3, Shell & Utilities.

And if there is, does it specify this behavior?

It does, in its introduction:

When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility. For files that are not seekable, the state of the file offset in the open file description for that file is unspecified.

head is one of the standard utilities, so a POSIX-conforming implementation has to implement the behaviour described above.

GNU head does try to leave the file descriptor in the correct position, but it’s impossible to seek on pipes, so in your test it fails to restore the position. You can see this using strace:

$ echo -e "aaa\nbbb\nccc\nddd\n" | strace head -n 1
...
read(0, "aaa\nbbb\nccc\nddd\n\n", 8192) = 17
lseek(0, -13, SEEK_CUR)                 = -1 ESPIPE (Illegal seek)
...

The read returns 17 bytes (all the available input), head processes four of those and then tries to move back 13 bytes, but it can’t. (You can also see here that GNU head uses an 8 KiB buffer.)

When you tell head to count bytes (which is non-standard), it knows how many bytes to read, so it can (if implemented that way) limit its read accordingly. This is why your head -c 5 test works: GNU head only reads five bytes and therefore doesn’t need to seek to restore the file descriptor’s position.
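You can see the effect in the shell (assuming GNU head; as noted, busybox head would still consume a full buffer here):

```shell
# GNU head -c reads only the requested five bytes, leaving the rest
# of the pipe for the next consumer (implementation-specific behaviour).
printf '1234567890' | { head -c 5; echo; cat; }
```

With coreutils this prints 12345 and then 67890, showing that the remaining bytes were left in the pipe.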

If you write the document to a file, and use that instead, you’ll get the behaviour you’re after:

$ echo -e "aaa\nbbb\nccc\nddd\n" > file
$ < file (while true; do head -n 1; head -n 1 >/dev/null; done)
aaa
ccc
Stephen Kitt
  • One can use the `line` (now removed from POSIX/XPG but still available on many systems) or `read` (`IFS= read -r line`) utilities instead which read one byte at a time to avoid the problem. – Stéphane Chazelas Dec 07 '17 at 13:01
  • Note that whether `head -c 5` will read 5 bytes or a full buffer depends on the implementation (also note that `head -c` is not standard), you can't rely on that. You'd need `dd bs=1 count=5` to have a guarantee that no more than 5 bytes will be read. – Stéphane Chazelas Dec 07 '17 at 13:05
  • Thanks @Stéphane, I’ve updated the `-c 5` description. – Stephen Kitt Dec 07 '17 at 13:07
  • Note that the `head` builtin of `ksh93` reads one byte at a time with `head -n 1` when the input is not seekable. – Stéphane Chazelas Dec 07 '17 at 13:11
  • @Stéphane Chazelas, thank you for pointing to `dd` utility (I wanted to use it, but incorrectly thought it didn't work with pipes). It does the job. Originally I needed to discard some parts of binary data (discard UV-component of yuv420p frames keeping just Y-component) generated by one program and pass the processed data (grayscale frames) to second program. I had to modify the second program to work with unmodified data. And I know now that I could just use `dd`. – anton_rh Dec 07 '17 at 14:24
  • @anton_rh heh, so you’d have had better answers if you’d explained what you were really trying to achieve ;-). `dd` is your best bet for byte manipulation, among the standard utilities. – Stephen Kitt Dec 07 '17 at 14:26
  • @Stéphane Chazelas, yes, probably :) But I solved my problem anyway (by modifying the second program). So I was just interested only in the theoretical part of the problem. And the Stephen's great answer completely answers to my question. – anton_rh Dec 07 '17 at 14:37
  • @anton_rh, `dd` only works correctly with pipes with `bs=1` if you use a `count` as reads on pipes may return less than requested (but at least one byte unless eof is reached). GNU `dd` has `iflag=fullblock` that can alleviate that though. – Stéphane Chazelas Dec 07 '17 at 19:38
  • @Stéphane Chazelas, oh, I understand. `count` is the number of read operations, not the number of full blocks to be read. And some read operations may read fewer bytes than specified in `bs` (if a full block is not available on the input at the moment), and thus some blocks may be incomplete ("not full"). Thank you for this useful annotation. This changes everything. But doesn't it make copying the input stream slow? Does `bs=1` make `dd` read just one byte at once? – anton_rh Dec 08 '17 at 04:47
  • It seems that `bs=1` really does make `dd` read the input byte by byte. I ran this command: `echo 1234567890 | strace dd bs=1` and it resulted in many `read(0, "X", 1)` operations. This is not good. But thank you anyway. – anton_rh Dec 08 '17 at 04:56
6

From POSIX:

The head utility shall copy its input files to the standard output, ending the output for each file at a designated point.

It doesn't say anything about how much head may read from the input. Requiring it to read byte by byte would be silly, as that would be extremely slow in most cases.

This is, however, addressed in the read builtin/utility: all shells I can find read from pipes one byte at a time, and the standard text can be interpreted to mean that this must be done, to be able to read just that one single line:

The read utility shall read a single logical line from standard input into one or more shell variables.

In the case of read, which is used in shell scripts, a common use case would be something like this:

read someline
if something ; then 
    someprogram ...
fi

Here, the standard input of someprogram is the same as that of the shell, and it can be expected that someprogram gets to read everything that comes after the first input line consumed by read, not whatever happened to be left over after a buffered read. Using head as in your example, on the other hand, is much less common.
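A sketch of that expectation, with cat standing in for the hypothetical someprogram:

```shell
# read consumes exactly the first line; cat, sharing the same stdin,
# then sees everything after that line -- not whatever a buffered
# read might have left over.
printf 'first line\nrest of the input\n' | {
    IFS= read -r someline
    echo "shell got: $someline"
    cat    # stands in for someprogram
}
```

The cat here prints "rest of the input", exactly the data following the line the shell consumed.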


If you really want to delete every other line, it would be better (and faster) to use some tool that can handle the whole input in one go, e.g.

$ seq 1 10 | sed -ne '1~2p'   # GNU sed
$ seq 1 10 | sed -e 'n;d'     # works in GNU sed and the BSD sed on macOS

$ seq 1 10 | awk 'NR % 2' 
$ seq 1 10 | perl -ne 'print if $. % 2'
ilkkachu
  • But see the “INPUT FILES” section of [the POSIX introduction to volume 3](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap01.html)... – Stephen Kitt Dec 07 '17 at 12:54
  • POSIX says: *"When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility. **For files that are not seekable, the state of the file offset in the open file description for that file is unspecified.**"* – AlexP Dec 07 '17 at 13:01
  • Note that unless you use `-r`, `read` may read more than one line (without `IFS=` it would also strip leading and trailing spaces and tabs (with the default value of `$IFS`)). – Stéphane Chazelas Dec 07 '17 at 13:08
  • @AlexP, yes, Stephen just linked that part. – ilkkachu Dec 07 '17 at 13:09
  • Note that the `head` builtin of `ksh93` reads one byte at a time with `head -n 1` when the input is not seekable. – Stéphane Chazelas Dec 07 '17 at 13:12
1
awk '{if (NR % 2 == 1) print}'
peterh
ijbalazs
  • Hellóka :-) and welcome to the site! Note that we prefer more elaborate answers; they should be useful for the googlers of the future. – peterh Dec 08 '17 at 16:53