
Is there a utility that I can stick in a pipeline to decouple read and write speeds?

$ producer | buf | consumer

Basically, I want a utility buf that reads its input as fast as possible, storing it in memory so consumer can take its sweet time while producer runs as fast as possible.

Doctor J

6 Answers


The pv (pipe viewer) utility can do this (with the -B option) and a lot more, including giving you progress reports.

David Schwartz
  • Is there a way to do this with an unbounded amount of data? As best as I can tell, I need to supply a number with -B and if the producer gets that far ahead of the consumer, the producer will slow down again. If you're in a situation where there are multiple consumers (`producer | tee >(pv -cB $SIZE | consumer1) | pv -cB $SIZE2 | consumer2`), this can cause slowdowns again. – Daniel H Jun 17 '13 at 16:40
  • I've used `pv` hundreds of times and never knew this. Very awesome, Thank you! – Rucent88 Aug 08 '14 at 16:40
  • `pv -B 4096 -c -N in /dev/zero | pv -q -B 1000000000 | pv -B 4096 -c -N out -L 100k > /dev/null` - I expect both `pv`s on ends to be smooth (although one being 1GB ahead). It doesn't work this way, unlike with `mbuffer` – Vi. Dec 30 '15 at 22:17

You can use dd:

producer | dd obs=64K | consumer

It's available on every Unix.
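A toy illustration (the printf stands in for your producer):

```shell
# dd reads input as it arrives and re-blocks it into 64K output
# blocks; 2>/dev/null hides dd's transfer statistics on stderr.
printf 'line1\nline2\n' | dd obs=64K 2>/dev/null
# prints the two lines unchanged
```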

mikeserv
Michał Šrajer
  • +1 for using a standard utility, although `pv` is probably nicer to use (shows progress). – Totor Mar 17 '13 at 15:17
  • Does that actually decouple the reading and writing speed? It seems like `dd` only stores one block at a time, so it would just delay everything by the amount of time it takes to produce the block size; please correct me if I'm wrong. Also, can this buffering be extended to unlimited size, or only whatever's entered for the block size? – Daniel H Jun 17 '13 at 16:45
  • @DanielH - it does now. – mikeserv Dec 30 '15 at 22:54

Take a look at mbuffer. It can buffer to memory or memory mapped file(-t/-T).

  • As I asked for the others, is there a way to tell it to buffer as much as is necessary, or does it have a maximum size? Is there a conceptual reason why most of these programs do have maximum sizes and don't, for example, use a linked list of smaller buffers (or any other arbitrary-size queue implementation)? – Daniel H Jun 19 '13 at 22:11
  • Probably to prevent out-of-memory errors. You can probably use an option to set a very large buffer (4GB or so) if you want so (try it). – David Balažic Jan 19 '16 at 13:13

This is basically a negative answer. It appears that neither dd, nor mbuffer, nor even pv works in all cases, in particular if the rate of data generated by the producer can vary a lot. I give some test cases below. After typing each command, wait for about 10 seconds, then type > in less (to go to the end of the data, i.e. to wait for the end of the input).

zsh -c 'echo foo0; sleep 3; \
        printf "Line %060d\n" {1..123456}; \
        echo foo1; sleep 5; \
        echo foo2' | dd bs=64K | less

Here, after typing >, one has to wait for 5 seconds, meaning that the producer (zsh script) has blocked before the sleep 5. Increasing the bs size to e.g. 32M doesn't change the behavior, though the 32MB buffer is large enough. I suspect that this is because dd blocks on output instead of going on with the input. Using oflag=nonblock is not a solution because this discards data.

zsh -c 'echo foo0; sleep 3; \
        printf "Line %060d\n" {1..123456}; \
        echo foo1; sleep 5; \
        echo foo2' | mbuffer -q | less

With mbuffer, the problem is that the first line (foo0) doesn't appear immediately. There doesn't seem to be any option to enable line-buffering on input.

zsh -c 'echo foo0; sleep 3; \
        printf "Line %060d\n" {1..123456}; \
        echo foo1; sleep 5; \
        echo foo2' | pv -q -B 32m | less

With pv, the behavior is similar to dd. Worse, I suspect that it interferes with the terminal, since sometimes less can no longer receive input from it; for instance, one cannot quit less with q.

vinc17

Nonstandard move: using socket buffers.

Example:

# echo 2000000000 > /proc/sys/net/core/wmem_max
$ socat -u system:'pv -c -N i /dev/zero',sndbuf=1000000000 - | pv -L 100k -c -N o > /dev/null
        i:  468MB 0:00:16 [ 129kB/s] [  <=>                        ]
        o: 1.56MB 0:00:16 [ 101kB/s] [       <=>                   ]

I implemented two additional tools for this: buffered_pipeline and mapopentounixsocket

$ ./buffered_pipeline ! pv -i 10 -c -N 1 /dev/zero ! $((20*1000*1000)) ! pv -i 10 -L 100k -c -N 2 ! > /dev/zero
        1: 13.4MB 0:00:40 [ 103kB/s] [         <=>      ]
        2: 3.91MB 0:00:40 [ 100kB/s] [         <=>      ]
Vi.

When using pv with --buffer-size|-B, you probably also want to use the --no-splice|-C option to prevent pv from using the splice(2) syscall.

From the pv man page:

-C, --no-splice
    Never use splice(2), even if it would normally be possible. The splice(2) system call is a more efficient way of transferring data from or to a pipe than regular read(2) and write(2), but means that the transfer buffer may not be used. This prevents -A and -T from working, so if you want to use -A or -T then you will need to use -C, at the cost of a small loss in transfer efficiency. (This option has no effect on systems where splice(2) is unavailable).

As far as I'm aware, splice is available on Linux, so to use pv as a buffer there, the -C option should be used.

Therefore your final solution should be something like this:

$ producer | pv -C -B 1G | consumer
mateusz