I need to download and decompress a file as quickly as possible in a very latency-sensitive environment with limited resources (a VM with 1 CPU, 2 cores, 128 MB RAM).
Naturally, I tried to pipe the download into the decompression process, on the assumption that I could decompress while downloading. I know that a pipe is throttled by its slowest process, so to overcome this I put a buffer between the download and the decompression.
My shell script looks something like this:
curl -s $CACHE_URL | buffer -S 100M | lz4 -d > /tmp/myfile
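To sanity-check the buffered-pipe idea itself, here is a minimal local sketch that runs without network access: gzip stands in for lz4, a local file stands in for the download, and dd with a large block size acts as a crude in-memory buffer between producer and consumer (all three substitutions are assumptions for the sake of a self-contained example, not my actual setup):

```shell
#!/bin/sh
# Local stand-in for: curl -s $CACHE_URL | buffer -S 100M | lz4 -d > /tmp/myfile
set -e

# Make a compressible test payload and compress it.
head -c 1M /dev/zero > /tmp/payload
gzip -c /tmp/payload > /tmp/payload.gz

# Producer | buffer | consumer: dd just shovels bytes through with a
# 1 MB block size, decoupling the reader from the writer a little.
cat /tmp/payload.gz | dd bs=1M 2>/dev/null | gzip -d > /tmp/payload.out

# Verify the round trip reproduced the original bytes.
cmp /tmp/payload /tmp/payload.out && echo "round-trip OK"
```

The same producer | buffer | consumer shape is what my real script uses, just with curl, buffer(1), and lz4.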
If I instead download the compressed file first and then decompress it without piping, the download takes about 250ms and the decompression another 250ms when run sequentially.
My assumption is therefore that the piped approach should take around 250-275ms, since there is no extra disk read in between, and the download isn't CPU-bound like the decompression, so it shouldn't interfere much.
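The timings below come from a harness along these lines (a sketch: the real pipeline from my script is shown as comments, since $CACHE_URL, buffer(1), and the .lz4 file only exist on the VM; the sleep is a stand-in workload so the harness itself is demonstrable):

```shell
#!/bin/sh
# Millisecond timestamps via GNU date's nanosecond format.
now_ms() { echo $(( $(date +%s%N) / 1000000 )); }

t0=$(now_ms)
# Real command on the VM:
# curl -s "$CACHE_URL" | buffer -S 100M | lz4 -d > /tmp/myfile
sleep 0.05   # stand-in workload for this sketch
t1=$(now_ms)
echo "Download & decompressed done in $((t1 - t0))ms"
```

The sequential variant is timed the same way, with separate t0/t1 stamps around the curl step and the lz4 -d step.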
But it isn't. It's barely faster, as my logs show:
Start download
35211K, 81131K/s
Download & decompressed done in 447ms
Starting individual download & decompress
Download done in 234ms
Decompressed : 61 MiB
/tmp/myfile : decoded 75691880 bytes
Decompress done in 230ms
Am I thinking wrong here? Is there any other way to speed this up?