
I want to use files from the World Wide Web as prerequisites in my makefiles:

local.dat: http://example.org/example.gz
    curl -s $< | gzip -d | transmogrify >$@

I only want to "transmogrify" if the remote file is newer than the local file, just like make normally operates.

I do not want to keep a cached copy of example.gz - the files are large, and I don't need the raw data. Preferably I would want to avoid downloading the file at all. The goal is to process a few of these in parallel using the -j make flag.

What is a clean way to solve this? I can think of a few ways to go:

  • Keep an empty dummy file stashed away, updated every time the target is recreated
  • Some plugin using GNU make's new plugin system (which I know nothing about)
  • A make-agnostic way that mounts HTTP servers in the local filesystem

Before digging further, I would like some advice, preferably specific examples!

pipe

2 Answers


Try something like this in your Makefile:

.PHONY: local.dat

local.dat:
    [ -e example.gz ] || touch -d '00:00' example.gz
    curl -z example.gz -s http://example.org/example.gz -o example.gz
    [ -e $@ ] || touch -d 'yesterday 00:00' $@
    if [     "$(shell stat --printf '%Y' example.gz)" \
         -gt "$(shell stat --printf '%Y' $@)"         ] ; then \
      zcat example.gz | transmogrify >$@ ; \
    fi
    truncate -s 0 example.gz
    touch -r $@ example.gz

(Note: this is a Makefile, so the indents are tabs, not spaces. It is also important that there are no spaces after the `\` on the continuation lines; alternatively, drop the backslash escapes and make it one long, almost-unreadable line.)

This GNU make recipe first checks that a file called example.gz exists (because we're going to be using it with -z in curl), and creates it with touch if it doesn't. The touch creates it with a timestamp of 00:00 (12am of the current day).

Then it uses curl's -z (--time-cond) option to only download example.gz if it has been modified since the last time it was downloaded. -z can be given an actual date expression, or a filename. If given a filename, it will use the modification time of the file as the time condition.

After that, if local.dat doesn't exist, it creates it with touch, using a timestamp guaranteed to be older than that of example.gz. This is necessary because local.dat has to exist for the next command to use stat to get its mtime timestamp.

Then, if example.gz has a timestamp newer than local.dat, it pipes example.gz into transmogrify and redirects the output to local.dat.
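That comparison can be tried on its own in a shell, outside of make (assuming GNU coreutils `touch` and `stat`; the file names here are throwaway stand-ins, not part of the recipe):

```shell
# Recreate the recipe's situation: a stamp file newer than the target.
touch -d 'yesterday 00:00' old.dat   # stands in for local.dat
touch -d '00:00' new.gz              # stands in for a freshly-downloaded example.gz

# stat --printf '%Y' prints the mtime as seconds since the epoch,
# so a plain numeric -gt test compares the two timestamps.
if [ "$(stat --printf '%Y' new.gz)" -gt "$(stat --printf '%Y' old.dat)" ]; then
  echo 'newer - would run transmogrify'   # prints: newer - would run transmogrify
fi
```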

Finally, it does the bookkeeping & cleanup stuff:

  • it truncates example.gz (because you only need to keep a timestamp, and not the whole file)
  • touches example.gz so that it has the same timestamp as local.dat
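Those two bookkeeping steps can be demonstrated in isolation (again assuming GNU `truncate`, `touch` and `stat`; the files below are created just for the demo):

```shell
printf 'pretend gzip payload' > example.gz   # stand-in for the downloaded file
touch local.dat                              # stand-in for the freshly-built target

truncate -s 0 example.gz        # drop the payload; the file and its inode remain
touch -r local.dat example.gz   # -r copies local.dat's mtime onto example.gz

stat --printf '%s\n' example.gz   # prints: 0
```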

The .PHONY target ensures that the local.dat target is always executed, even if the file of that name already exists.
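For reference, the effect of `.PHONY` can be seen with a minimal standalone Makefile (hypothetical target name, not part of the recipe above):

```make
# Even if a file named 'always' exists in the directory, make still
# runs this recipe on every invocation, because the target is phony.
.PHONY: always
always:
	@echo rebuilt
```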

Thanks to @Toby Speight for pointing out in the comments that my original version wouldn't work, and why.

Alternatively, if you want to pipe the file directly into transmogrify without downloading it to the filesystem first:

.PHONY: local.dat

local.dat:
    [ -e example.gz ] || touch -d '00:00' example.gz
    [ -e $@ ] || touch -d 'yesterday 00:00' $@
    if [     "$(shell stat --printf '%Y' example.gz)" \
         -gt "$(shell stat --printf '%Y' $@)"         ] ; then \
      curl -z example.gz -s http://example.org/example.gz | gzip -d | transmogrify >$@ ; \
    fi
    touch -r $@ example.gz

NOTE: this is mostly untested so may require some minor changes to get the syntax exactly right. The important thing here is the method, not a copy-paste cargo-cult solution.

I have been using variations of this method (i.e. touch-ing a timestamp file) with make for decades. It works, and usually allows me to avoid having to write my own dependency resolution code in sh (although I've had to do something similar with stat --printf %Y here).

Everyone knows make is a great tool for compiling software... IMO it's also a very much under-rated tool for system admin and scripting tasks.

cas
    The `-z` flag, of course, assumes that the remote server uses `If-Modified-Since` headers. This might not necessarily be the case. Depending on the server setup, you might instead need to do something with `ETag`, or by checking `Cache-Control` headers, or by checking a separate checksum file (e.g. if the server provides a `sha1sum`). – Bob Feb 20 '18 at 05:37
  • yes, it does. but without that, there's no way at all of doing what the OP wants (unless he's willing to download the huge file to a temp file **every** time he runs `make`, use `cmp` or something to compare old and new files, and `mv newfile oldfile` if they're different). BTW, cache-control headers don't tell you if the file is newer than a given time. they tell you how long the server admins want you to cache a given file for - and are often used by marketing droids as a cache-busting practice to "improve" their web stats. – cas Feb 20 '18 at 05:45
  • `ETag` *is* another way of doing it, as is a separate checksum file. It all depends on how the server is set up. For example, one might fetch https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/SHA1SUMS and check if it has changed before deciding to fetch the full ISO. ETag does the same thing, using a header instead of a separate file (and, like `If-Modified-Since`, relies on the HTTP server implementing it). `Cache-Control` would be a last-resort option short of downloading the file if no other methods are supported - it's certainly the least accurate as it tries to predict the future. – Bob Feb 20 '18 at 06:20
  • Arguably, `ETag`/`If-None-Match` and other checksums are more reliable than `If-Modified-Since`, too. In any case, these comments just try to lay out the assumptions of the answer (namely, that `-z` assumes server support) - the basic method should be fairly easy to adapt to other change-checking algorithms. – Bob Feb 20 '18 at 06:21
  • 1
    feel free to write an answer implementing a solution based on ETag. If it's any good, i'll upvote it. and then someone will come along and point out that not all web servers provide an Etag header :). – cas Feb 20 '18 at 06:23
  • I commented instead of adding an answer because it would effectively be a duplicate of yours, just substituting the `curl -z` line with a `curl --header "If-None-Match: $etag"`, [plus a bit of parsing](https://stackoverflow.com/a/12475760/1030702) to retrieve the etag from the response and save/load it. You still have my vote - there's nothing 'wrong' with this answer. – Bob Feb 20 '18 at 06:27
  • @Bob Sometimes the 80% solution is good enough. implementing the perfect solution that covers all cases will take much more time and much more code. and then someone will still come along and point out a few cases that you didn't account for. And, like I said, my answer was about teaching a general technique rather than providing some cargo-cultable code. – cas Feb 20 '18 at 06:29
  • I think there's something slightly amiss here. If we successfully make `local.dat`, then the upstream resource is changed, there's nothing that will cause it to be re-fetched - unless you add a `.PHONY: %.timestamp`. But that has its own problems, as then it causes `local.dat` to *always* be re-built. I usually end up with a two-pass make, where the first pass creates the timestamp files (`ifdef DOWNLOAD` / `.PHONY: %.timestamp` / `endif`) and the second pass (without `.PHONY`) builds the dependent files. – Toby Speight Feb 20 '18 at 10:23
  • Thanks, this is a good start. Ideally I would prefer not having to download the whole file first, but rather pipe it through my tool on-the-fly. It plays much nicer when using parallel make with the `-j` option. Maybe the whole file doesn't have to be downloaded if the `--head` option to curl is used. – pipe Feb 20 '18 at 20:32
  • Actually, it should be possible with `tee`, hmm... Maybe not. – pipe Feb 20 '18 at 20:33
  • @pipe re downloading to disk vs just piping it - that saves on disk space but uses more RAM. RAM is typically in much shorter supply than disk. i.e. downloading it and then deleting it is a way to minimise RAM use at the cost of some temporary use of disk space. If you're worried about wear on SSDs, don't be - that hasn't been an issue for years now, the lifetime of a modern SSD is at least as long as HDD (and even many times longer). Also, files are seekable while pipes are not, which makes it easier to split into chunks for parallel processing - but compression will complicate that. – cas Feb 21 '18 at 00:43
  • BTW, if you really want to pipe it you still can. Just touch a timestamp file and use that with `curl -z timestampfile -s URL | transmogrify`. – cas Feb 21 '18 at 00:48
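Spelling out the pipe variant from the last comment as a recipe might look like this (an untested sketch: `example.stamp` is a hypothetical name, and it inherits the caveat that when the server reports "not modified", curl emits nothing and `transmogrify` would overwrite `local.dat` with empty output, so a guard against that is still needed in practice):

```make
.PHONY: local.dat
local.dat:
	[ -e example.stamp ] || touch -d '00:00' example.stamp
	curl -z example.stamp -s http://example.org/example.gz | gzip -d | transmogrify >$@
	touch example.stamp
```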

Another alternative is to use a build system that uses dependency checksums to determine whether to trigger rebuilds. I've used the "touch" trick with GNU Make a lot, but it's much simpler when you can specify dynamic dependencies and when files that don't change don't trigger rebuilds. Here's an example using GoodMake:

#! /usr/local/goodmake.py /bin/sh -se

#! *.date
    # Get the last-modified date
    curl -s -v -X HEAD http://${1%.date} 2>&1 | grep -i '^< Last-Modified:' >$1

#? local.dat
    site=example.org/example.gz
    $0 $site.date
    curl -s $site | gzip -d | transmogrify >$1
  • Instead of `-X HEAD`, curl's manpage recommends using `-I`: "(-X) only changes the actual word used in the HTTP request, it does not alter the way curl behaves. So for example if you want to make a proper HEAD request, using -X HEAD will not suffice. You need to use the -I,--head option." – LightStruk Mar 22 '19 at 17:15
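Following the comment's suggestion, the header-grabbing step could use `-I` instead. Note that with `-I` the headers arrive on stdout without the `< ` prefix that `-v` adds, so the grep pattern loses the `< `. Here is a sketch with a canned response, so the parsing step can be shown without a network call (in the real recipe the headers would come from something like `curl -I -s http://example.org/example.gz`):

```shell
# Canned HEAD response standing in for: curl -I -s http://example.org/example.gz
printf 'HTTP/1.1 200 OK\r\nLast-Modified: Wed, 01 Jan 2020 00:00:00 GMT\r\n' |
  grep -i '^last-modified:' > example.gz.date

cat example.gz.date   # the saved Last-Modified header line
```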