7

How can I execute a command only if a certain file exceeds a defined size? The whole thing should end up running as a one-liner in crontab.

Pseudocode:

* * * * * find /cache/myfile.csv -size +5G && echo "file is > 5GB"
Kusalananda
membersound
  • If your goal is to trigger this when the file exceeds that size, but the file is only infrequently written to, you may want to use [incron](http://inotify.aiken.cz/?section=incron&page=about&lang=en) to trigger the check instead of running it every minute. – Austin Hemmelgarn May 17 '23 at 20:34
  • I don't look closely at the answers. Just FYI, beware of sparse files reported wrongly. These are not easy to handle. – akostadinov May 18 '23 at 00:02

4 Answers

11

If you have GNU stat, you can use its --printf option to get the file's size in bytes.

e.g.

size=$(stat --printf '%s' /cache/myfile.csv)
if [ "$size" -gt 5368709120 ] ; then  # 5 GiB = 5 * 1024 * 1024 * 1024
  echo "file is > 5GB"
fi

See man stat for details.


BSD's stat (e.g. on FreeBSD and on Mac) has a similar formatting option, -f:

size=$(stat -f '%z' /cache/myfile.csv)
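To tie this back to the crontab part of the question, the two stat variants can be folded into one check. This is only a sketch: file_exceeds is a made-up helper name, and it assumes that a stat which rejects -c is a BSD one. Also note that % is special in a crontab line (cron turns an unescaped % into a newline), so a format like %s has to be written as \%s there. The start.sh path in the comment is a placeholder.

```shell
# file_exceeds FILE BYTES  (hypothetical helper, not part of any tool)
# Returns 0 if FILE is larger than BYTES, 1 if not, 2 if stat failed.
file_exceeds() {
    # Try the GNU/busybox syntax first, then fall back to the BSD syntax.
    size=$(stat -c '%s' -- "$1" 2>/dev/null) ||
        size=$(stat -f '%z' -- "$1" 2>/dev/null) ||
        return 2
    [ "$size" -gt "$2" ]
}

# Inline crontab form with GNU stat (note the escaped \%s):
# * * * * * [ "$(stat -c \%s /cache/myfile.csv 2>/dev/null)" -gt 5368709120 ] 2>/dev/null && /path/to/start.sh
```

If the function lives in a script, the crontab entry only needs to call that script, which avoids the % escaping altogether.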

Alternatively, you could use perl's built-in stat function, or its -s file test operator (which is similar to bash's -s file test but it returns the file's size rather than just true if it exists and is non-empty). perl's stat function returns a 13-element list (array) of metadata about a file containing the following data (copied from perldoc -f stat):

[...] Not all fields are supported on all filesystem types. Here are
the meanings of the fields: 

  0 dev      device number of filesystem
  1 ino      inode number
  2 mode     file mode  (type and permissions)
  3 nlink    number of (hard) links to the file 
  4 uid      numeric user ID of file's owner
  5 gid      numeric group ID of file's owner
  6 rdev     the device identifier (special files only) 
  7 size     total size of file, in bytes
  8 atime    last access time in seconds since the epoch
  9 mtime    last modify time in seconds since the epoch
 10 ctime    inode change time in seconds since the epoch (*)
 11 blksize  preferred I/O size in bytes for interacting with the
             file (may vary from file to file)
 12 blocks   actual number of system-specific blocks allocated
             on disk (often, but not always, 512 bytes each) 

(The epoch was at 00:00 January 1, 1970 GMT.)

Field 7 is the one we need.

To return the file's size (for later use in a shell command or script) using stat or -s:

# stat
perl -e 'print scalar((stat(shift))[7])' /cache/myfile.csv

# -s
perl -e 'print -s shift' /cache/myfile.csv

Or to do it all in perl:

# stat
perl -e 'print "File is > 5 GiB\n" if (stat(shift))[7] > 5*1024*1024*1024' /cache/myfile.csv

# -s
perl -e 'print "File is > 5 GiB\n" if -s shift > 5*1024*1024*1024' /cache/myfile.csv

See perldoc -f stat and perldoc -f -X (as well as help test in bash).

BTW, perl's shift function removes the first element of an array (by default @ARGV, the array of command line args, if not specified) and returns its value. It's often used in a loop to process all elements of an array, but here we're only interested in the first arg (the filename). See perldoc -f shift for details, including notes on lexical scope and use in a subroutine.

cas
  • The OP question is `if find-command was successful` – Gilles Quénot May 16 '23 at 13:40
  • Yes, but the OP is asking the wrong question, using the wrong tool for the job. If you want to measure your door's height or width, you use a tape-measure or a ruler, not a bucket or a hammer. Similarly, if you want to know the size of a file, you use `stat`, not `find` (and not `ls` either). Part of our job when answering a question is to tell people when they're using the wrong tool or asking the wrong question, to find the underlying task hidden beneath the [XY Problem](https://xyproblem.info/). – cas May 16 '23 at 13:45
  • @cas If you want to test the size of a file _portably_, then `find` is the correct tool (albeit not with "+5G" as the argument to `-size`). – Kusalananda May 17 '23 at 13:08
  • Portability was never part of the question. It's tagged `linux`, and linux means GNU tools on everything but tiny distros with only busybox available (and even busybox stat has a `-c` formatting option with `%s` meaning size in bytes just like GNU stat). More to the point, `find` is the wrong tool for getting metadata about a file such as the file's size. That's `stat`'s job, it's what it's for. If stat didn't have formatting options, the next best option is not find, it's perl with its built-in `stat()` function because that's a trivial one-liner compared to a dozen or so lines in C. – cas May 17 '23 at 15:48
  • e.g. `perl -e '$size = (stat(shift))[7]; print $size' /cache/myfile.csv` – cas May 17 '23 at 15:49
  • or, even shorter, `perl -e 'print scalar((stat(shift))[7])' /cache/myfile.csv` – cas May 17 '23 at 15:57
  • OP asked for a one-liner for use in crontab. How can this be used in the way OP asked? – marcelm May 17 '23 at 20:19
  • they already know how to use crontab, they didn't know how to test the file size. – cas May 17 '23 at 22:43
  • the perl version can be even shorter - perl's `-s` file test returns the size of the file (bash's `-s` test only returns true if the file exists and is not empty, false otherwise), so extracting the size from the list returned by `stat()` isn't necessary. e.g. `perl -e 'print -s shift' filename` to output the size for use in shell, or do it all in perl with `perl -e 'print "File is > 5GB\n" if -s shift > 5*1024*1024*1024' filename`. See `perldoc -f -X` for docs on perl's file tests (and `help test` in bash for bash's file tests). – cas May 17 '23 at 23:03
8

To use the file size as a precondition you can use stat or find:

[ -n "$(find /cache/myfile.csv -prune -size +5G 2>/dev/null)" ] && echo "file is > 5GB"

Or, if the target command (echo, here) is short, put it into the -exec part of find:

find /cache/myfile.csv -prune -size +5G -exec echo "file is > 5GB" \;

The -prune is in case myfile.csv might be a file of type directory, to prevent find from descending into it.
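The reason the output is tested, rather than chaining with && as in the question's pseudocode, is that find exits with status 0 whether or not anything matched; a non-zero status only signals an error. Below is a minimal sketch of the same check as a reusable function. The name larger_than is made up for illustration, and it uses the byte suffix c instead of G, since the G suffix is not available in every find.

```shell
# larger_than FILE BYTES  (hypothetical helper name)
# find exits 0 whether or not anything matched, so test its output
# rather than its exit status. The "c" suffix means bytes.
larger_than() {
    [ -n "$(find "$1" -prune -size +"$2"c 2>/dev/null)" ]
}

# Example: act only when the file is above 5 GiB (5368709120 bytes).
# The path comes from the question; echo stands in for the real job.
if larger_than /cache/myfile.csv 5368709120; then
    echo "file is > 5 GiB"
fi
```

A file name starting with a dash would be taken as an option by find, so this sketch assumes an absolute path or one prefixed with ./.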

Stéphane Chazelas
roaima
4

If you need to process the file in a shell, both versions below run the shell command only if all conditions are met: the entry is a regular file, is named myfile.csv, and is larger than 5 GiB:

find /cache -name 'myfile.csv' -type f -size +5G -exec bash -c '
    echo "$1 is > 5GB"
' bash {} \;

or

find /cache -name 'myfile.csv' -type f -size +5G -exec bash -c '
    for file; do echo "$file is > 5GB"; done
' bash {} +
Gilles Quénot
  • I don't want to iterate the files. I just want to use this as a precondition before starting another process. I could as well write `find .... +5G && start.sh`. So, only start the 2nd command if the find command found the file which was above a certain size. – membersound May 16 '23 at 13:15
  • So use the first version, and replace `echo` by `start.sh` – Gilles Quénot May 16 '23 at 13:16
  • if you don't want to iterate over files, then don't use find. You could use `stat` instead. – cas May 16 '23 at 13:23
  • @membersound If `/cache/myfile` isn't a directory, neither command in the answer will do much iterating. Using `find` is about the only portable way of conditionally executing a command based on the size of a file. – Kusalananda May 16 '23 at 13:25
  • @Kusalananda, for readable files, `wc -c` can get the size of a file portably (though not always as efficiently in the `wc` implementations that don't do optimisations when the size of the file can be obtained other than by reading it). – Stéphane Chazelas May 17 '23 at 16:43
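The `wc -c` approach from the last comment might look roughly like this. It is a sketch only: size_via_wc is a made-up name, the file must be readable, and some wc implementations may read the whole file to count it, which would be slow for a multi-gigabyte file.

```shell
# size_via_wc FILE  (hypothetical helper name)
# Prints the size of a readable file in bytes using only POSIX tools.
size_via_wc() {
    # The braces let 2>/dev/null also hide the shell's own error
    # message when the file cannot be opened for reading.
    bytes=$( { wc -c < "$1"; } 2>/dev/null ) || return 1
    # Some wc implementations pad the number with blanks; the
    # arithmetic expansion strips them.
    echo "$((bytes))"
}

# Possible use (path from the question, 5 GiB limit):
# [ "$(size_via_wc /cache/myfile.csv)" -gt 5368709120 ] && echo "file is > 5 GiB"
```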
2

Note that some shells have the feature built-in.

SHELL=/bin/tcsh
* * * * * if (-Z /cache/myfile.csv > 5*1024*1024*1024) echo 'file is > 5GiB'

Or with zsh, here using glob qualifiers and an anonymous function, though zsh also has a stat builtin that predates both GNU and BSD stat:

SHELL=/bin/zsh
* * * * * (){ if (($#)) echo 'file is > 5GiB'; } /cache/myfile.csv(NLG+5)

(note that like for find -size +5G, we're talking of gibibytes (1GiB = 1,073,741,824 bytes) here, not gigabytes (1GB = 1,000,000,000 bytes))

For symlinks, tcsh gets the size of the file the link eventually resolves to, while zsh's LG+5 qualifier, like find's -size, checks the size of the symlink itself. Change it to -LG+5 to check the size after symlink resolution. zsh's stat builtin gives you information after symlink resolution by default; use -L to change that. In GNU and BSD stat, that's reversed. The same goes for find, where -L tells it to follow symlinks.
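The find side of that symlink behaviour can be checked with a throwaway link. This is a small sketch with made-up file names (big, link) in a temporary directory; it assumes the symlink's own size is well under 50 bytes, which is typical since it usually equals the length of the target path.

```shell
# Build a throwaway symlink to a 100-byte file (names are made up).
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/big" bs=100 count=1 2>/dev/null
ln -s big "$dir/link"    # the link itself is only a few bytes long

# Default: find looks at the symlink itself, so nothing matches.
no_follow=$(find "$dir/link" -prune -size +50c)

# With -L, find follows the link and sees the 100-byte target.
follow=$(find -L "$dir/link" -prune -size +50c)

printf 'without -L: [%s]\nwith -L:    [%s]\n' "$no_follow" "$follow"
rm -rf "$dir"
```

So if /cache/myfile.csv might be a symlink, add -L to the find-based checks to test the size of the file it points to.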

For more ways to get the size of a file, see How can I get the size of a file in a bash script?

Stéphane Chazelas