6

I am trying to understand why wc and stat report different things for /proc/[pid]/cmdline.

wc says my shell's cmdline file is 6 bytes in size:

$ wc --bytes /proc/$$/cmdline
6 /proc/10425/cmdline

stat says the file is 0 bytes in size:

$ stat --format='%s' /proc/$$/cmdline
0

file agrees with stat:

$ file /proc/$$/cmdline
/proc/10425/cmdline: empty

cat gives this output:

$ cat -vE /proc/$$/cmdline
-bash^@

All of this is on Linux rather than on any other *nix OS.

Do the stat and wc programs have a different algorithm for computing the number of bytes in a file?

Shane Bishop

2 Answers

11

The files under /proc are not your usual files, but virtual things created on the fly by the kernel. For most (all?) of them, the system doesn't bother calculating a size beforehand; a program reading one just gets whatever data there is to get.

The difference between what wc does and what stat and, e.g., ls do is that wc opens the file, reads it, and counts what it gets, while stat and ls use the stat() system call to ask the system for the file's metadata, including its size (as well as, e.g., the owner and permissions). For virtual files, the two approaches don't give the same result.
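
A minimal sketch of the two approaches, assuming Linux with /proc mounted and a stat that accepts -c (the short form of GNU stat's --format). The helper names size_by_stat and size_by_read are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: the size according to the stat() metadata.
size_by_stat() {
    stat -c '%s' "$1"
}

# Hypothetical helper: the size found by reading the data to the end.
# wc reads from a pipe here, so it cannot use a stat()-based shortcut.
size_by_read() {
    cat "$1" | wc -c
}

target=/proc/$$/cmdline
printf 'stat() metadata: %s bytes\n' "$(size_by_stat "$target")"
printf 'read and count:  %s bytes\n' "$(size_by_read "$target")"
```

For an ordinary file on disk the two helpers should agree; for the cmdline file the first reports 0 while the second reports however many bytes the kernel hands back.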

If you run e.g. ls -l /proc/$$/, you'll see a lot of files of size 0, even though most of them can be read for data.

Device nodes like /dev/sda are similar, though in their case ls doesn't even bother to show the size, but shows the device numbers instead.

With file in particular, you can use file -s to make it read the data regardless of whether it's a special file.

ilkkachu
  • 7
    Many `wc` implementations with `-c` resort to `stat()` as an optimisation to report the number of bytes in *regular* files. GNU `wc` has special cases to handle files in /proc and /sys since version 8.24. See https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=2662702b9e8643f62c670bbf2fa94b1be1ccf9af – Stéphane Chazelas Feb 01 '23 at 20:06
  • 3
    For `/dev/sda`, like any other file `stat` reports what `lstat()` returns. On Linux `lstat()` returns 0 in st_size for devices, but you'll find some other systems return the size of the underlying storage device there. See also [How can I get the size of a file in a bash script?](https://unix.stackexchange.com/a/321502) – Stéphane Chazelas Feb 01 '23 at 20:10
  • 2
    They are "regular files" in the technical meaning of not being fifos, sockets, devices, symlinks, or whatever else. They're not normal files, though, as you say: materialized from kernel data structures only on `read`, with `stat` being kept cheaper by not doing that and just showing their existence and permissions. – Peter Cordes Feb 02 '23 at 06:11
  • 2
    @StéphaneChazelas *Many `wc` implementations with -c resort to `stat`...* Oh, dear. Thanks for that note. That was an egregiously wrong "optimization", IMO; but I do need to know that some poor sod went out on a limb and actually implemented it. Me, if I want to know the — possibly wrong — information that `stat` will give, I use `stat(1)`. And if I have any suspicion that `stat`'s output might be wrong or misleading, I use `wc`: opening the file as a file, and reading and counting every byte, is what `wc` is *for*. Sheesh. – Steve Summit Feb 02 '23 at 14:02
  • @SteveSummit, it's a valid optimisation as long as it's not used on file systems that report incorrect information there. Technically, returning 0 could be seen as being not completely wrong, as those files don't contain anything until something starts to read from them (at which point the content is dynamically generated *for that reading process*). Many `/proc` files would return different contents depending on who/what requests their contents (and when). – Stéphane Chazelas Feb 02 '23 at 16:12
  • @StéphaneChazelas I understand, and this isn't the place for a long discussion on it, but I do a certain amount of low-level work, including on experimental filesystems that might be broken, so "as long as not used on file systems that report incorrect information" is just *not* an assumption I'm willing to make. And this is all the more relevant because, if I suspect that `st_size` has just given me a wrong answer, *the very first tool I instinctively reach for is `wc`!* But I guess I'm going to have to modify that instinct, because it's not safe any more. – Steve Summit Feb 02 '23 at 16:35
  • 1
    @SteveSummit, you can always do `cat file | wc -c` here. `cat` can't have this kind of optimisation. – Stéphane Chazelas Feb 02 '23 at 16:57
  • @StéphaneChazelas Ah, yes, I thought of that, too, but remember, you *can not say* `cat file | program` here on unix.stackexchange.com, because someone will immediately come along and scold you for being a newbie and lecture you that `program` is perfectly capable of opening the file itself! :-) – Steve Summit Feb 02 '23 at 17:17
  • @SteveSummit Oh, we just as often lecture people that the shell is perfectly capable of opening the file itself, hence ` – Charles Duffy Feb 02 '23 at 17:38
  • @CharlesDuffy, well ` – Stéphane Chazelas Feb 02 '23 at 18:47
  • 1
    Same problem with FreeBSD's `wc`. I find ast-open's `wc` does both `fstat()` and `lseek()`, presumably `fstat()` to determine the type and `lseek()` to seek to the end and get the offset there. In any case, they need to do a `lseek(SEEK_CUR)` first to know the current position within the file. – Stéphane Chazelas Feb 02 '23 at 18:55
  • 1
    @CharlesDuffy, actually ast-open's optimisation is also not right. It shouldn't seek to the end if the current position is past the end. See for instance `{ printf 1234 >&0; printf 12 > file; wc -c; printf 56 >&0; } 0<> file` outputting `-2` with that implementation and leaving `file` containing `1256` instead of `12^@^@56` – Stéphane Chazelas Feb 02 '23 at 19:04
  • "Premature optimization is the root of all evil". Or well, overzealous optimization here. That stuff is positively priceless. – ilkkachu Feb 02 '23 at 19:36
  • @StéphaneChazelas A related issue I've been thinking of reporting is that if you do `(lseek 10000; gzip) < file`, you get a false-positive warning from `gzip` saying "stdin: file size changed while zipping". (Admittedly, most people *don't* do that and so don't encounter the issue, because most people don't have or use a command-line `lseek`.) – Steve Summit Feb 02 '23 at 19:43
2

Yes, wc and stat have different algorithms for computing the number of bytes in a file.

wc counts the number of bytes in the file, which in this case is 6.

stat displays information about a file, including its size in bytes. However, /proc/[pid]/cmdline is not a typical file on a file system, but rather a virtual file in the proc file system. This file contains the command line arguments used to start the process with the given process ID. It is stored in memory and not on the file system, which means its size can be different from the actual number of bytes that have been written to it. This is why stat reports the size as 0.

The file command normally determines the type of a file from its contents, but it first looks at the size reported by stat(). Since that is 0, it reports /proc/[pid]/cmdline as "empty" without reading it.

Summarised, wc counts the actual number of bytes in the file, stat displays information about the file, and file determines the type of the file based on its contents.

Ben
  • 31
  • 1
  • 2
    Actually, the `cmdline` file is not "stored in memory", it's _not stored at all_ and that's precisely why the kernel can't easily tell you how long it is. Any time a `read()` call is done on such a file descriptor, the data is synthesized on the fly and returned to the calling application. If the kernel wanted to tell you how long the file is, it would have to perform a fake internal read and count the bytes in the result, which wouldn't be any faster than having userspace do the counting if it really needs to. – TooTea Feb 02 '23 at 12:21
  • 2
    One can still argue that the data is stored _elsewhere in kernel memory_; sure, it's not anywhere procfs-specific until the `read()` call is invoked (which certainly does make it true that it's unavailable for `stat()` calls to take the size of unless we did the same work needed to generate data to back a `read()` every time size was requested), but that doesn't mean it's not _in memory_, just that it's not _cheaply accessible to the filesystem layer_. – Charles Duffy Feb 02 '23 at 19:23
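
The on-the-fly generation described in these comments can be observed from the shell. This is a minimal sketch, assuming Linux with /proc mounted; it relies on /proc/self, which resolves to the directory of whichever process opens it, and the variable names are only illustrative:

```shell
#!/bin/sh
# The same path yields different bytes for different readers: the data is
# generated for the process that opened the file, not stored anywhere as a file.
seen_by_cat=$(cat /proc/self/cmdline | tr '\0' ' ')
seen_by_head=$(head -c 64 /proc/self/cmdline | tr '\0' ' ')

printf 'cat  read: %s\n' "$seen_by_cat"
printf 'head read: %s\n' "$seen_by_head"
```

Each command reads back its own argument list, so there is no single length that stat() could have reported in advance for that path.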