
On my ext4 filesystem partition I can run the following code:

fs="/mnt/ext4"

#create sparse 100M file on ${fs}
dd if=/dev/zero \
   of=${fs}/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2> /dev/null

#show its actual used size before
echo "Before:"
ls ${fs}/sparse100M -s

#set up the sparse file as a loopback device and run md5sum on the loop device
losetup /dev/loop0 ${fs}/sparse100M 
md5sum /dev/loop0

#show its actual used size afterwards
echo "After:"
ls ${fs}/sparse100M -s

#release loopback and remove file
losetup -d /dev/loop0
rm ${fs}/sparse100M

which yields

Before:
0 sparse100M
2f282b84e7e608d5852449ed940bfc51  /dev/loop0
After:
0 sparse100M

Doing the very same thing on tmpfs, i.e. with:

fs="/tmp"

yields

Before:
0 /tmp/sparse100M
2f282b84e7e608d5852449ed940bfc51  /dev/loop0
After:
102400 /tmp/sparse100M

which means that something I expected to merely read the data caused the sparse file to "blow up like a balloon".

I suspect this is because of the less complete sparse-file support in tmpfs, and in particular because of the missing FIEMAP ioctl, but I am not sure what actually causes this behaviour. Can you tell me?
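For reference, apparent vs. allocated size can also be checked without dd or a loop device; a minimal sketch assuming GNU coreutils truncate and stat (the path is just a placeholder):

```shell
# create a sparse file via truncate(1): apparent size 100 MiB, no blocks allocated
f=/tmp/sparse-demo
truncate -s 100M "$f"

# %s = apparent size in bytes, %b = number of 512-byte blocks actually allocated
stat -c 'apparent: %s bytes, allocated: %b blocks' "$f"

rm "$f"
```

On both ext4 and tmpfs this reports 0 allocated blocks until something writes to the file.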

humanityANDpeace
  • hum. There is a shared (copy-on-write) zero page, that could be used when a sparse page needed to be mmap()ed, for example. So I'm not sure why any type of read from a sparse tmpfs file would require allocating real memory. https://lwn.net/Articles/517465/ . I wondered if this was some side effect of the conversion of loop to use direct io, but it seems there should not be any difference when you try to use the new type of loop on tmpfs. https://www.spinics.net/lists/linux-fsdevel/msg60337.html – sourcejedi Sep 15 '18 at 19:26
  • maybe this might get an answer if it were on SO ? just a thought –  Sep 15 '18 at 22:22
  • The output of /tmp has different files Before/After. Is that a typo? Before: 0 /tmp/sparse100 (without M at the end) After: 102400 /tmp/sparse100M (with the trailing M). – YoMismo Sep 19 '18 at 13:16
  • @YoMismo, yes was a only a little typo – humanityANDpeace Sep 21 '18 at 08:13

1 Answer


First off, you're not alone in puzzling over these sorts of issues.

This is not limited to tmpfs; it has also been cited as a concern with NFSv4.

If an application reads 'holes' in a sparse file, the file system converts empty blocks into "real" blocks filled with zeros, and returns them to the application.

When md5sum scans a file, it explicitly chooses to read it sequentially, which makes a lot of sense given what md5sum is trying to do.

As there are fundamentally "holes" in the file, this sequential reading is going to (in some situations) cause a copy-on-write-like operation that fills out the file. This then gets into the deeper issue of whether fallocate(), as implemented in the filesystem, supports FALLOC_FL_PUNCH_HOLE.
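FALLOC_FL_PUNCH_HOLE can be exercised straight from the shell via util-linux's fallocate(1); a minimal sketch (the file name is made up, and exact block counts depend on the filesystem — it needs a filesystem with punch-hole support, such as ext4, XFS, Btrfs, or tmpfs):

```shell
# write 1 MiB of literal zeros -- fully allocated, not sparse
f=/tmp/punch-demo
dd if=/dev/zero of="$f" bs=1M count=1 2>/dev/null
stat -c 'before punch: %b blocks' "$f"

# punch the first 512 KiB back out (--punch-hole implies --keep-size,
# so the apparent file size stays 1 MiB)
fallocate --punch-hole --offset 0 --length $((512*1024)) "$f"
stat -c 'after punch:  %b blocks' "$f"

rm "$f"
```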

Fortunately, not only does tmpfs support this, but there is also a mechanism to "dig" the holes back out.

Using the CLI utility fallocate, we can successfully detect and re-dig these holes.

As per man 1 fallocate:

-d, --dig-holes
      Detect and dig holes.  This makes the file sparse in-place, without
      using extra disk space.  The minimum size of the hole depends on
      filesystem I/O  block size (usually 4096 bytes).  Also, when using
      this option, --keep-size is implied.  If no range is specified by
      --offset and --length, then the entire file is analyzed for holes.

      You can think of this option as doing a "cp --sparse" and then
      renaming the destination file to the original, without the need for
      extra disk space.

      See --punch-hole for a list of supported filesystems.
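The man page's "cp --sparse" analogy can be spelled out as a shell sketch (file names are placeholders; block counts vary by filesystem):

```shell
# a 1 MiB file full of literal zeros, fully allocated
f=/tmp/dig-demo
dd if=/dev/zero of="$f" bs=4k count=256 2>/dev/null
stat -c 'before: %b blocks' "$f"

# what fallocate -d is roughly equivalent to: a sparse copy, renamed back
cp --sparse=always "$f" "$f.tmp" && mv "$f.tmp" "$f"
stat -c 'after:  %b blocks' "$f"

rm "$f"
```

The difference is that fallocate -d does this in place, without needing space for the temporary copy.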

fallocate operates at the file level, though, and when you run md5sum against a block device (requesting sequential reads) you hit exactly the gap that the fallocate() syscall cannot cover. Using your example, we can see this in action:

$ fs=$(mktemp -d)
$ echo ${fs}
/tmp/tmp.ONTGAS8L06
$ dd if=/dev/zero of=${fs}/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls ${fs}/sparse100M -s)"
Before: 0 /tmp/tmp.ONTGAS8L06/sparse100M
$ sudo losetup /dev/loop0 ${fs}/sparse100M
$ sudo md5sum /dev/loop0
2f282b84e7e608d5852449ed940bfc51  /dev/loop0
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 102400 /tmp/tmp.ONTGAS8L06/sparse100M
$ fallocate -d ${fs}/sparse100M
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 0 /tmp/tmp.ONTGAS8L06/sparse100M

Now... that answers your basic question. My general motto is "get weird" so I dug in further...

$ fs=$(mktemp -d)
$ echo ${fs}
/tmp/tmp.ZcAxvW32GY
$ dd if=/dev/zero of=${fs}/sparse100M conv=sparse seek=$((100*2*1024-1)) count=1 2>/dev/null
$ echo "Before:" "$(ls ${fs}/sparse100M -s)"
Before: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo losetup /dev/loop0 ${fs}/sparse100M
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum ${fs}/sparse100M
2f282b84e7e608d5852449ed940bfc51  /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 1036 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d ${fs}/sparse100M
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum ${fs}/sparse100M
2f282b84e7e608d5852449ed940bfc51  /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 520 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d ${fs}/sparse100M
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 516 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d ${fs}/sparse100M
$ sudo md5sum ${fs}/sparse100M
2f282b84e7e608d5852449ed940bfc51  /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 512 /tmp/tmp.ZcAxvW32GY/sparse100M
$ fallocate -d ${fs}/sparse100M
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M
$ sudo md5sum ${fs}/sparse100M
2f282b84e7e608d5852449ed940bfc51  /tmp/tmp.ZcAxvW32GY/sparse100M
$ echo "After:" "$(ls ${fs}/sparse100M -s)"
After: 0 /tmp/tmp.ZcAxvW32GY/sparse100M

You can see that merely performing the losetup changes the allocated size of the sparse file. So this becomes an interesting combination of where tmpfs, the hole-punching mechanism, fallocate, and block devices intersect.

Brian Redbeard
  • Thanks for your answer. I'm aware `tmpfs` supports sparse files and punch_hole. That's what makes it so confusing - `tmpfs` _supports_ this, so why go and fill the sparse holes when reading through a loop device? `losetup` doesn't change the file size, but it creates a block device, which on most systems is then scanned for content like: is there a partition table? is there a filesystem with UUID? should I create a /dev/disk/by-uuid/ symlink then? And those reads already cause parts of the sparse file to be allocated, because for some *mysterious reason*, tmpfs fills holes on (some) reads. – frostschutz Sep 20 '18 at 18:12
  • Can you clarify "_sequential reading is going to (in some situations) cause a copy on write like operation_", please? I'm curious to understand how a read operation would trigger a copy on write action. Thanks! – roaima Sep 20 '18 at 18:12
  • This is odd. On my system I followed the same steps, though manually and not in a script. First I did a 100M file just like the OP. Then I repeated the steps with only a 10MB file. First result : ls -s sparse100M was 102400. But ls -s on the 10MB file was only 328 blocks. ?? – Patrick Taylor Sep 21 '18 at 05:16
  • @PatrickTaylor ~328K is about what's used after the UUID scanners came by, but you didn't cat / md5sum the loop device for a full read. – frostschutz Sep 21 '18 at 10:53
  • Oh but I did do md5sum on both. Also this was on a virtual machine; don't know if that makes any difference. /tmp was on ext4 so i used /dev/shm. – Patrick Taylor Sep 21 '18 at 14:27
  • Looks like `losetup`'s effects on the size happens asynchronously (ie. after the command returned); try running this several times in a row: `old=1; new=0; sudo losetup /dev/loop0 "$fs/sparse100M"; while : ; do old="$new"; fallocate -d "${fs}/sparse100M"; new="$(ls -s "${fs}/sparse100M" | cut -f1 -d' ')" ; echo -n "$new " ; test "$old" == "$new" && break; done; echo` Sample output: normal: `528 520 516 512 0 0` or weird1: `520 516 512 0 652 520 516 512 0 0`, weird2: `520 516 512 176 520 516 512 0 0`. So you see, async losetup effects: probably done in/by the kernel? but what do I know... –  Sep 21 '18 at 16:31
  • I didn't post this last night because i'm still correlating, but.. When doing an `strace` on the md5sum, I see it's reading the data in 32768 byte chunks. In the process for 100M it's retrieving 3200 chunks. For 10M it's retrieving 320 chunks. (3200 chunks * 32768 bytes) / 1024 = 102400. It seems that it's a function of page cache mapping, but i'm not 100% certain. – Brian Redbeard Sep 21 '18 at 19:20
  • I was digging through the source for the loop kernel module (in `loop.c`) and saw that there are [two relevant functions](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/loop.c?h=v4.19-rc4#n340): `lo_read_simple` & `lo_read_transfer`. There are some minor differences in how they do low level memory allocation... `lo_read_transfer` is actually requesting non-blocking io from `slab.h` (`GFP_NOIO`) while performing a `alloc_page()` call. `lo_read_simple()` on the other hand is not performing `alloc_page()`. – Brian Redbeard Sep 21 '18 at 19:23
  • I've done some more research and now I suspect that this is unintentional behavior - a kernel bug. Your answer isn't quite what I hoped for but even so - awarding you the bounty for sheer effort. Thanks! – frostschutz Sep 22 '18 at 19:33