Is it "broken" to replace an existing file without fsync()?

Question

In Linux's mount(2) man page, I noticed the following excerpt:

Many broken applications don't use fsync() when replacing existing files via patterns such as
fd = open("foo.new")/write(fd,...)/close(fd)/ rename("foo.new", "foo")
or worse yet
fd = open("foo", O_TRUNC)/write(fd,...)/close(fd).
If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and force that any delayed allocation blocks are allocated such that at the next journal commit, in the default data=ordered mode, the data blocks of the new file are forced to disk before the rename() operation is committed. This provides roughly the same level of guarantees as ext3, and avoids the "zero-length" problem that can happen when a system crashes before the delayed allocation blocks are forced to disk.

In what sense is this code "broken"? Are they saying that this code is illegal or not standard-conformant (POSIX, etc)?

Obviously fsync() might be a good idea for people who are worried about what would happen if the system crashed. But assuming a system that doesn't crash, don't both versions of the sample code, without fsync(), do exactly the right thing?

Files are hard. http://danluu.com/file-consistency/ – thrig Jul 22 '16 at 18:00

score 5 · Accepted Answer · answered Jul 22 '16 at 17:45

rename is expected to be atomic: it either completes fully or not at all. Renaming A to take the place of B is supposed to leave you with either both A and B intact (it didn't happen at all); or with only A's contents under the name B (it completed fully).

As long as the system doesn't crash, that'll happen regardless of fsync (etc.) calls.

If the system does crash, however, it can turn out that the rename itself hit disk (and thus completed) while the data did not. Remember though that names != files. Files/inodes can have multiple names. Rename changes the names, not the underlying file/data.
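To make the "names != files" point concrete, here is a small sketch in C. The helper name rename_keeps_inode and the paths passed to it are made up for illustration. It creates an empty file, renames it, and checks that the inode number is unchanged, i.e. that only the name moved.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create an empty file at `from`, rename it to `to`,
 * and report whether the inode survived the rename.
 * Returns 1 if device and inode number are unchanged, 0 if not, -1 on error. */
int rename_keeps_inode(const char *from, const char *to)
{
    struct stat before, after;

    int fd = open(from, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    if (stat(from, &before) < 0)
        return -1;
    if (rename(from, to) < 0)      /* changes the directory entry only */
        return -1;
    if (stat(to, &after) < 0)
        return -1;

    return before.st_dev == after.st_dev && before.st_ino == after.st_ino;
}
```

Because the rename only touches directory entries, the filesystem can commit it independently of the file's data blocks unless something forces an ordering.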

So you can have the state that your program wrote A, renamed it to replace B, and then the power went out. Turns out the filesystem wrote the rename to disk, but not the actual data in A. It's not required to without fsync. You thus wind up with a zero-length B, or a zero-filled B.

An app does write-temp-file + rename instead of just overwriting the file because it wants crash safety. The user won't be too mad if a half-written temporary copy of his important document is left lying around, next to the unmodified good copy. But if no good copies are left, the user will not be pleased.
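For reference, a minimal sketch of the same pattern with the fsync() added, assuming a POSIX system. The function name replace_file_durably and its parameters are hypothetical, and the directory fsync at the end is an extra step beyond what the man page excerpt mentions; it is there to make the rename itself durable as well.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `data`.
 * `tmp` must be a temporary name in the same directory as `path`,
 * and `dir` is that directory. Returns 0 on success, -1 on error. */
int replace_file_durably(const char *dir, const char *tmp, const char *path,
                         const void *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be interrupted or write fewer bytes than asked, so loop. */
    const char *p = data;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /* Force the new data to disk BEFORE the rename can be committed. */
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    /* Atomically point the old name at the new file. */
    if (rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }

    /* Sync the containing directory so the rename itself is durable. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

With this ordering, a crash should leave either the old contents or the complete new contents under the final name, rather than a zero-length file.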

score 2 · Answer 2 · answered Jul 22 '16 at 16:57

The code is legal but "naive". The problem is exactly what happens during a crash.

There is a risk that the new data won't have had space allocated to it before the directory update is committed, which can lead to data loss.

A good app will call fflush() (when writing through stdio) and fsync() to ensure the data is flushed to disk. The auto_da_alloc routines are an attempt to do this heuristically in the kernel.
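A rough sketch of how the two calls fit together when the file is written through stdio, assuming a POSIX system. The helper name replace_file_stdio is made up for illustration. fflush() only moves data from the stdio buffer into the kernel; fsync() on the underlying descriptor is what asks the kernel to write it to disk.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `text` to `tmp` via stdio, flush and sync it,
 * then rename it over `path`. Returns 0 on success, -1 on error. */
int replace_file_stdio(const char *tmp, const char *path, const char *text)
{
    FILE *fp = fopen(tmp, "w");
    if (fp == NULL)
        return -1;

    if (fputs(text, fp) == EOF ||
        fflush(fp) != 0 ||            /* stdio buffer -> kernel */
        fsync(fileno(fp)) != 0) {     /* kernel page cache -> disk */
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(tmp);
        return -1;
    }

    /* Only now is it safe to replace the old file. */
    return rename(tmp, path);
}
```

This sketch does not sync the containing directory, so after a crash the rename itself may or may not have taken effect; either way the name should refer to a complete file.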

https://bugzilla.kernel.org/show_bug.cgi?id=103111#c10 explains some of the "gotchas".

Stephen Harris


BTW: this is why you should be careful when using GNU tar. – schily Jul 22 '16 at 17:07
Is there a reason why this is a *particularly* critical place to sync the data? Why wouldn't the same logic lead to the extreme conclusion that you should call `fsync()` after every single operation that touches the filesystem? (By the way, `fflush` is for stdio only, right? Not relevant in this particular context.) – Nate Eldredge Jul 22 '16 at 17:33

ctrl-alt-delor · Answer 3 · 2016-07-22T16:58:38.623

It is perfectly legal, and it will work, but it will not do what you want.

The 2nd is obvious. It destroys the original before saving the new.

The 1st is less obvious. It looks as if, after a system failure (e.g. a power outage), you would either have done nothing (not started), have two files (old and new), or have succeeded. However, this is not the case unless you do what the man page says and call fsync().

They are using the word "broken" to get your attention. It is broken, as your users will lose data. Maybe not today. Maybe not tomorrow, but soon and for the rest of its life.

This is what is called an intermittent bug: a bug that could sit around for years before exhibiting symptoms.

If you do not care about the data integrity of your users, then why use the 1st pattern at all?
