Kerberos NFS encryption stops working after clock moved backwards

Question

I have a system that uses krb5p security for its NFS mounts. It all seems to work well except when the time is moved backwards (a few hours or so; I'm not sure of the exact threshold). When the clock is moved backwards, the share reports "No such file or directory" when I try to mount it (using autofs). My assumption is that one of the encryption checks somewhere is detecting the backwards time and aborting, which would be sane default behavior.

I have a test case where client/server are the same machine, as well as some using separate computers, same results. All times are NTP synced. In this test case, the NTP server that everyone uses is being moved backwards.

I realize this is a bit of a strange test case, but this is a completely offline environment and the software is intended to handle any sort of strange input, like an operator not configuring NTP correctly (a typo or something), and this is just testing that edge case.

As soon as I move the clocks back to the present or future, the shares start working again after a kdestroy/kinit cycle.

I have tried kdestroy with a new kinit, clearing the sssd cache, restarting the kadmin/krb5-server services, and full system reboots (which blow away /tmp), but nothing seems to take.

I have full control on the system, so I could "reinstall" something or wipe any configuration/data file, I just can't find where anything is stored to actually make kerberos or NFS forget it ever had a future time (assuming my theory is even right).

System: RHEL7.9

Thanks so much.

Update: Just because we got a little sidetracked below, I want to clarify that I understand time moving backwards is a strange case. While all suggestions are appreciated for discussion, this question centers on how to fix the SW to handle the case. I realize there are potentially ways to solve it via HW, but I'm more curious as to why specifically it is failing and how to recover, not how to prevent it from failing in the first place, if that makes sense. While there are some HW solutions that limit the potential of this edge case, there are none that bring it down to 0, at least with my current system limitations, and the HW options are sadly not viable. The SW options may also not be viable, but I'm just trying to understand them. One solution that seems to work so far is to re-image the machine, but I was hoping there was something less draconian.

NTP should always give UTC time, so I'm not sure as to why you are saying time is changing. Kerberos is also very time sensitive, so if you are having large time swings as NTP corrects, then it's little wonder it's failing. — Bib, Mar 21 '22 at 22:35
Thanks for the comment. As I mentioned, this is an offline environment and there is no guarantee the NTP source will always move forward. I know Kerberos is time sensitive, but where does it store its time information so it can be reset? — vtwaldo21, Mar 22 '22 at 11:48
What happens if you remove `/var/tmp/krb5*` on both hosts? Yes, specifically /var/tmp. — u1686_grawity, Mar 28 '22 at 10:14
/var/tmp is basically empty and there are no krb* files in that or any subfolder under /var/tmp. I did poke around and found a folder /var/cache/krb5rcache, but it is empty. /var/kerberos has the configuration data, all file timestamps match that of system install and are not updated in real time as best i can tell (thanks for the idea) — vtwaldo21, Mar 29 '22 at 11:50
"_there is no guarantee the NTP source will always move forward_" - if you use it properly (even offline) there is that guarantee — roaima, Aug 31 '23 at 07:17

score 0 · Answer 1 · answered Mar 22 '22 at 13:04

It does not store the time; it makes a vDSO call to read the time from the OS kernel. If the machines are not all in sync, then it will fail. If it's offline, then look at an RPi with a GPS HAT and use that as a time source.

answered Mar 22 '22 at 13:04

Bib


Sadly, another time source is not an option in this environment. As for the syscall, okay, that sounds promising. In this scenario ALL times are in sync: all workstations, all RTCs and local clocks. But if it is getting the time from the OS/kernel, is it then comparing it to another time somewhere? Otherwise, why get the time? It must somewhere do an "if TimeNow < TimeBefore, fail", so it seems it must store TimeBefore somewhere. Or maybe that happens somewhere upstream of Kerberos. I can't really figure out how to extract from the logs why it is failing, but it is consistent-ish. – vtwaldo21 Mar 22 '22 at 13:20
If they all get time synced from a source, and that source rapidly changes time by a large amount, the others will not follow. If anything they will sync slowly. If the gap is too big, they will not sync at all. When that happens KRB will fail. I suggest you buy an atomic clock and use that. It will only cost you a good few thousand $$$. Your scenario means it is unlikely to ever work. – Bib Mar 22 '22 at 15:09
We configure NTP to properly handle this case, so all the clocks sync, quite quickly in fact (a couple of minutes). There has to be some configuration data saved that it is comparing against; I just can't find it. LDAP also has similar issues with clocks moving backwards, but it is trivial to clear an LDAP database. – vtwaldo21 Mar 23 '22 at 13:50
It is not how quickly they sync, it's how wide the difference is. You have already stated it is around 2 hours. What you are doing is the wrong way to solve it. Your choices are to allow internet access for NTP, set up your own proper NTP server (i.e. GPS), or not allow any time swings. IT IS NOT GOING TO WORK WITH 2 HOUR TIME SWINGS! – Bib Mar 23 '22 at 15:58
Thanks for the feedback, but I'm not sure what difference you are talking about. The clocks are all synced to the exact same time. That difference is 0. I said the NTP source moved backwards 2 hours but all the systems adopted that new time. I think the question remains is why won't it work? What is preventing it? Where is the file it uses to check the old time? Where is the line of code etc that does this check? I can recompile the source or delete/edit any file as needed. – vtwaldo21 Mar 23 '22 at 16:30
I have mentioned a few times that I get the test case is a little rare, but there are 10s if not 100s of pretty common and valid cases where the clock may move backwards: GPS spoofing, operator error, the clock being set wrong before it had time to talk to the NTP server, a bad BIOS battery, improper clock frequency, edge-case testing of what happens if the clock moves backwards. Those are just the ones I've seen recently in person; I'm sure there are others. – vtwaldo21 Mar 23 '22 at 16:32
IT IS NOT ZERO. THE TIME CHANGE IS NOT INSTANT. IT IS LARGE, BEYOND WHAT KRB REASONABLY EXPECTS. You're not taking advice or accepting the problem you are facing or how to rectify it, you're on your own. – Bib Mar 23 '22 at 16:45
Sorry, I wasn't clear. This is all after a reboot. So the NTP server provides a new time, all systems update (hwclocks), then reboot. In theory nothing should know anything about the prior time and NTP is stable, but I'm not seeing that. As for the advice, I am taking it into consideration, but as I stated, none of it has been practical, sadly. This is an offline environment: no GPS signals, no internet, no signals/wires of any kind (save power) leaving the secure enclosure. But even with internet and a stratum 3 NTP server available, I can readily reproduce the case, so that wouldn't actually help. – vtwaldo21 Mar 23 '22 at 16:56

roaima · Answer 2 · 2023-08-31T07:44:49.607

You've got a false edge case. NTP is designed not to jump clocks backwards other than in the instance when it is first deployed. And even then it's only going to jump if the master clock is so wrong it can't otherwise be fixed.

Jumping time backwards would destroy the basis of so many applications that it's not really an option. Databases, Kerberos, even program development are just three that come to mind.

Rather than jumping forwards or backwards, NTP prefers to slew the clock. So in your case, instead of jumping backwards it tried to slow down the clock. NTP implementations such as ntpd also condition the kernel's interpretation of the local hardware clock: they learn whether it runs fast or slow and tell the kernel by how much to offset the clock value, so that time stays reasonably accurate in the absence of an external time source.
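To put rough numbers on slewing versus stepping, here is a minimal sketch. It assumes the commonly documented ntpd defaults, which are not stated in this thread: a step threshold of 128 ms and a maximum slew rate of 500 ppm. The function names are hypothetical, and other implementations (chrony, SNTP clients) behave differently.

```python
# Assumed ntpd defaults (not taken from this thread): offsets larger than
# 128 ms are normally stepped, and slewing is capped at 500 ppm.
STEP_THRESHOLD_S = 0.128
MAX_SLEW_RATE = 500e-6  # seconds of correction per second of real time


def correction_mode(offset_s: float) -> str:
    """Return 'slew' for small offsets and 'step' for large ones."""
    return "slew" if abs(offset_s) <= STEP_THRESHOLD_S else "step"


def slew_duration_s(offset_s: float) -> float:
    """Seconds needed to remove offset_s by slewing at the maximum rate."""
    return abs(offset_s) / MAX_SLEW_RATE


if __name__ == "__main__":
    two_hours_back = -2 * 3600.0
    print(correction_mode(two_hours_back))
    print(slew_duration_s(two_hours_back) / 86400, "days")
```

Under those assumptions, removing a two-hour offset purely by slewing would take roughly 167 days, which is why an offset of that size ends up being stepped or corrected manually.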

Furthermore, if you jump the time by more than a few (five) minutes you will not bring your dependents with you; you will have to reset their clocks manually.

Essentially, if you jump the time you should expect Kerberos to stop working; any other result would be wrong.
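As an illustration of the five-minute tolerance, the sketch below checks a ticket's validity window against the local clock. It assumes the usual Kerberos default skew of 300 seconds; the function name and the timestamps are hypothetical, and this is not the actual MIT krb5 code path.

```python
CLOCKSKEW_S = 300  # assumed default tolerance (five minutes)


def ticket_status(now: float, starttime: float, endtime: float,
                  skew: float = CLOCKSKEW_S) -> str:
    """Classify a ticket's validity window against the local clock."""
    if starttime - now > skew:
        return "not yet valid"   # ticket appears to come from the future
    if now - endtime > skew:
        return "expired"
    return "valid"


if __name__ == "__main__":
    issued = 1_000_000.0        # hypothetical issue time, in seconds
    lifetime = 10 * 3600.0      # hypothetical 10-hour ticket lifetime
    # The same ticket, checked after the clock has been set back two hours.
    print(ticket_status(issued - 2 * 3600, issued, issued + lifetime))
```

A ticket issued before the clock went back looks as if it was issued in the future, so a check of this kind rejects it until the clock catches up or the ticket is reissued.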

edited Aug 31 '23 at 07:44

answered Aug 28 '23 at 07:19

roaima


I'd treat this answer with a dose of skepticism, not least because it doesn't disambiguate between the protocol and one specific client implementation. It may need further clarification or citations. Clock jumps can and do happen for systemic reasons, not just at system boot. That's why software should use the monotonic clock for short time deltas and never the system clock. I've worked with systems that had this problem, and breakages were fewer than you might expect. The preference to slew is indeed true, though jumps are much more frequent with SNTP. Few readers will know the difference. – Philip Couling Aug 28 '23 at 08:17
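The point about monotonic clocks can be sketched in a few lines of Python. The helper name is hypothetical; `time.monotonic()` is the standard-library call, and it is unaffected by changes to the system clock.

```python
import time


def elapsed_since(start_monotonic: float) -> float:
    """Seconds elapsed since a value returned by time.monotonic()."""
    return time.monotonic() - start_monotonic


if __name__ == "__main__":
    start = time.monotonic()   # not affected by clock steps
    wall_start = time.time()   # wall clock, can jump backwards
    time.sleep(0.05)
    print("monotonic delta:", elapsed_since(start))
    print("wall-clock delta:", time.time() - wall_start)
```

If the system clock is stepped backwards between the two readings, the wall-clock delta can come out negative, while the monotonic delta cannot.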
@PhilipCouling I'll rewrite and clarify – roaima Aug 28 '23 at 09:01
Thanks for the feedback! I understand the comment but that still implies that time is saved somewhere. I'm trying to find where it is stored. Stopping the clocks from moving (jumping) backwards is really a non-starter. That is the test case. We had to disable kerberos due to this bug, but we are moving to RHEL9 and will revisit to see if the condition has improved since RHEL8. This was not a bug in RHEL7 IIRC. – vtwaldo21 Aug 28 '23 at 18:51
@vtwaldo21, time is managed via the RTC, an offset from that to true time at an instant, and a slew factor that allows the kernel to calculate true time based on the real speed of the RTC – roaima Aug 28 '23 at 19:38
Thanks, but the question I was trying to ask is how to clear it. The clock having moved backwards is the starting condition. How do you clear the database/cache so the system can work? Completely formatting the drive and reinstalling the OS is one way, but there has to be another. There must be some registry/cache/database that is corrupt and needs to be cleared, or some package to uninstall and reinstall. Note, I'm not interested in preserving any data; I just don't want to wait the full hour to reinstall the entire OS. – vtwaldo21 Aug 30 '23 at 23:53
@vtwaldo21 my experience with Kerberos is in an AD environment (mixed Windows and Linux.) Resetting the clock not only on the master but also on the subsidiaries, and then reissuing all the Kerberos trusts got it working again. Can I assume you've tried this already in your environment? – roaima Aug 31 '23 at 07:51
@PhilipCouling thank you for your feedback. I'm not looking for an argument, just a better answer. I've used NTP (ntpd) in big corporate (60000), middling (300-1000), small (<20) and home environments, with GPS sources, Internet sources, etc. IME moving the clock backwards (or a long way forwards) is a mess every time. Databases crash, so it's basically a no-no – roaima Aug 31 '23 at 09:18
Yeah, I don't deny that some classes of software might have problems. I'm not sure if you are referring to one form of RDBMS or many different types of database. Sadly, it's not a "false edge case" in all environments. I wasted 20 - 30 FTE days across three months getting to the bottom of this and fixing our software to survive in IoT on networks outside our control https://unix.stackexchange.com/a/549873/20140 – Philip Couling Aug 31 '23 at 09:22
