8

I have a (physical) box running a stripped down Ubuntu; every now and then (6 times in 3 months), the clock jumps backwards by exactly 300 seconds (+- 0.01 seconds; always exactly 300 seconds). It happens from one minute to the next (I have an external machine polling it once per minute).

The box is running 2.6.26-generic (custom compiled kernel), Ubuntu 9.04 (I know, I'm trying to get it updated, but it's semi-embedded). There is nothing in the logs which indicates what happened, and I have a large selection of pool.ntp.org ntp servers, which correct the problem after a while.

Does anyone know what might cause this?

Additional 1:

I also have a number of other boxes running the same kernel (binary identical), and minor variations of the same software, which do not have this problem. I have also swapped out the hardware.

Additional 2 (summary of my individual comments):

  • I know 9.04 is out of date, I agree it should be updated, and this decision is out of my control. Because management.
  • I have tried a large number of ntp servers, and a small number. It still happens in both cases; if I have a large number of ntp servers, then it fixes itself more quickly.
  • I have swapped out the hardware
  • I am using the same kernel/operating system on another box (with identical hardware), which is not showing the issue.
  • Rebooting has not helped. (this problem has been ongoing for about 6 months)
  • Uptime is about 3 months. The box is "always on", running a PBX (asterisk).
  • Right now, the hwclock matches the software clock exactly - 0.000000 seconds
  • I have not been able to find any cron jobs that read the hardware clock.
  • There is no load-related pattern (though the load is quite low anyway).
  • It happens during the day and night.
  • It does not happen at regular intervals. Of the ones in the last 3 months, half have happened in the last 9 days.
  • This is not "drift" - 99% of the time, it is within a tiny fraction of a second, then from one minute to the next, it jumps EXACTLY 300 seconds, backwards. So, one minute it might say it's 3:07:03, matching my other computer to within 1 microsecond, 60 seconds later, it says 3:04:03.
  • I can find nothing in the logs.
AMADANON Inc.
  • Ubuntu 9.04 is beyond end of life, and if you are running a custom kernel in addition, support is going to be limited at best. A semi-embedded system adds another layer of complexity. – Panther Jan 07 '14 at 22:01
  • hey, if it was an easy question, I would have fixed it by now :) – AMADANON Inc. Jan 07 '14 at 22:37
  • Try reducing your number of ntp servers to say 3, use ones geographically close to you with good connectivity. – Panther Jan 07 '14 at 22:39
  • Good idea - I've tried that. The only difference was that, with fewer ntp sources, ntpd did fewer queries, and it took longer to come back to normal. – AMADANON Inc. Jan 07 '14 at 22:43
  • Well, if it is not ntp, that leaves your custom kernel or hardware by process of elimination. – Panther Jan 07 '14 at 22:44
  • As mentioned in my "additional" section, the same custom kernel is used on other boxes; the hardware has been swapped out... – AMADANON Inc. Jan 08 '14 at 01:13
  • Do `hwclock -r` - that will tell you if the hardware clock is off. If it's off by 5 minutes exactly, I'd suspect that something runs `hwclock -s` every once in a while, possibly some cron job. –  Jan 08 '14 at 02:20
  • I think your best bet is to find an event that correlates with these jumps. 6 times in 3 months = 2 times per month. Were they evenly spread out over the 106 days of uptime you reported in [another comment](http://unix.stackexchange.com/questions/108283/what-could-cause-the-clock-to-jump-by-5-minutes#comment166647_108284)? Did they occur at specific times of the month? Specific times of the day? Any unusual activity in the NTP logs from these times? Could the external machine polling them be somehow responsible? – Joseph R. Jan 08 '14 at 02:45
  • The jumps are very irregular. I've had 3 this month (Jan 2014). Two were on the same day. They are spread over the month, the day, and the week (to within detectable levels, for a sample size of 6). One happened while the office was shut (it is a phone system, running asterisk). Right now, the offset of the hardware clock is 0.000000 seconds. There are no cron jobs that set the clock to match the hardware clock that I can find. – AMADANON Inc. Jan 09 '14 at 00:46
  • Random suggestion for further investigation: Find out which of the clocks are being changed. E.g., CLOCK_REALTIME is, but is CLOCK_MONOTONIC as well? What about uptime? This will tell you if something is setting the clock, or if you're hitting a timekeeping bug. I would guess something setting the clock. – derobert Jan 11 '14 at 06:24
  • If it turns out to be something setting the clock, then there are only a few syscalls to do that, and you could add a few `printk`s to find what it is. – derobert Jan 11 '14 at 06:25
  • If you really want to solve this, you probably should turn off NTP for now to reduce the number of variables. Do you know if it will do it until infinity if you don't reset it? – Angelo Jan 22 '14 at 02:38
  • It could be that the timer interrupt, for some reason, went away or was not processed, for exactly five minutes. – Kaz Jan 30 '14 at 04:16
  • That would take 5 minutes. Since I poll the box every minute, I definitely know that this is not the case. The clock jumps 5 minutes back, in the time of 1 minute. – AMADANON Inc. Feb 02 '14 at 00:48
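
derobert's suggestion above (check whether only the wall clock moves, or uptime as well) could be tried with something like the following. This is only a rough sketch, not something from the thread: it assumes a Linux `/proc/uptime` that keeps counting independently of wall-clock steps, and the function names `clock_skew`, `read_uptime` and `monitor`, the 10-second interval and the 2-second tolerance are all made up for illustration.

```shell
#!/bin/bash
# Sketch: compare the wall clock (date +%s) with the uptime counter.
# Assumption: /proc/uptime is not affected when something steps the wall clock.

# Print how far the wall clock moved relative to uptime between two samples.
# Negative means the wall clock fell behind. Args: old_wall old_up new_wall new_up
clock_skew() {
  echo $(( ($3 - $1) - ($4 - $2) ))
}

# Whole seconds of uptime, taken from the first field of /proc/uptime.
read_uptime() {
  cut -d. -f1 /proc/uptime
}

# Hypothetical monitoring loop; prints a line whenever the two clocks disagree.
monitor() {
  local old_wall old_up new_wall new_up skew
  old_wall=$(date +%s); old_up=$(read_uptime)
  while sleep 10; do
    new_wall=$(date +%s); new_up=$(read_uptime)
    skew=$(clock_skew "$old_wall" "$old_up" "$new_wall" "$new_up")
    # 2 seconds of slack for rounding and normal ntpd adjustments (arbitrary).
    if [ "$skew" -lt -2 ] || [ "$skew" -gt 2 ]; then
      echo "$(date): wall clock moved ${skew}s relative to uptime"
    fi
    old_wall=$new_wall; old_up=$new_up
  done
}

# Only start the loop when asked to, so the functions can be sourced on their own.
if [ "${1:-}" = "--run" ]; then monitor; fi
```

If this logs a skew of about -300 at the moment the external poller sees the jump, the wall clock was stepped while uptime kept running, which points at something setting the clock. If it logs nothing while the poller still sees the jump, both clocks moved together, which looks more like a timekeeping problem in the kernel or hardware.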

3 Answers

1

This sounds like a failing Real Time Clock (RTC). If this is spare hardware you can confirm the issue by running a different OS, such as booting a live linux CD or PXE booting, and see if you can replicate the failure. If the exact same time skew occurs on another OS, then you have confirmed that the issue is a hardware failure.

Assuming it is the RTC, you can try the following solutions in order of severity.

  • Replace the CMOS battery. You can try to confirm if it is a failed battery by testing the voltage of your old battery with a multimeter.
  • Change RTCs. If you are lucky and have a fancy motherboard, it might have two RTCs: a high-precision clock which is used by default, and a standard RTC. Check the BIOS/EFI settings and see if you can change to the alternate RTC to avoid using the faulty one.
  • Try to replace the RTC. Depending on the age of your motherboard, your RTC is probably either a metal can or chip on the board. You can try to replace this component yourself if you have some electronics skills.
  • Replace the motherboard, since either the RTC or some of the electrical components or leads that interface with the RTC are failing.
Michael Yasumoto
1

You could run a script on the box which keeps track of the running processes and at the same time monitors the clock. If the clock jumps back suddenly, it logs the list of processes active at that time. Maybe that gives a hint which process changes the clock.

Of course, this assumes that you have a software problem. You won't find anything this way if just your hardware is failing.

#!/bin/bash

oldTime=$(date +%s)
oldPsOutput=$(ps faux)
while sleep 1
do
  currentTime=$(date +%s)
  currentPsOutput=$(ps faux)
  if [ "$currentTime" -lt "$oldTime" ]  # clock change detected?
  then
    echo '========='
    echo "$currentTime < $oldTime"
    echo "$oldPsOutput"
    echo ':::::::::'
    echo "$currentPsOutput"
  fi >> /tmp/clockChangeDetector.log
  oldPsOutput=$currentPsOutput
  oldTime=$currentTime
done
Alfe
0

Michael Yasumoto's answer seems to cover all the bases - I agree that you're probably looking at wonky hardware - but here's a practical-ish idea: use a reliable machine with very good internal connectivity that has a handful of cycles to spare to run an NTP server, and then do "whatever it takes" to make the NTP client running on the embedded PBX box spam this local NTP server for time requests as often as possible (e.g., every 30 seconds).
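
As a rough illustration of the "poll a local server often" idea, the `ntp.conf` on the PBX box might contain a line like the one below. The address `192.0.2.10` is a placeholder for whatever the local NTP server turns out to be, and the poll values are a guess: `minpoll` and `maxpoll` are exponents of two in seconds, so 4 and 5 mean polling every 16 to 32 seconds, which is about as close to "every 30 seconds" as ntpd allows.

```
# /etc/ntp.conf fragment (sketch; server address is a placeholder)
# Poll the local server every 2^4 = 16 to 2^5 = 32 seconds.
server 192.0.2.10 iburst minpoll 4 maxpoll 5
```

Polling this aggressively is only reasonable against a server you run yourself, not against the public pool.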

Then, when the box is finally upgraded, duly put it aside and figure out what was wrong with it At Some Point(TM). :P

i336_