David Lee

497 days and counting...


You might have received an alert saying your Linux-based device has just rebooted, but you know it has been up for a long time.

A switch might have just sent an alert for every interface flapping, even though they have all been solidly up.

The important question to ask here is: how long has the device been up?

If it's been up for 497 days, 994 days, 1,491 days, or any multiple of 497, then you are seeing the 497-day bug, which hits almost every Linux-based device that stays up for a good length of time.

Anything using a kernel older than 2.6 computes the system uptime from the internal jiffies counter, which counts the time since boot in units of 10 milliseconds, or jiffies. This counter is 32 bits wide, so it wraps after 2^32 ticks (4,294,967,296).

When the counter reaches this value (after 497 days, 2 hours, 27 minutes, and 53 seconds, or approximately 16 months), it wraps back around to zero and continues to increment.
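To make the rollover concrete, here is a minimal Python sketch (illustrative only, not kernel or LogicMonitor code) of what the derived uptime looks like once a 32-bit counter ticking 100 times per second wraps:

```python
# A 32-bit jiffies counter at 100 ticks/second rolls over after 2^32 ticks,
# roughly 497.1 days, and the derived uptime drops back to near zero.

HZ = 100            # ticks per second on these older kernels
WRAP = 2 ** 32      # a 32-bit counter rolls over after 2^32 ticks

def reported_uptime_days(true_uptime_days):
    """Uptime as recovered from the wrapped jiffies counter."""
    ticks = int(true_uptime_days * 86400 * HZ) % WRAP
    return ticks / HZ / 86400

for days in (100, 496, 497.2, 994.3):
    print(f"actual uptime {days:6.1f} days -> reported {reported_uptime_days(days):6.1f} days")
```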

This can result in alerts about reboots that didn’t happen and cause switches to report a flap on all interfaces.

Systems that use the 2.6 kernel and properly supply a 64-bit counter will still alert incorrectly when that 64-bit counter wraps.

A 32-bit counter can hold 4,294,967,295 ticks (4,294,967,295 / 8,640,000 = 497.1 days).

A 64-bit counter can hold 18,446,744,073,709,551,615 ticks (18,446,744,073,709,551,615 / 8,640,000 ≈ 2,135,039,823,346 days, or about 5,849,424,173 years).
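For reference, the same arithmetic re-derived in a few lines of Python:

```python
# Re-deriving the figures above: 100 ticks/second is 8,640,000 ticks per day.

TICKS_PER_DAY = 100 * 86400                  # 8,640,000

max_32 = 2 ** 32 - 1                         # 4,294,967,295
max_64 = 2 ** 64 - 1                         # 18,446,744,073,709,551,615

print(max_32 / TICKS_PER_DAY)                # ~497.1 days
print(max_64 / TICKS_PER_DAY)                # ~2,135,039,823,346 days
print(max_64 / TICKS_PER_DAY / 365)          # ~5,849,424,173 years
```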

Though I expect that in 6,000 million years we will all have other things to worry about.


We now handle this properly in LogicMonitor if you import these modules:

DataSource: SNMP_HostUptime_Singleton - 42439N
PropertySource: addCategory_snmpUptime - T9WFM4

These replace SNMPUptime-.
 


Is it possible to add more detail on this topic? I tried to implement the fix last night for a couple of Cisco fabric interconnect switches, but it didn't seem to work. I have several "Uptime" datasources now, and I don't know which one includes the fix or what device types they apply to. I also have a question about how this fix works: is the fix purely on the LM datasource side, or is a software update of some kind required on the Cisco side?

I now have at least 5 different datasources that appear to monitor uptime. What's the best way to identify which ones include the fix?

Host Uptime-  SNMP OID field says .1.3.6.1.2.1.25.1.1

HostUptime-  SNMP OID field says .1.3.6.1.2.1.25.1.1

SNMP_Engine_Uptime-  SNMP OID field says 1.3.6.1.6.3.10.2.1.3

SNMP_HostUptime_Singleton  SNMP OID field says .1.3.6.1.2.1.25.1.1.0

SNMPUptime-  SNMP OID field says .1.3.6.1.2.1.1.3

 


@Kwoodhouse the one that includes the fix is SNMP_HostUptime_Singleton. It requires the addCategory_snmpUptime PropertySource to work without manual intervention.

"HostUptime-" (no space) is deprecated and no longer in core. Unfortunately there's no way for you to get that information in your account currently.

SNMPUptime- and SNMP_Engine_Uptime- are more or less duplicates. They both get the uptime for the agent, not the host. This seems to be an oversight.

Originally, we just looked at the uptime value with a gauge datapoint. If the value indicated uptime of less than 60 seconds, we'd alert. Of course, that also happens during a counter wrap. To fix it, we started tracking the uptime counter with a counter datapoint as well. Given that the rate of time is constant, we should always see a rate of 100 ticks/second coming back from the counter datapoint if the host hasn't been rebooted.
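If it helps, here's a minimal sketch of counter/derive-style rate math (my own illustration, not the actual collector code) showing why a wrap still comes back as roughly 100 ticks/second:

```python
# The delta between two polls is taken modulo 2^32, so a counter wrap still
# works out to ~100 ticks/second as long as the host stayed up. A real reboot
# resets the counter, and the same modulo math then produces a huge rate.

WRAP = 2 ** 32

def tick_rate(prev_ticks, curr_ticks, interval_seconds):
    """Per-second rate, treating a smaller current value as a 32-bit wrap."""
    delta = (curr_ticks - prev_ticks) % WRAP
    return delta / interval_seconds

# Counter wrapped between two 60-second polls, host never rebooted: ~100/sec.
print(tick_rate(4_294_964_000, 2_704, 60))       # 100.0

# Host actually rebooted ~30 seconds ago: the rate blows up instead.
print(tick_rate(2_000_000_000, 3_000, 60))       # ~38,249,505
```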

The logic in the UptimeAlert CDP looks at both that tick rate and the raw uptime to determine whether the host has rebooted or the counter has just wrapped. If it's just a counter wrap (no reboot), we'll see 100 ticks/second even though the gauge shows less than 60 seconds of uptime. If the host has rebooted, the UptimeCounter datapoint returns either No Data (counters need two consecutive polls) or a huge value, because no polls were missed and LM assumed the counter wrapped when it was really reset by the reboot.

This is explained in the datapoint descriptions, but is admittedly a bit difficult to grok without an intimate understanding of how LM's counter/derive datapoints work. I do still think it's a rather ingenious solution.

We use "102" instead of "100" ticks/second in the CDP to avoid false positives, as the collection interval isn't always exactly a minute.
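Putting it together, here is a rough sketch of the decision described above (my own approximation of the UptimeAlert logic; the actual complex datapoint expression in SNMP_HostUptime_Singleton may differ):

```python
def should_alert(raw_uptime_seconds, tick_rate):
    """True if the host looks rebooted, False if it's just a counter wrap."""
    if raw_uptime_seconds >= 60:
        return False          # uptime looks normal, nothing to do
    if tick_rate is None:
        return True           # No Data: counters need 2 polls, so fresh reboot
    if tick_rate <= 102:      # ~100 ticks/sec means the counter wrapped;
        return False          # 102 absorbs collection-interval jitter
    return True               # huge rate: the counter was reset by a reboot

print(should_alert(30, 100.4))        # False - wrap, no reboot
print(should_alert(30, None))         # True  - just rebooted
print(should_alert(30, 38_249_505))   # True  - counter reset by reboot
```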

I recommend this blog if you're interested in learning more about counter/derive: https://www.logicmonitor.com/blog/the-difference-between-derive-and-counter-datapoints/

I will talk to the Monitoring team about removing some of those duplicates, and getting a public document up explaining it all.
