mnagel

custom speed for interfaces


In some cases, it is important to be able to alter the speed of an interface (speed up, speed down, or both) to accurately reflect the connected device's speed, which may be opaque (like a cable modem that runs at 20/1 on a gigabit port).  I asked TS about this a while back and was told "just change the interface", but when I explained that this is not always feasible (e.g., on a Cisco ASA), they recommended I open a feature request.  So, here it is -- I would like to be able to set speed up and speed down on an interface, overriding what comes back from ifSpeed.  If I can do this another way (via a cloned DS), that could work, but I would like to keep this as simple as possible since it comes up quite a lot for Internet-edge equipment.

Thanks,

Mark

 


So the only requirement for this is to have some way to know what the interface's 'real' speed is. If you can get the bandwidth description via SNMP (I could not find an OID for it), then it's easy.

Add a datapoint (say, 'RealSpeed') to the Interfaces datasource that gets that speed via the right OID.

Then change the Utilization datapoint expression from this:

InOctets*8/if(lt(BasicSpeed,4294967295),BasicSpeed/1000000,Speed)/1000000*100

(which says: divide InOctets*8 by BasicSpeed if BasicSpeed is less than 4294967295, else by Speed)

to this:

InOctets*8/if(un(RealSpeed),if(lt(BasicSpeed,4294967295),BasicSpeed/1000000,Speed),RealSpeed/1000000)/1000000*100

(which says: divide InOctets*8 by RealSpeed if RealSpeed is known - else use the original expression). This assumes that the OID reports nothing when the bandwidth description is not set, and that RealSpeed is reported in bits/sec like BasicSpeed.
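To sanity-check the intended logic, here is a minimal plain-Python model of that expression (function and parameter names are illustrative; it assumes BasicSpeed and RealSpeed are in bits/sec and Speed, i.e. ifHighSpeed, is in Mbps, matching the divisors above):

```python
# Plain-Python model of the proposed Utilization datapoint expression.
MAX_IFSPEED = 4294967295  # ifSpeed's 32-bit ceiling, per the lt() guard in the expression

def utilization_pct(in_octets_per_sec, basic_speed, speed_mbps, real_speed_bps=None):
    """Interface utilization percent, honoring a manual RealSpeed override."""
    bits_per_sec = in_octets_per_sec * 8
    if real_speed_bps is not None:           # un(RealSpeed) is false: the override wins
        return bits_per_sec / real_speed_bps * 100
    if basic_speed < MAX_IFSPEED:            # lt(BasicSpeed, 4294967295)
        return bits_per_sec / basic_speed * 100
    return bits_per_sec / (speed_mbps * 1_000_000) * 100

# 12.5 Mbps of traffic on a gigabit port that really carries a 25 Mbps circuit:
print(utilization_pct(1_562_500, 1_000_000_000, 1000, real_speed_bps=25_000_000))  # 50.0
```

Without the override, the same traffic would report as only 1.25% of the gigabit port.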

 

If there is no OID to get the 'real' bandwidth: you will soon be able to manually set a property at the instance level to define the real bandwidth, and then use that property in the expression.  Instance-level properties are being developed right now, so they should be released within 3 months.


Is there an update on this feature request? To reiterate: it is important to be able to specify a service speed that can differ from the physical interface speed (the speed learned from the SNMP poll). Then, of course, we need to be able to graph, trigger alerts, etc. based on this service speed. To be complete, include both an In service speed and an Out service speed, as these are sometimes different.


There are instance properties now, but no way I have found to easily define them.  My expectation was that I could edit the instance and define properties as needed, just like on a device.  Since in the cases where it matters the data MUST be manually defined (e.g., an ASA with a Gig-E interface hooked to a 50 Mbps cable modem), this is a showstopper.

Mark

 

 


Any movement on this? The interface percentage graphs are of no help when they are based on the interface speed (1 Gb/s) and the WAN circuit speed (300 Mb/s) is considerably less. All of our prior monitoring solutions supported this common scenario. Please advise.



This is fairly standard functionality in a monitoring tool. Many WAN circuits will be delivered at a rate limited speed. As an example, a carrier circuit may be clocked at 100Mbps or 1Gbps while the service speed is limited to 10Mbps, 20Mbps or 50Mbps. Utilization reporting, alerting and forecasting should be tied to this service speed, not the interface clock rate.


I've actually been working on this problem myself, but with a bit of a hack: adding a special string to all "circuit" interface descriptions that also contains a parsable "speed". A new datasource then looks across all SNMP devices for interfaces with that string in the description. Once found, the parsed speed is used as the reference speed on the new datasource instead of the interface speed.
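A minimal sketch of that description-parsing approach (the tag format here is an invented example, not the poster's actual convention):

```python
import re

# Hypothetical tag convention: embed e.g. "[CIRCUIT:50M]" or "[CIRCUIT:1G]" in the
# interface description; the datasource parses the speed out and uses it as the
# reference speed instead of ifSpeed.
CIRCUIT_TAG = re.compile(r"\[CIRCUIT:(\d+)([MG])\]", re.IGNORECASE)

def parsed_speed_bps(if_descr):
    """Return the tagged circuit speed in bits/sec, or None if no tag is present."""
    m = CIRCUIT_TAG.search(if_descr)
    if not m:
        return None
    value, unit = int(m.group(1)), m.group(2).upper()
    return value * (1_000_000 if unit == "M" else 1_000_000_000)

print(parsed_speed_bps("Uplink to cable modem [CIRCUIT:50M]"))  # 50000000
print(parsed_speed_bps("plain access port"))                    # None
```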


This is how various FOSS tools (e.g., Observium, LibreNMS) do it, and I have considered the same thing.  I have also considered structured device properties (e.g., intf.XXX.upspeed = 20000).  Either approach requires revising multiple similar datasources and diverging from the standard LM DSes, with any future LM updates requiring a manual merge on multiple portals, but there seems to be little choice.

Regards,
Mark


With the last release, we finally added the ability to manually set instance level properties through the UI, which lets us solve this issue. 

There is a new version of the snmp64_If- datasource - this is available in the registry, but not yet in core.

Improvements in this datasource:

  • Custom speed for interfaces. Now that instance-level properties can be set through the UI (from the Info tab for any instance, via the Manage icon on the custom properties table), a custom speed can be set per interface.

Setting an instance-level property ActualSpeed and/or ActualSpeedUpstream will override the Speed and BasicSpeed values used in the interface utilization calculation. (If ActualSpeedUpstream is not set and ActualSpeed is, ActualSpeed is used for both upstream and downstream.)
Another change: Speed and BasicSpeed are now set as discovered ILPs, rather than unchanging datapoints collected every collection cycle (a minor efficiency gain).

  • Backward-compatible interface status filtering.

LogicMonitor will by default alert when interfaces change status from up to down. This is helpful for switch ports that connect to servers, routers, or inter-switch links, but less useful for ports that connect to workstations you expect to shut off every day.
To limit this behavior to a certain set of ports, you can now set the property interface_description_alert_enable. If a device has this property set, or inherits it, status alerts will only trigger for interfaces whose description matches the regular expression contained in that property. All other active ports will be discovered and monitored, but their status changes (or flapping) will not be alerted on. (If the property is not set, the current behavior of alerting on status changes for all interfaces is maintained.)

For example, setting interface_description_alert_enable to "core|uplink" on a group will cause all network devices in that group to alert for status changes only on interfaces with "core" or "uplink" in the interface description.
All other interfaces will be monitored, but will not have status alerting enabled. (Other alerts, such as for excessive discards, will still be in effect.)

To exclude all interfaces with the word bridge in the description, set interface_description_alert_enable to ^((?!bridge).)*$
(That's a regular-expression negative lookahead.) All interfaces except those with bridge in the description will have status monitored as normal.

  • A change in how the discard percentage is calculated: it will not trigger until there are at least 50 discards per second, as well as the relevant percentage of drops. (This used to be 50 unicast packets per second, but that would still cause alerts on the standby interface of bonded interfaces.)

These changes are backward compatible, and do not lose any datapoint history.
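The ActualSpeed / ActualSpeedUpstream precedence described above can be sketched in a few lines of Python (the function name is illustrative; units follow whatever the properties are set in):

```python
# Precedence per the post: ActualSpeedUpstream falls back to ActualSpeed,
# and either overrides the SNMP-discovered speed for utilization math.
def effective_speeds(discovered_speed, actual_speed=None, actual_speed_upstream=None):
    """Return the (downstream, upstream) speeds used for utilization calculations."""
    down = actual_speed if actual_speed is not None else discovered_speed
    if actual_speed_upstream is not None:
        up = actual_speed_upstream
    elif actual_speed is not None:
        up = actual_speed   # ActualSpeed alone covers both directions
    else:
        up = discovered_speed
    return down, up

# A 20/1 cable modem on a gigabit port:
print(effective_speeds(1000, actual_speed=20, actual_speed_upstream=1))  # (20, 1)
```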

This new datasource is accessible from the registry using the locator code: 

KYE6HN

Note: This datasource has not been through our final internal QA, but is believed reliable (we're running it internally!). It will be improved in a minor way shortly (a future server release will negate the need to collect interface descriptions as an instance level property), and released to core after that - but that change will be backward compatible, for those of you wishing to adopt this early.
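The interface_description_alert_enable matching described above can be demonstrated with Python's re module (LogicMonitor's exact regex engine and match semantics may differ; this is just an illustration):

```python
import re

# The two example patterns from the post
CORE_OR_UPLINK = r"core|uplink"
NOT_BRIDGE = r"^((?!bridge).)*$"  # negative lookahead: match unless "bridge" appears

def status_alerts_enabled(if_descr, pattern):
    """True if status-change alerting would be enabled for this description."""
    return re.search(pattern, if_descr) is not None

print(status_alerts_enabled("uplink to dist-sw-01", CORE_OR_UPLINK))  # True
print(status_alerts_enabled("workstation port", CORE_OR_UPLINK))      # False
print(status_alerts_enabled("bridge to lab", NOT_BRIDGE))             # False
print(status_alerts_enabled("core trunk", NOT_BRIDGE))                # True
```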


This is very encouraging -- thank you for the update!  There is one remaining architectural issue to resolve -- there is no way to reference ILPs within alerts or other places where tokens can be used.  Is this something that will be included in the final version?  It would be helpful to show at least some of those values in related alerts.

One more observation on the interface-description filtering behavior -- I have developed an API-based method to do the same thing, and the one problem we run into is that AD will not update the description of an interface that has operStatus != up at the time AD runs.  This means that if you want to exclude ports from alerting based on description, changing the description to stop LM from alerting will have no effect if the interface is already down -- this defeats the purpose of avoiding alarms for unimportant interfaces, at least without manual instance tuning.  I have played with removing the operStatus filter from AD, but the results were not good.  It really seems like there should be some datapoints that can be updated regardless of the filter and others that update only when the filter applies.  Or, more generally, datapoint groups within a datasource, each able to have a distinct filter.

The new DS should also have a default non-unicast threshold, though my challenge there is that I care less about one DP exceeding it (in most cases) than about seeing it across multiple ports and/or multiple devices.

Thanks,
Mark


I opened a ticket to allow ILPs to be used as tokens in alert messages - thanks for that.

What was wrong with removing the operStatus filter? We're actually thinking of removing it in another update, once we can group interfaces automatically based on status (so you wouldn't have to look at the down ones).

What non-unicast threshold would you want? As a percent of unicast, or ...?


Excellent on the token thing -- looking forward to that!

What I found when I removed the operStatus AD filter was that a bunch more interfaces reported alarm almost immediately.  I think my script to deactivate alerts would have eventually caught up, but it was super noisy so I quickly reverted.  I need to look at the new way as my method was a necessary evil given the tools available.  I did notice later, though, that failure to update the description for down interfaces made my script less useful than intended.

Non-unicast is a funny thing.  Acceptable levels vary across environments (I have seen Nexus 7K cores handle 50000 pps without breaking a sweat -- not good, but not deadly to the switch), but there are levels that are absolutely bad in typical environments.  I normally do not set thresholds on percentage, as that could trigger for ports on otherwise inactive hosts seeing little other than non-unicast traffic.  A rule of thumb is that for access ports, under 200 pps can be safely ignored (though it is still high).  Trunk ports tend to be higher, since you see the combined levels for all VLANs on the trunk.  When we see "freak out" levels, they are in the 1000 pps or higher range.  Translating to LM-speak, I would start with "> 200 1000 2000" (but again, it is hard to set just one good threshold).
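An LM-style "> 200 1000 2000" threshold (escalating severities above those values) behaves roughly like this sketch (severity names and the function are illustrative):

```python
# Rough model of a three-stage "> 200 1000 2000" threshold on non-unicast pps.
def nonunicast_severity(pps, warn=200, error=1000, critical=2000):
    if pps > critical:
        return "critical"
    if pps > error:
        return "error"
    if pps > warn:
        return "warn"
    return "ok"

print(nonunicast_severity(150))   # ok    (typical access-port background level)
print(nonunicast_severity(1500))  # error ("freak out" territory)
```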


I remember the days when 100 broadcast packets per second could hang a 486...

Nowadays my inclination would be to just set a warning on "> 1000".  Excess non-unicast is still bad, but unlikely to be bad enough to impact traffic - and if it is, discards and other thresholds should trigger.

So that would allow investigation in the case of a legitimately problematic non-unicast level, but not generate alerts for situations that are not impacting things, and would otherwise be considered noise alerts. And we should add a "top 25" graph for inbound non-unicast traffic on all interfaces to our default network dashboard, for people that are inclined to investigate this more closely....

(On our infrastructure, we have 200 non-unicast pps on our busiest 10 G ethernet trunks....)

Seem reasonable?


I can live with a default of 1000 as it would definitely catch storms, which is the main goal.  We have settled (in our other tools) on 100-200pps as a default with higher levels for trunks.  We also try to watch for such things over extended periods, but average-over-time is a different FR I am awaiting action on :).

When I listed "> 200 1000 2000", I was trying to keep to the "LM way" :).


Note: as of v100, instance-level properties now work as tokens in alert messages. Development tells me they did prior to v100 - which I thought I tested and found they didn't - but in any case, they definitely work in v100.



On 2/3/2018 at 6:19 PM, Steve Francis said:

Note: as of v100, Instance level properties now work as tokens in alert messages. Development tells me they did prior to v.100 - which I thought I tested, and found the didn't - but in any case they definitely work in v.100.

 

Thanks!  Is the format documented, or is it literally the property name within the instance, just in scope for the datapoint instance at alert time?  How clashes with device property names are avoided is my real question, I guess...

On 04/01/2018 at 11:00 PM, Steve Francis said:

Now we support setting instance level properties through the UI (from the Info Tab for any instance via the Manage icon on the custom properties table), we can solve setting custom speed for interfaces.

Setting an instance level property ActualSpeed and/or ActualSpeedUpstream (if different from downstream - if ActualSpeedUpstream is not set, and ActualSpeed is set, ActualSpeed will be used for both upstream and downstream) will override the Speed and BasicSpeed values, used for interface utilization calculation.
Another change - Speed and BasicSpeed are now set as discovered ILPs, rather than unchanging datapoints that were collected every collection cycle (minor efficiency gain).


Can you give some examples of what ActualSpeed should be set to? Is it bits? bytes?


I'm doing a POC with LM and this was one of the things that I was concerned about.  I'm still learning the system and I appreciate any help I can get with this.

I have a 25Mbps fiber connection in a Gig port, so interface utilization % was nearly useless.  On that port, I added a custom property called "ActualSpeed" with a value of 25.  Then I went in and added the snmp64_If datasource from the LM repository.

I let it ride overnight, but utilization for that port still seems to be based on 1000Mbps, so am I missing another step?

Thanks!


Hey - I took the liberty of looking at the logs for your account - it looks like you didn't actually import the new datasource version. (Which our UI makes easy to do... something we're fixing.)

You want to make sure you import from the Exchange, not the repository, then give it the locator code RMP2DL


This is a slightly updated version of the one I mention above - a few efficiency improvements in the code.

The only substantive change is that the property used to filter in interfaces by description has been renamed interface.description.alert_enable, as it's a bit more descriptive.

The bandwidth properties are still ActualSpeed and ActualSpeedUpstream.

 

Let me know if you have any issues.


This is very nice, and it also solves the issue of not alarming when the interface is admin down.  May I suggest not calling the Operational Status "StatusRaw" and just using OperState (similar to AdminState)?

When will this be deployed to core?

2 minutes ago, Jamie said:

This is very nice, and it also solves the issue of not alarming when the interface is admin down.  Can I make a suggestion of not calling the Operational Status "StatusRaw" and just using OperState (similar to AdminState)?

When will this be deployed to core?

I had to solve this in one portal by creating a complex datapoint in the original DS (I still need to take a look at the revised DS in this thread).  Getting that combined state is harder than you might expect because of a silly design flaw in the DP expression language -- there is neither a 'not' operator nor a 'neq' operator.  So in mine, to know that something is admin up and oper "not up", I have 'ActualStatus' calculated as "and(eq(AdminStatus,1),eq(Status,2))" instead of the more correct "and(eq(AdminStatus,1),neq(Status,1))". Simple to fix in the language interpreter I am sure, but I am not holding my breath.  The one I did works well enough, though, so I can get reliable alerts for unexpected ports down on core devices.
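The gap between the workaround and the intended logic is easy to see in a quick sketch (ifOperStatus values per IF-MIB: 1=up, 2=down, 7=lowerLayerDown; the function names are mine):

```python
# and(eq(AdminStatus,1), eq(Status,2)) -- the workaround: only catches oper "down"
def actual_status_workaround(admin_status, oper_status):
    return int(admin_status == 1 and oper_status == 2)

# and(eq(AdminStatus,1), neq(Status,1)) -- what a neq() operator would allow
def actual_status_ideal(admin_status, oper_status):
    return int(admin_status == 1 and oper_status != 1)

print(actual_status_workaround(1, 2), actual_status_ideal(1, 2))  # 1 1
print(actual_status_workaround(1, 7), actual_status_ideal(1, 7))  # 0 1  (lowerLayerDown is missed)
```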


I was using a complex datapoint (formula) as well, essentially checking if adminstate = 1, if true then using that as a multiplier against operstate and carrying that value through.  If false, I was instead using 0 as the multiplier and alarming on anything greater than 1.  A bit hacked, but effective.
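That multiplier trick can be rendered in plain Python like so (an illustrative sketch, not the actual datapoint formula):

```python
# Carry operStatus through only when adminStatus is up (1); otherwise emit 0.
# Alerting on values greater than 1 then means "admin up but oper not up".
def status_datapoint(admin_status, oper_status):
    return oper_status if admin_status == 1 else 0

print(status_datapoint(1, 2))  # 2 -> alerts (greater than 1)
print(status_datapoint(2, 2))  # 0 -> admin down, no alert
print(status_datapoint(1, 1))  # 1 -> up/up, no alert
```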

