mnagel

dependencies, again


We continue to do battle with LM when alerts trigger due to dependent resource outages.  I know the topology mapping team is working on alert suppression, but I am not convinced that will solve all problems, regardless of how well they succeed.  We really need a way to set up dependencies within logic modules, and it should not need dozens of lines of API code each time (most of which should be made available as a library function, IMO).

One fresh, specific example: a site with multiple firewalls in a VPN mesh running BGP.  One firewall goes down, then all the other firewalls report that BGP is down.  We care about BGP down, so we have alerts trigger escalation chains.  It should be possible to define a dependency in the datapoint that suppresses the alert if the remote peer IP is in a down state.  There is no way to express this in LM right now, which leads to alerts arriving in batches, and that leads to numb customers who ignore ALL alerts.
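To make the request concrete, here is a minimal sketch in plain Python of the rule I have in mind.  It does not use any LM API; `BgpPeerAlert`, `should_notify`, and the sample IPs are all hypothetical names and values made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BgpPeerAlert:
    """Hypothetical record of one 'BGP down' alert."""
    reporting_device: str  # firewall that raised the alert
    peer_ip: str           # remote peer whose session dropped


def should_notify(alert, down_ips):
    """Suppress the alert when the remote peer itself is known to be down."""
    return alert.peer_ip not in down_ips


# One firewall (10.0.0.1) is down; its mesh peers all report BGP down.
alerts = [
    BgpPeerAlert("fw2", "10.0.0.1"),
    BgpPeerAlert("fw3", "10.0.0.1"),
    BgpPeerAlert("fw2", "10.0.0.9"),  # peer is up, so this one is a real problem
]
down_ips = {"10.0.0.1"}

to_notify = [a for a in alerts if should_notify(a, down_ips)]
```

Only the alert for the peer that is still up survives the filter.  The two alerts pointing at the dead firewall are dropped, and the firewall's own down alert is the one that should page someone.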


We are currently in beta for an alert notification suppression feature that uses topology relationships to map out dependent relationships. We are actively working to extend topology coverage to additional devices and technologies, and to expand the triggering of dependency evaluation beyond Ping-PingLossPercent and HostStatus-idleInterval.

For the example you provide, if BGP was supported by topology, the dependency setup would involve selecting a root or entry-point device, such as the collector's server or the remote peer, on which other resources are dependent. Since Ping and HostStatus would not be sufficient, we would add a definition of which datapoints determine a down state for a given device, so that a single dependency evaluation definition could be applied while still varying by device type. This would allow a down resource to trigger dependency evaluation for its parent and child resources in order to determine root cause and add contextual metadata to alerts, so that notifications are suppressed based on the role the alerting resource plays in the dependency incident (e.g., dependent resource alerts would be suppressed, while originating/root cause resource alerts would be routed for notification).
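As a rough illustration of that role assignment only, and not a description of how the feature is implemented, the evaluation can be sketched in plain Python. The `parent_of` map, the `classify` function, and the device names are hypothetical.

```python
def classify(resource, parent_of, down):
    """Return 'dependent' if any ancestor of a down resource is also down,
    otherwise 'originating' (the root cause candidate)."""
    seen = {resource}  # guards against cycles in the dependency map
    node = parent_of.get(resource)
    while node is not None and node not in seen:
        if node in down:
            return "dependent"
        seen.add(node)
        node = parent_of.get(node)
    return "originating"


# Hypothetical dependency map: child -> parent, with fw1 as the entry point.
parent_of = {"fw2": "fw1", "fw3": "fw1", "switch3": "fw3"}
down = {"fw1", "fw3", "switch3"}

roles = {r: classify(r, parent_of, down) for r in sorted(down)}
# Route notifications only for originating resources; suppress the rest.
to_route = [r for r, role in roles.items() if role == "originating"]
```

In this sample, `fw1` is the only originating resource, so it is the only one routed for notification, while `fw3` and `switch3` are tagged as dependent and suppressed.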


The key here is "if BGP was supported...".  What if it is not?  Do you think it would be, given this specific case?  I think it could be (i.e., peering topology identified), but to the extent it is not (or anything else is not), I think we need a way to reflect the dependency without serious programming effort, to avoid alarm storms.  I guess we have something to chat about next time we meet :)


While topology coverage expansion is a high priority, we also have initiatives planned for the next year that are intended to reduce alert noise using sources other than topology. One example is alert grouping, where we group related alerts to avoid alert storms; this grouping does not rely on topology.

For topology coverage expansion, the LM Exchange may help as well, as a place to develop new modules that can be shared with the community.

But yes, definitely have more to chat about on our next call! 

