mnagel

Alerts if there are X failures over Y time

It is currently impossible to detect certain conditions without being bombarded by noise alerts, which I am told is against the philosophy of LogicMonitor.  Consider a few cases:

    * An interface flaps a few times versus more frequently -- how do you tell the difference?  Right now, you have no choice other than perhaps to construct an API script (not tested).  A better solution in this example would be to count the number of flaps over a period of time and use that count as your alert trigger.  As it stands, there is not even a way to select the top 10 most unstable interfaces, since the value is literally yes or no and a top 10 makes no sense.

    * Resource utilization (bandwidth, CPU, etc.) is sometimes much better checked over a period of time than at a single interval.  The answer I have received on that is "require N checks to fail", and this works if the resource is pegged, but not if it is spiky.  As it stands, the longer the period you want to simulate via "N checks", the higher the chance that one passing check will reset the alert, even though the overall result is clearly bad on inspection.
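
To make the "X failures over Y time" idea concrete, here is a minimal sketch of a sliding-window counter. It is not LogicMonitor functionality; the class name `WindowedFailureCounter` and its parameters are hypothetical, and timestamps are assumed to be seconds. Unlike a "require N consecutive checks" rule, a single passing check does not reset the count.

```python
from collections import deque


class WindowedFailureCounter:
    """Hypothetical helper: alert on X failures within Y seconds."""

    def __init__(self, max_failures, window_seconds):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of failures still inside the window

    def record(self, timestamp, failed):
        """Record one poll result; return True if the alert condition holds."""
        if failed:
            self._failures.append(timestamp)
        # Drop failures that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures

    def count(self):
        """Number of failures currently inside the window."""
        return len(self._failures)


# Three flaps within 10 minutes, with passing polls in between.
counter = WindowedFailureCounter(max_failures=3, window_seconds=600)
polls = [(0, True), (60, False), (120, True), (180, False), (240, True)]
results = [counter.record(ts, failed) for ts, failed in polls]
```

Because `count()` is a number rather than a yes/no value, the same counter could also be used to rank interfaces by instability.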

Please note this problem was solved long ago by other tools, like Zabbix (https://www.zabbix.com/documentation/3.4/manual/config/triggers/expression), so hopefully this can be added to LM in the near future as well.

I second this request.  The ability to incorporate a time-based / duration-based metric for datasources such as CPU / memory usage (especially) would be really useful.  We would like to be able to implement this for scenarios such as: if Device A breaches the configured CPU threshold for more than 1 hour, generate an alert; if it breaches for less than an hour, do nothing.

We have different application teams that would want the duration to be customisable, in line with their applications' behaviour.
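
As a rough illustration of the duration-based rule described above, the sketch below alerts only once a value has stayed over the threshold for a configurable time. The name `SustainedBreachDetector` and its parameters are hypothetical, not an existing LM feature, and timestamps are assumed to be seconds. Treating a breach of exactly the configured duration as an alert is also an assumption.

```python
class SustainedBreachDetector:
    """Hypothetical helper: alert only after a continuous breach of a set duration."""

    def __init__(self, threshold, min_duration_seconds):
        self.threshold = threshold
        self.min_duration_seconds = min_duration_seconds
        self._breach_started = None  # timestamp of the first poll in the current breach

    def update(self, timestamp, value):
        """Feed one poll; return True if the breach has lasted long enough."""
        if value > self.threshold:
            if self._breach_started is None:
                self._breach_started = timestamp
            elapsed = timestamp - self._breach_started
            return elapsed >= self.min_duration_seconds
        # Value is back under the threshold, so the breach clock restarts.
        self._breach_started = None
        return False


# CPU threshold of 90% that must be breached for one hour (3600 s).
detector = SustainedBreachDetector(threshold=90, min_duration_seconds=3600)
states = [
    detector.update(0, 95),     # breach starts
    detector.update(1800, 97),  # 30 minutes in, still quiet
    detector.update(3600, 92),  # one hour in, alert
    detector.update(3900, 50),  # recovered
    detector.update(4200, 95),  # new breach, clock restarts
]
```

Each application team could get its own detector instance with a different `min_duration_seconds`.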

Is this something that is in the development team's backlog?

Edited by Mike Suding

I also need this.  I'd like an alert to trigger if the value breaches a threshold for n consecutive polls, where the threshold is the average of the values from the past day/week/month, sampled at a given interval over the selected period.  For example: is CPU% greater than the average of the CPU% readings from the past 7 days, taken at 1-hour intervals?
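
One way to read this request is sketched below, under the assumption that the baseline is a plain arithmetic mean of the sampled history and that "breach" means strictly greater than that mean. The function name `breaches_baseline` and its arguments are hypothetical.

```python
from statistics import fmean


def breaches_baseline(history, recent, n_consecutive):
    """Return True if the last n_consecutive polls all exceed the mean of history.

    history: baseline samples, e.g. hourly CPU% readings from the past 7 days.
    recent:  the latest poll values, oldest first.
    """
    if n_consecutive < 1:
        raise ValueError("n_consecutive must be at least 1")
    if not history or len(recent) < n_consecutive:
        return False  # not enough data to decide
    baseline = fmean(history)
    return all(value > baseline for value in recent[-n_consecutive:])


# Baseline mean is 30.0; the last three polls are all above it.
history = [20, 30, 40, 30]
sustained = breaches_baseline(history, [25, 35, 45, 50], n_consecutive=3)
# One poll at 25 breaks the run of consecutive breaches.
interrupted = breaches_baseline(history, [35, 25, 45], n_consecutive=3)
```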

We would also like to have this option for EventSource alerts.  Our use case: we have an application that logs to the Windows Event Log.  There are some events that do not concern us if they occur once or twice within an hour, but if the event were to occur 20 times in an hour, we would want an alert (and only one alert).
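
The "20 times in an hour, and only one alert" behaviour could be sketched as a windowed event counter with a latch. This is an illustration only, not an EventSource feature; `EventRateAlert` and its parameters are hypothetical, and timestamps are assumed to be seconds.

```python
from collections import deque


class EventRateAlert:
    """Hypothetical helper: one alert when N events occur within a time window."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = deque()    # timestamps of events inside the window
        self._alert_open = False  # latch that suppresses duplicate alerts

    def on_event(self, timestamp):
        """Record an event; return True only when it newly opens an alert."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        if len(self._events) < self.max_events:
            self._alert_open = False  # rate dropped, so a later burst can alert again
            return False
        if self._alert_open:
            return False  # already alerted for this burst
        self._alert_open = True
        return True


# 25 events, one per minute: only the 20th event raises an alert.
alert = EventRateAlert(max_events=20, window_seconds=3600)
fired = [alert.on_event(minute * 60) for minute in range(25)]
```

Note that the latch is only re-evaluated when a new event arrives; a real implementation would probably also clear it on a timer.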

Yeah, I posted at least one FR on this -- it would be necessary to define a correlation key that tracks an incident.  We have used SEC for this previously, which provided primitives for handling incidents using that key, but there is no similar capability in LM.  We will probably look at moving Windows event capture into SumoLogic, as we were forced to do after finding that syslog from routers and switches does not work.
