SchuylerC

Acknowledged date for a repeating alert condition still shows the original acknowledged date

We've been seeing an issue where we get a critical alert, we are notified through our escalation chains, and we acknowledge the alert. However, the action we take to resolve the alert is only enough to drop the severity of the alert to error or warning, not clear it entirely. If that alert crosses the critical threshold again, it shows up as acknowledged from the first time it went critical, which prevents all notifications.

For example, we have thresholds for percent used on a volume at >=90 95 98. The volume hits 98%, we are notified and ack the alert, but we are only able to clear enough space to bring the volume down to 92%. If that volume hits 98% again, the alert shows up as already acknowledged, which prevents all notifications (see below):

Untitled.png

This is the expected behavior according to LM, but I don't see a benefit in this behavior, and it seems risky if you expect to get alerted any time a threshold is crossed. We'd like to be able to receive a notification any time an alert crosses a threshold, regardless of whether it has been acknowledged at a higher severity for that alert "instance."

There's a fine line between sending too many alerts and not enough.

When an alert increases in severity over the state it was acknowledged in, we treat the new level as un-acknowledged.

But in this case, the alert has been acknowledged at the critical level. If it drops to error, but doesn't clear entirely, then goes back to critical, we regard it as being in the same alert session, so the same critical acknowledgement applies. If the alert clears entirely, future increases to critical are not treated as acknowledged.

We do this for a few use cases:

  • A metric that is oscillating over a threshold (e.g. a disk volume that is 97%, then 98%, then 97%, then 98%, then 97%, etc.). You probably do not want fresh escalations each time it bursts over 98%.
  • Philosophically, the system treats the acknowledgement of the alert at the critical level as someone saying "I will assume ownership of this issue, up to this severity." In this case, it's the maximum severity (critical), so it is ownership of the issue until it clears.
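
The session rule described above can be sketched roughly as follows. This is a simplified model for illustration only, not LogicMonitor's implementation: the `AlertSession` class, its method names, and the numeric severity ranks are all assumptions, and a real escalation chain does more than make a single yes/no notification decision.

```python
from dataclasses import dataclass
from typing import Optional

# Severity ranks (illustrative); 0 means the alert has cleared.
CLEARED, WARNING, ERROR, CRITICAL = 0, 1, 2, 3


@dataclass
class AlertSession:
    """Hypothetical model of one alert session."""
    severity: int = CLEARED
    acked_at: Optional[int] = None  # highest severity acknowledged this session

    def acknowledge(self) -> None:
        # An ack covers the current severity and everything below it.
        self.acked_at = max(self.acked_at or CLEARED, self.severity)

    def update(self, new_severity: int) -> bool:
        """Apply a new severity; return True if a notification would be sent."""
        rose = new_severity > self.severity
        self.severity = new_severity
        if new_severity == CLEARED:
            # A full clear ends the session, so the old ack no longer applies.
            self.acked_at = None
            return False
        covered = self.acked_at is not None and new_severity <= self.acked_at
        return rose and not covered


session = AlertSession()
print(session.update(CRITICAL))  # True: nothing acknowledged yet
session.acknowledge()            # acknowledged at critical
print(session.update(ERROR))     # False: severity dropped
print(session.update(CRITICAL))  # False: same session, the critical ack still applies
session.update(CLEARED)          # a full clear ends the session
print(session.update(CRITICAL))  # True: new session, no ack carried over
```

The third call is the case this thread is about: the alert never cleared, so the earlier critical acknowledgement still suppresses the notification.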

If you want to prevent alert escalation of the criticals, but are unable to clear the issue, I'd suggest not acknowledging the alert immediately - instead put the instance in scheduled downtime for 1 hour (which will stop escalation).  Work on the issue, clear it to error (or warn).  If you are unable to improve beyond that, acknowledge that state. (Or adjust thresholds.) Then a future increase to critical will be escalated.

First, I understand not wanting to get overwhelmed by notifications, but in the case you mentioned above (which seems to be the only use case I can see for this behavior), it seems to me that it would be better handled by the alert trigger intervals and alert clear intervals.
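
To illustrate the point about intervals, here is a rough sketch of how requiring several consecutive readings before opening or clearing an alert damps an oscillating metric. The `debounce` function and its parameters are made up for this example and are not LogicMonitor settings or API calls; the readings reuse the 97%/98% pattern from the post above.

```python
def debounce(readings, threshold, trigger_interval, clear_interval):
    """Yield (reading, alerting) pairs using consecutive-reading counts.

    Hypothetical helper: the alert opens only after `trigger_interval`
    consecutive readings at or above `threshold`, and clears only after
    `clear_interval` consecutive readings below it.
    """
    alerting = False
    streak = 0  # consecutive readings that disagree with the current state
    for value in readings:
        over = value >= threshold
        if over != alerting:
            streak += 1
            needed = trigger_interval if over else clear_interval
            if streak >= needed:
                alerting = over
                streak = 0
        else:
            streak = 0
        yield value, alerting


readings = [97, 98, 97, 98, 97, 98, 98, 98]
states = [alerting for _, alerting in debounce(readings, 98, 3, 3)]
print(states)  # only the last reading is alerting: three 98s in a row
```

With a trigger interval of 3, the single bursts over 98% never open the alert, so there is nothing to re-escalate until the volume stays over the threshold.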

Second, when an alert crosses a threshold a second time, a week after the original acknowledgement (as we saw in my first post), I think it is safe to assume that it should be considered a new "alert session."

In our case, we understand that some of our thresholds need to be adjusted to be more realistic, and we are working on that to mitigate this issue, but it still seems like common sense that notifications should be sent whenever a threshold is crossed. I'd like to see an option to enable that behavior if it is not set that way by default.

I have brought this up before and was shot down with "works as designed."

We 100% agree with this statement:
"Second, when an alert crosses a threshold the second time a week after the original acknowledgement (as we saw in my first post) I think it is safe to assume that should be considered a new 'alert session.'"

We have cases with the following conditions:
1. alert triggers on warning threshold
2. NOC acks with "monitoring"
3. alert crosses error threshold
4. NOC escalates to SME
5. NOC acks with "escalating to SME"
6. alert crosses critical threshold
7. NOC acks with "incident created. Management informed"
8. SME remediates just enough to move the alert down to warning
9. SME informs NOC issue fixed
10. NOC closes incident and resumes watching the alert page
11. alert crosses error threshold
12. No notification
13. alert crosses critical threshold
14. No notification
15. server crashes
16. People ask why no alert....

For a monitoring service, overcommunication is 100x more acceptable than a server crashing.

Following up on this because we keep seeing this "feature" cause issues for us. We are now seeing that it breaks some of the functionality of the ServiceNow integration. We have tickets in ServiceNow set to close once the alert clears in LogicMonitor. But in situations like the one below, where the alert is acknowledged at one severity, changes to another, and then changes back to the original severity, no notifications are sent. See below:

TCR-Veeam-01b.PNG

Here we have a drive beginning to fill up. The initial alert creates a ticket, which is updated twice as the alert jumps to error and then critical severity. No updates were sent to the ticket after the critical alert. The ticket still shows as an open critical because the notifications were suppressed due to previous acknowledgements. I don't like that we have to choose between ack'ing alerts and having tickets be updated correctly.

Again, I would like to see an option to enable notifications whenever a threshold is crossed, even if the alert has been ack'd at that severity, without the alert clearing entirely.
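
As a rough illustration of the option being requested, the decision could look something like the sketch below. Everything here is hypothetical: `notify_on_every_crossing` is not an existing LogicMonitor setting, the severity ranks are made up, and the default branch is only an approximation of the behavior described in this thread.

```python
from typing import Optional

# Severity ranks (illustrative); 0 means the alert has cleared.
CLEARED, WARNING, ERROR, CRITICAL = 0, 1, 2, 3


def should_notify(previous: int, current: int, acked_at: Optional[int],
                  notify_on_every_crossing: bool = False) -> bool:
    """Decide whether a severity change sends a notification or ticket update."""
    if current == previous:
        return False  # no threshold was crossed
    if current == CLEARED or notify_on_every_crossing:
        # A clear always goes out (so the ticket can close); with the
        # hypothetical option on, so does every other crossing.
        return True
    # Approximation of the behavior described in this thread: an ack covers
    # its own severity and everything below it until the alert clears.
    return acked_at is None or current > acked_at


# The alert was acked at critical, dropped to warning, and is now critical again.
print(should_notify(WARNING, CRITICAL, acked_at=CRITICAL))  # False
print(should_notify(WARNING, CRITICAL, acked_at=CRITICAL,
                    notify_on_every_crossing=True))         # True
```

With the option on, the ServiceNow ticket would receive an update for each severity change, while acknowledgement would still be available for recording ownership.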

Fair enough - we are redoing the alert escalation chain/rule flow (adding things like allowing alerts to flow through and match multiple rules) - so we'll be looking at a way to address this issue too.

Hi,

We haven't integrated LogicMonitor with ServiceNow, but we do face the same issue.

When the alert repeats a second time within a week, it shows as acknowledged from the previous occurrence.

As a result, the alert is never reported to the server teams for action.

image.png.0bc7c929e10bea36ca59cbcd79122240.png

Edited by Ashwini

An option to set an "acknowledged timeout" perhaps? For example, if an alert has been acknowledged for longer than X minutes/days/weeks, then unacknowledge it.
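
A minimal sketch of what such a timeout check might look like, assuming the acknowledgement time is stored with the alert. The function name, its parameters, and the dates are illustrative only; nothing here is an existing LogicMonitor feature.

```python
from datetime import datetime, timedelta
from typing import Optional


def ack_is_active(acked_on: Optional[datetime], now: datetime,
                  timeout: timedelta) -> bool:
    """Return True while an acknowledgement is still within its timeout."""
    if acked_on is None:
        return False  # never acknowledged
    return now - acked_on < timeout


acked_on = datetime(2024, 1, 1, 9, 0)   # made-up acknowledgement time
timeout = timedelta(days=3)             # hypothetical "acknowledged timeout"

print(ack_is_active(acked_on, datetime(2024, 1, 2, 9, 0), timeout))  # True: one day later
print(ack_is_active(acked_on, datetime(2024, 1, 8, 9, 0), timeout))  # False: a week later
```

Once the check returns False, the alert would be treated as unacknowledged, so a repeat critical a week later would notify again.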
