mnagel

event handlers


In Nagios, there is a concept of an event handler that can run to try to fix problems (e.g., restart a service, remove old files, etc.).  I see no similar capability in LM, and it is of course something customers want.  For example, I just deployed a custom DS for someone to check for too many files in a share, indicating a service problem.  Once I implemented that, the next question was "Can you restart the service when that goes into warning?"  I see no facility for this in LM, but perhaps a custom alert could be used to trigger the behavior.  If I used that approach, I would insert a custom HTTP alert early in the escalation chain to give the problem a chance to be corrected, and I would then have to create a secure REST API server to accept those alerts and trigger the correct behavior.  So in theory it could be done (if I am not missing something), but it feels like using a screwdriver to hammer in a nail.

Thanks,

Mark
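
For anyone who wants to try the custom HTTP alert route described in this post, here is a rough sketch of the receiving side in Python. Everything in it is an assumption for illustration: the `X-Token` header, the `check` and `host` payload fields, the `SHARED_TOKEN` value and the `ACTIONS` table are hypothetical names, not anything LM defines. The action only returns a string where a real handler would restart the service.

```python
import hmac
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical shared secret and action table; neither comes from LM.
SHARED_TOKEN = "change-me"
ACTIONS = {
    "too_many_files": lambda host: f"would restart service on {host}",
}


def handle_alert(token, payload):
    """Return (http_status, message) for one alert delivery."""
    # Constant-time comparison so the token cannot be guessed by timing.
    if not hmac.compare_digest(token.encode(), SHARED_TOKEN.encode()):
        return 403, "bad token"
    action = ACTIONS.get(payload.get("check"))
    if action is None:
        return 404, "no handler for this check"
    return 200, action(payload.get("host", "unknown"))


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:  # covers malformed JSON and bad encodings
            payload = None
        if isinstance(payload, dict):
            status, message = handle_alert(self.headers.get("X-Token", ""), payload)
        else:
            status, message = 400, "expected a JSON object"
        body = message.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8443):
    # Plain HTTP bound to localhost; put TLS in front before exposing it.
    HTTPServer(("127.0.0.1", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping the decision logic in `handle_alert` means it can be tested without opening a socket.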

Hi Mark - You're correct, on all fronts.  It's functionality we don't currently have, that we're asked for often, and that can be done with a custom datasource.

That all said, it's on our radar and something we're talking about incorporating next year.

How is this progressing?  I just had a client request we restart a service when it dies, which is a very normal thing to want, but we still have no way in LM to do it.  I told them I can probably whip up an API script to poll status and react, but obviously it would be much better to have support for these corrective actions integrated.  It is definitely one of the standard behaviors from FOSS tools like Nagios that we miss still!

Thanks,

Mark

 


BTW, it would be a lot easier to do the API script if the UI supported editing instance properties -- I saw one update from LM on that FR agreeing it was important, but still not possible :(.  I will proceed with my implementation using external data or structured device-level property labels, but it would be a lot cleaner to add an ILP to indicate desired behavior. 

Thanks,
Mark

 


I'm cross-posting this as I saw the link to this thread in the other post I made, so apologies for the duplicate:

Hello,

We're currently looking to evaluate LogicMonitor as a potential replacement for Microsoft System Center Operations Manager (SCOM) and prior to SCOM being our enterprise monitoring tool we had IBM Tivoli Monitoring (ITM) in place within our organization.

So we come from well over 10 years of being able to take corrective actions within the tools themselves in response to various alerts that are raised.  Based on our initial demo with the LogicMonitor team, we understand that's not a feature of the product, as they don't want to be in the config management business, which I understand.

However, we can't be the only organization that has this issue, so I'm curious how others have worked around it and whether anyone would be willing to share their solutions.

Here are some simple things we do today:

  1. Windows Service Restarts (in most cases we only alert if the corrective restart action fails)
  2. Linux Process Counts (we'll attempt to restart the process or execute some other type of scripted action)
  3. IIS Application Pool failures (we'll attempt to recycle the AppPool using built-in Windows functionality)

Appreciate the responses, thanks!

Alert processing happens outside the detection point (in "the cloud") -- there need to be triggers to an event handler that operate in the collector context.  One possibility would be to create datasources that don't actually collect data, but do the check and repair operation, with a datapoint as a side effect.  It would be easier if datasource code could cross-reference other datasource/instance datapoints without having to replicate the same API code into each (e.g., code library support), but it is feasible.  Triggers would be much cleaner.

39 minutes ago, mnagel said:

Alert processing happens outside the detection point (in "the cloud") -- there need to be triggers to an event handler that operate in the collector context.  

That makes complete sense, and it's something I initially overlooked.  In the context of a server, LM needs to push logic down to the collector (not sure what that process is called), the collector then needs to push it down to the server, and the results then go back up to the cloud via the collector for processing.

In the context of the "check" I'd like to see some type of functionality that would look something like this:

  • From the cloud, run a Windows service check through the collector down to the server
  • If the service has failed, execute a trigger (in the same workflow) to attempt to restart the service, and if the service restarts, do not alert, but send those results back up to the cloud for processing
  • If the restart fails, send the results back up to the cloud in the same way, but generate an alert.
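
The three steps above could be sketched as a single check-and-repair script that reports its outcome as a datapoint, along the lines of the "datasource with a side effect" idea earlier in the thread. This is only an illustration in Python under some assumptions: the state codes, the `service_state` datapoint name and the `key=value` output line are choices made here, and the commands passed in are whatever check and restart commands suit the target.

```python
import subprocess

# Illustrative state codes; an alert threshold would be set on FAILED only.
OK, REPAIRED, FAILED = 0, 1, 2


def run(cmd):
    """Run a command and report whether it exited with status 0."""
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except OSError:  # command missing or not executable
        return False


def check_and_repair(check_cmd, restart_cmd, runner=run):
    """Check once, try one restart on failure, then check again."""
    if runner(check_cmd):
        return OK
    runner(restart_cmd)
    return REPAIRED if runner(check_cmd) else FAILED


def datapoint_line(state):
    # Assumed output format: one key=value pair for the collector to parse.
    return f"service_state={state}"


# Example call (not executed here; "myservice" is a placeholder name):
# print(datapoint_line(check_and_repair(
#     ["systemctl", "is-active", "--quiet", "myservice"],
#     ["systemctl", "restart", "myservice"])))
```

With a threshold on the value 2 only, a successful restart is recorded (value 1) without raising an alert, and a failed restart alerts.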

I'm also curious how long this feature has been requested, since during our demo LM stated they had plenty of customers asking for this functionality.

1 hour ago, mnagel said:

One possibility would be to create datasources that don't actually collect data, but do the check and repair operation, with a datapoint as a side effect. 

Also, out of curiosity, is this something that you've actually tested and implemented, or just an idea?  Thank you for the feedback as well.


DataSource scripts are meant to execute quickly, so just be careful that your "task" logic is not doing too much work.  It may be better to put the task logic in another script that the DataSource script launches, so that the DataSource script can end right away.
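
As a rough illustration of that "launch and exit" pattern, here is a small Python sketch. It is not LM-specific: `launch_detached` is a hypothetical helper name, and the script being launched is whatever holds the slow task logic.

```python
import subprocess
import sys


def launch_detached(argv):
    """Start argv in the background and return its PID without waiting."""
    kwargs = dict(
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if sys.platform == "win32":
        # Detach from the parent's console and process group on Windows.
        kwargs["creationflags"] = (
            subprocess.DETACHED_PROCESS | subprocess.CREATE_NEW_PROCESS_GROUP
        )
    else:
        # New session, so the child is not tied to the parent's terminal.
        kwargs["start_new_session"] = True
    return subprocess.Popen(argv, **kwargs).pid
```

The calling script can print its datapoint and exit immediately after `launch_detached(...)` returns. The trade-off is that the result of the task is not known in that run and has to be picked up by a later check.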


I've created two more DataSources. One that restarts a Linux service and one for Windows that runs specified commands when it detects an alert.

Monitor a Linux Service and restart if needed:  http://blog.mikesuding.com/2018/12/27/service-restart-for-linux/

Generic run some PowerShell (Windows) commands or a script if alert happens:   http://blog.mikesuding.com/2018/12/03/automatic-action-triggered-by-an-alert/


Using PowerShell, you can issue the task as a job, which will then run in the background without the need for a return message.  This can keep the script "lighter".  I did this in SCOM to handle restarting the SCOM service where necessary.  I'd issue the job and the script would just move on.  I'm currently working on a script that lives on the collector as an external alert target and has a collection of scripts/tasks stored at the collector level.  The message being sent to the collector manager (or whatever I choose to call it... it'll be something witty) can be used to tell it which script you're really trying to run, and it runs it for you.  It'll take stored credentials that are passed and use CredSSP delegation to pass those credentials on to a target machine to run the needed task (I've posted the delegation code in another thread about this topic).

By the time I finish, LogicMonitor will probably have it included ;)
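
The "message tells the collector which script to run" idea from this post might look something like the following Python sketch. It is not the author's implementation, and it leaves out the credential delegation entirely. The directory, task names and script file names are hypothetical. The point is the allow-list: a task name from an alert message is only ever used as a lookup key, never as a path or a shell string.

```python
from pathlib import Path

# Hypothetical layout: every runnable task is a script inside TASK_DIR.
TASK_DIR = Path("/opt/collector-tasks")
TASKS = {
    "restart-spooler": "restart_spooler.ps1",
    "recycle-apppool": "recycle_apppool.ps1",
}


def resolve_task(name, task_dir=TASK_DIR, tasks=TASKS):
    """Map a task name from an alert message to a script path.

    Raises KeyError for any name that is not on the allow-list.
    """
    return task_dir / tasks[name]


def build_command(name, target_host):
    # The target is passed as a separate argument, not interpolated
    # into a command string, so it cannot inject extra commands.
    return ["pwsh", "-NoProfile", "-File", str(resolve_task(name)), target_host]
```

The resulting argument list could then be handed to something like `subprocess.Popen` so the dispatcher does not block while the task runs.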

