Datasource to monitor Windows Services/Processes automatically?


Question

Hello,

We recently cloned two out-of-the-box LogicMonitor datasources (WinService- and WinProcessStats-) in order to enable the 'Active Discovery' feature on them.
We did this because we need to discover services/processes automatically, since we don't have an exact list of which services/processes we should monitor (due to the number of clients [100+] and the different services/solutions across them).

After enabling this, it works fine and does what we expect (it discovers all the services/processes running on each box). We then added some filters to the Active Discovery for the services, in order to exclude common 'noisy' services and grab only the ones set to start automatically with the system.
Our problem arises when these 2 datasources start to impact the collector's performance: the huge number of WMI queries drives CPU usage to almost 100% all the time, which degrades the collector's performance and data collection (resulting in request timeouts and full WMI queues).
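
For readers unfamiliar with this kind of discovery filter, here is a minimal Python sketch of the logic described above (keep auto-start services, drop a 'noisy' exclusion list). It is only an illustration, not LogicMonitor's filter syntax: the sample services, the `NOISY_SERVICES` list and the function names are all hypothetical. The `StartMode` value `"Auto"` mirrors the `Win32_Service` property of the same name.

```python
# Hypothetical exclusion list; the real list of 'noisy' services is not given in the post.
NOISY_SERVICES = {"gupdate", "sppsvc", "RemoteRegistry"}


def keep_service(service, noisy=NOISY_SERVICES):
    """Return True if a discovered service should become a monitored instance."""
    # Keep only services that start automatically and are not on the exclusion list.
    return service["StartMode"] == "Auto" and service["Name"] not in noisy


def discover(services, noisy=NOISY_SERVICES):
    """Return the names of the services that pass the filter, in input order."""
    return [s["Name"] for s in services if keep_service(s, noisy)]


# Illustrative sample data only.
sample = [
    {"Name": "Spooler", "StartMode": "Auto"},
    {"Name": "gupdate", "StartMode": "Auto"},
    {"Name": "WSearch", "StartMode": "Manual"},
    {"Name": "W32Time", "StartMode": "Auto"},
]

print(discover(sample))  # ['Spooler', 'W32Time']
```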

We also thought about creating 2 datasources (services/processes) for each client (with filters to grab the critical/wanted processes/services for the client in question), but that's a nightmare (especially when clients install applications without any notice and expect us to automatically pick up and monitor them).

Example of 1 of our scenarios (1 of our clients):

- Collector is a Windows VM (VMware) with 8 GB of RAM and 4 allocated virtual processors (host processor is an Intel Xeon E5-2698 v3 @ 2.30 GHz)
- Currently, it monitors 78 Windows servers (not including the collector) and those 2 datasources are creating 12,700 instances (4,513 services | 8,187 processes) - examples below

[image] This results in approx. 15 requests per second

[image] This results in approx. 45 requests per second
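
As a rough sanity check on those rates, the load can be estimated as instances divided by the polling interval. The sketch below assumes one WMI request per instance per poll, that the first figure refers to services and the second to processes, and polling intervals of 300 s and 180 s. None of these are stated in the post; the intervals were simply picked because they reproduce the reported numbers.

```python
def requests_per_second(instances, interval_seconds):
    """Average request rate if every instance is polled once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return instances / interval_seconds


# Instance counts come from the post; the intervals are assumptions.
services_rate = requests_per_second(4513, 300)
processes_rate = requests_per_second(8187, 180)

print(round(services_rate), round(processes_rate))  # 15 45
```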

According to the collector capacity document (ref. Medium Collector) we are below the limits for WMI; however, those 2 datasources are contributing A LOT to filling the queues.
We're seeing errors on a regular basis - example below

[image]


To sum this up, we are looking for another way of doing the same thing without consuming so many resources on the collector (due to the number of simultaneous WMI queries). Not sure if that's possible, though.
Has anyone had this need in the past and been able to come up with a different, less resource-intensive solution?

We're struggling here mainly because we come from an agent-based solution (which didn't face this problem, since the load was distributed across the individual agents on each device).

Appreciate the help in advance!

Thanks,

  • 0
1 hour ago, Stuart Weenig said:

It's funny, but my lab Windows box is unlicensed, so I have about an hour to test stuff until it shuts off. Not a big deal to turn it back on, but it's making things slower. The first thing I did was convert PercentProcessorTime to a counter. That changed the values such that I'm now getting the delta between the current value and the previous poll's value, divided by the time between them. So, theoretically, the resulting number should be pretty close; it just needs to be adjusted to convert 100s of ns/s to unitless (%). That should mean just dividing the PercentProcessorTime (as a counter) by 10^7. Got that in now and I'm going to let it bake to see if the results make sense.

 

I've had problems attempting to do that in the past outside of LM but never really resolved it (kinda gave up at the time). I think you need to query the value twice over a set period and also take into account the number of logical cores the system has. But even doing that, I was intermittently getting weird results like 72025954.98% CPU. You'll find various discussions on Google about replicating Task Manager's per-process CPU % via WMI.

It would be great if there was a good solution to this though.

Edited by Mike Moniz
  • 0
2 minutes ago, Mike Moniz said:

 

I've had problems attempting to do that in the past outside of LM but never really resolved it (kinda gave up at the time). I think you need to query the value twice over a set period and also take into account the number of logical cores the system has. But even doing that, I was intermittently getting weird results like 72025954.98% CPU. You'll find various discussions on Google about replicating Task Manager's per-process CPU % via WMI.

It would be great if there was a good solution to this though.

 

Yeah, I've also been googling this and found a bunch of threads; however, I got kinda confused. I did some tests and the results were definitely not accurate.
Hoping @Stuart Weenig can make it work!

It would definitely be useful!

  • 0
  • Administrators
7 minutes ago, Mike Moniz said:

I think you need to query the value twice over a set period

This is exactly what changing the datapoint to type counter does. It takes the delta between the current polled value and the previously polled value and divides by the time between the two polls.

The results I'm getting are pretty low, like too small by a factor of 10 or 100.
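
To make the counter arithmetic concrete, here is a small Python sketch of what a counter-type datapoint computes, followed by the unit conversion. It assumes, as discussed in this thread, that the raw PercentProcessorTime value counts CPU time in 100 ns units (so 10,000,000 ticks per second). The function name and the sample raw values are made up for illustration. Note that dividing by 10^7 alone gives a fraction of one core rather than a percentage, which might be one reason the numbers look too small by a factor of 100.

```python
TICKS_PER_SECOND = 10_000_000  # number of 100 ns intervals in one second


def process_cpu_percent(prev_raw, curr_raw, elapsed_seconds):
    """Percent of ONE core used between two polls (100 = one fully busy core)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # This is the 'counter' behaviour: delta between polls divided by elapsed time.
    ticks_per_second_used = (curr_raw - prev_raw) / elapsed_seconds
    # Ticks per second -> fraction of a core -> percent of a core.
    return ticks_per_second_used / TICKS_PER_SECOND * 100


# Made-up raw samples taken 60 seconds apart.
percent = process_cpu_percent(1_000_000_000, 1_861_400_200, 60)
print(round(percent, 2))  # 143.57
```

The result can exceed 100 on a multi-core box, because the raw counter accumulates time across all cores the process runs on.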

  • 0
12 minutes ago, Stuart Weenig said:

So, this is what I ended up with (publishing to GitHub after a bit more data is gathered). Notice PercentProcessorTime is a counter and the formula on the complex dp.

[Screenshot: datapoint configuration]


Thanks a lot for sharing!

I've applied that to my datasource as well. 

Checking the values against a process that's currently consuming 30-36% of the CPU, it returns values like '143.5667' for ProcessCPUPercent.
I guess we need to divide that number by the number of CPU cores of the server in question.

In my case, the box I'm testing has 4 CPU cores, which results in a value of 35.89%.
That seems to reflect the actual usage of the process (as shown in Task Manager) 🙂
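
The arithmetic above, written out as a tiny Python sketch (the function name is hypothetical; in the datasource this would be a complex datapoint expression, and the core count would have to come from somewhere such as a device property):

```python
def normalize_by_cores(per_core_percent, logical_cores):
    """Scale a 'percent of one core' value to 'percent of the whole CPU'."""
    if logical_cores < 1:
        raise ValueError("logical_cores must be at least 1")
    return per_core_percent / logical_cores


print(round(normalize_by_cores(143.5667, 4), 2))  # 35.89
```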

  • 0
  • Administrators
Just now, Vitor Santos said:

I guess we need to divide that number by the number of CPU Cores for the server in question.

Yes, that needs to happen. Otherwise, this metric is very much like CPULoad in Linux boxes, where 100% = one fully loaded core. The thing is that you can't just divide by the number of cores, because you don't know for sure that the threads are split evenly across cores.  

  • 0

Another suggestion for the services check, although it deviates from the existing WinService- check: instead of 0=Not OK and 1=OK, set 0=OK and 1=Not OK. Several other LM datasources generally work this way, in that the larger the number, the more urgent the problem, so you can set thresholds like > 0 1 2. In this case it doesn't matter much since it's a binary option, but it also helps with widgets like table color bars, where < does not work very well, and with gauge widgets and others that also seem to assume this.

  • 0
1 minute ago, Stuart Weenig said:

Yes, that needs to happen. Otherwise, this metric is very much like CPULoad in Linux boxes, where 100% = one fully loaded core. The thing is that you can't just divide by the number of cores, because you don't know for sure that the threads are split evenly across cores.  


Oh, I see.
Well, I guess that despite that, dividing by the number of cores will reflect the process's usage of the whole CPU anyway, right?
Even if it's not split evenly, that calculation will reflect the actual process usage across the CPU as a whole. Maybe I'm confused.
 

  • 0
  • Administrators
On 4/17/2020 at 2:00 PM, Mike Moniz said:

set 0=OK and 1=Not OK

This is exactly what I attempted to do, but I took it a step further. The WMI collector polls the string contained in status/state and checks for the presence of one good value. Since this version is Groovy, we can enumerate like this:

//Enumerations obtained from https://docs.microsoft.com/en-us/windows/win32/cimwin32prov/win32-service and put in my preferred order so that they are in ascending severity. This allows thresholds for >= 1 3 5.
stateEnum = [
  "Running",
  "Start Pending",
  "Stop Pending",
  "Continue Pending",
  "Pause Pending",
  "Stopped",
  "Paused",
  "Unknown"
]

statusEnum = [
  "OK",
  "Error",
  "Degraded",
  "Unknown",
  "Pred Fail",
  "Starting",
  "Stopping",
  "Service",
  "Stressed",
  "NonRecover",
  "No Contact",
  "Lost Comm"
]

I took all the possible values and tried to sort them in ascending order of severity, so that 0 is the optimum state and the higher the number, the more critical the problem. If you have suggestions for a better order, I'm completely open to reordering this.
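
For anyone who wants to see how such an ordered list turns into a numeric datapoint, here is a minimal sketch in Python (for illustration only; the datasource itself is Groovy). The list is copied from the state enumeration above. The function names, the fallback to "Unknown" for unrecognized strings, and the reading of ">= 1 3 5" as warning/error/critical levels are assumptions, not something taken from the published datasource.

```python
# Same order as the state enumeration in the post: index 0 is the healthy state.
STATE_ENUM = [
    "Running",
    "Start Pending",
    "Stop Pending",
    "Continue Pending",
    "Pause Pending",
    "Stopped",
    "Paused",
    "Unknown",
]


def state_to_number(state):
    """Map a state string to its position in the severity-ordered list."""
    try:
        return STATE_ENUM.index(state)
    except ValueError:
        # Assumption: treat any unrecognized string as "Unknown".
        return STATE_ENUM.index("Unknown")


def severity(value, warn=1, error=3, critical=5):
    """Evaluate a '>= warn error critical' style threshold (assumed semantics)."""
    if value >= critical:
        return "critical"
    if value >= error:
        return "error"
    if value >= warn:
        return "warning"
    return "ok"


for state in ("Running", "Start Pending", "Continue Pending", "Stopped"):
    print(state, state_to_number(state), severity(state_to_number(state)))
```

With this ordering, "Running" evaluates to ok, the first two pending states to warning, the next two to error, and "Stopped" or worse to critical.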

  • 0

Thanks Mike, Vitor & Stuart for this datasource. Much appreciated.

I notice there is a normal ProcessID datapoint which doesn't have any threshold set. We can alert if there is no data available, which can mean the process is not running, but if we want to set up an alert trigger interval, what would the alert threshold be?

Can we configure something like below as a complex datapoint? 

On 4/21/2020 at 11:53 PM, Stuart Weenig said:

set 0=OK and 1=Not OK

 
