• 0

Datasource to monitor Windows Services/Processes automatically?


Question

Hello,

We recently cloned two LogicMonitor out-of-the-box datasources (WinService- and WinProcessStats-) in order to enable Active Discovery on them.
We did this because we need to discover services/processes automatically, since we don't have an exact list of which services/processes we should monitor (due to the number of clients [100+] and the different services/solutions across them).

After enabling this, it works fine and does what we expect (discovers all the services/processes running on each box). We then added some filters to Active Discovery for the services, to exclude common 'noisy' services and pick up only the ones set to start automatically with the system.
The problem is that these two datasources impact collector performance: the huge number of WMI queries pushes CPU usage to almost 100% all the time, which degrades the collector's performance and data collection (resulting in request timeouts and full WMI queues).

We also thought about creating two datasources (services/processes) for each client (with filters to pick up the critical/wanted processes/services for that client), but that's a nightmare (especially when clients install applications without any notice and expect us to automatically pick up and monitor them).

Example of 1 of our scenarios (1 of our clients):

- The collector is a Windows VM (VMware) with 8 GB of RAM and 4 virtual processors (host processor is an Intel Xeon E5-2698 v3 @ 2.30 GHz)
- Currently it monitors 78 Windows servers (not including the collector), and those two datasources create 12,700 instances (4,513 services | 8,187 processes); examples below

[Screenshot] This results in approx. 15 requests per second

[Screenshot] This results in approx. 45 requests per second

According to the collector capacity document (ref. Medium Collector) we are below the limits for WMI; however, those two datasources contribute a lot to filling the queues.
We're seeing errors on a regular basis; example below:

[Screenshot of the errors]


To sum up, we're looking for another way of doing the same thing without consuming so many resources on the collector (due to the number of simultaneous WMI queries). Not sure if that's possible, though.
Has anyone had this need in the past and come up with a different, less resource-intensive solution?

We're struggling here mainly because we come from an agent-based solution (which didn't face this problem, since the load was distributed across the individual agents on each device).

Appreciate the help in advance!

Thanks,


  • 1

From my understanding, the native WMI-based checks make a new WMI call for each instance, so one WMI call for each Windows service and process, which is why you see 12k of them. A lot of check types work that way, but there is one option that lets you make one WMI call per device (if you can get all the data in one call) and extract the data in bulk for all instances at once: BATCHSCRIPT. I'm not sure it would completely solve your situation, but if you switch from native WMI to something like a PowerShell or Groovy BatchScript, you can send one WMI query to the server and get data for all services/processes at once. Scripts do cause more load on the collector than most native checks, but roughly 150 script instances (about 75 devices × 2) are likely less load than 12k WMI instances. Actually, I think the collector does WMI queries via PowerShell anyway (not 100% sure about that), so that's even less of a concern.

You can still keep the old WMI AD method and just move Collector Attributes to use batchscript.
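To illustrate the bulk idea above, here is a minimal sketch in Python (not a LogicMonitor script). It assumes the collection script has already received all rows from one WMI query per device and only needs to print one line per instance and datapoint. The `wildvalue.datapoint=value` line shape is how I understand the BATCHSCRIPT output convention; the sample rows, the `Running` datapoint name, and the function name are made up for illustration.

```python
# Hypothetical rows standing in for the result of ONE WMI query per device
# (for example a query against Win32_Service). Values are invented.
rows = [
    {"Name": "Spooler", "State": "Running", "ProcessId": 2104},
    {"Name": "W32Time", "State": "Stopped", "ProcessId": 0},
]


def to_batchscript_lines(rows):
    """Turn one bulk result into 'wildvalue.datapoint=value' lines."""
    lines = []
    for row in rows:
        wildvalue = row["Name"]  # the instance identifier from Active Discovery
        running = 1 if row["State"] == "Running" else 0
        lines.append(f"{wildvalue}.Running={running}")
        lines.append(f"{wildvalue}.ProcessId={row['ProcessId']}")
    return lines


for line in to_batchscript_lines(rows):
    print(line)
```

The point is only that one query result feeds every instance on the device; the actual script would be written in Groovy or PowerShell on the collector.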

  • 1
  • Administrators

Yep: https://github.com/sweenig/monitoring-recipes/blob/master/DataSources/Groovy/WMI/WMI_Query.groovy

I'm trying to find a definitive answer, but I think the restriction is that PowerShell-based datasources can only run on Windows collectors. The product team built a WMI class for Groovy that's been included in the collector for several versions now. So: WMI in Groovy on Linux, yes. WMI in Groovy on Windows, yes. WMI in PowerShell on Linux, no (because Linux can't interpret the PowerShell scripts). Don't quote me on this one; I'm still looking for definitive proof.

  • 0
  • Administrators

Take a look at these two datasources: https://github.com/sweenig/lmcommunity/tree/master/ProcessMonitoring. I just realized I haven't done any documentation on that part of the repo, so give me a few minutes and I'll commit some instructions.

They don't change how many resources are used to monitor a process, but they do address what I think you were referring to in the nightmare scenario above. They let you specify an include filter and an exclude filter as properties at the device level or the group level. So you can just provide a regex to dictate what you want to include and what you want to exclude in Active Discovery.
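As a rough sketch of the include/exclude idea described above (in Python, not the actual datasource code from the repo): discovery keeps a name only if it matches the include regex and does not match the exclude regex. The function name, the parameter names, and the sample service names are hypothetical; the real property names are defined by the datasources in the linked repo.

```python
import re


def discover(names, include=".*", exclude=None):
    """Keep names that match `include` and do not match `exclude`."""
    inc = re.compile(include)
    exc = re.compile(exclude) if exclude else None
    return [
        name
        for name in names
        if inc.search(name) and not (exc and exc.search(name))
    ]


# Hypothetical service names on one device.
services = ["Spooler", "W32Time", "MSSQLSERVER", "gupdate"]

# Exclude-only filter: drop the known noisy services, keep everything else.
print(discover(services, exclude=r"^(gupdate|W32Time)$"))

# Include-only filter: keep just what this client cares about.
print(discover(services, include=r"SQL|Spooler"))
```

Setting the two regexes as device-level or group-level properties means one datasource can serve every client, with only the property values differing.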

  • 0

Hello @Stuart Weenig,

If I understood correctly, I think we already did that, with a custom property holding a regex to filter out 'noisy' services (at the group/device level). That does help with creating exceptions.
However, it still leaves us with the same issue, for the reasons I mentioned above.

  • 0
  • Administrators

Seems like what it comes down to is that you are trying to monitor more than the current collector resources can handle. There are really only two options: reduce collection (stricter filters) or increase the collector size. You're already excluding manual and disabled services, right?

  • 0
  • Administrators

Ooo, that's a great point. Yes, the groovy based WMI query could do a single call and grab all the data. You'd have to parse through it and print each bit out, but that's easily doable with a for loop.

And keeping discovery using the WMI method is an excellent option.

  • 0
3 minutes ago, Stuart Weenig said:

Yes, the groovy based WMI query could do a single call and grab all the data.

Can you do WMI using groovy? I looked around before but didn't find an example. I try to use groovy when I can, although WMI is only supported on Windows collectors anyway.

Edited by Mike Moniz
  • 0

(Sorry for semi-hijacking the thread)

Interesting and thanks! Do you have documentation on the WMI class? For example how would you pass in wmi.user and wmi.pass, which I would expect would be needed under linux or if you need to use other creds on a windows box.

  • 0

@Stuart Weenig, some background: what we (Vitor and I work at the same MSP) are trying to replicate in LogicMonitor is what UIM's ntservices and processes probes can do. We are migrating to LM and want to replicate our existing UIM functionality. We configured the ntservices probe so that, once deployed with our custom package, it automatically monitors all services set to Automatic, minus the known excluded services that no one cares about. The same goes for the processes probe: we monitor all processes on a box and alert on any individual process that stays above, say, 95% CPU for more than 15 minutes. Both probes (ntservices and processes) ran locally on each resource. Using the batchscript method now seems to be helping with overall resource usage on each collector.

  • 0
  • Administrators

Yes, using batchscript will definitely improve your performance since it should do one fetch per device, instead of one fetch per service.
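To put rough numbers on the one-fetch-per-device point, here is a small Python calculation using the figures from the original post (78 servers, 4,513 service instances, 8,187 process instances). It assumes one collection task per instance for the native WMI datasources and one task per device per datasource for BATCHSCRIPT, and it ignores the fact that each bulk query returns more data. The variable names are just for illustration.

```python
# Figures taken from the original post.
devices = 78
service_instances = 4513
process_instances = 8187

# Native WMI collection: one task per discovered instance.
per_instance_tasks = service_instances + process_instances

# BATCHSCRIPT collection: one task per device for each of the two datasources.
per_device_tasks = devices * 2

reduction = per_instance_tasks / per_device_tasks

print(f"per-instance tasks per cycle: {per_instance_tasks}")
print(f"per-device tasks per cycle:   {per_device_tasks}")
print(f"roughly {reduction:.0f}x fewer collection tasks")
```

This says nothing about how heavy each task is, only how many of them the collector has to schedule per polling cycle.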

So, your discovery filter is set to include only services where the start mode is "Automatic". And you've got another discovery filter that excludes the services that no one cares about. If that list is the same and never changes, no need to put it in a property, it's just over-complicating things.

So, it's working?  Did you make a version for the process monitoring?

  • 0
Hello guys,

Thanks a lot for the feedback & ideas @Mike Moniz & @Stuart Weenig!

We really liked Mike's suggestion of using a script to poll the data in bulk, per device (instead of one query per service/process instance).
After reading some documentation and working out how to use WMI in Groovy (with the help of some out-of-the-box datasources), we ended up rebuilding those two datasources (services and processes) in Groovy, making use of the WMI class.

They collect exactly the same metrics as the out-of-the-box ones, and after 2-3 hours in production they have clearly reduced the collector's CPU/memory usage (we used one of our clients as a pilot).

Previously CPU usage was constantly above 98%; now the collector monitors the same number of services/processes at 60-70% CPU.

It increased the number of Groovy instances significantly; examples below:

[Screenshot]

[Screenshot]

However, this doesn't seem to affect the collector resources usage that much.

We're also no longer seeing the WMI timeouts on the WMI instance runs. We guess this is really making a difference.
I know it's too soon to say this solved it, but by judging how the resource usage looks, this seems to make a big difference.

Also, the list of services changes, that's why we ended up using a property.

I'll share both datasources with you guys in a few minutes (I'll just attach the .XML file(s) to this thread) in case this is useful for anyone.

Regards,

  • 0

Not at all, feel free to use them.
I was thinking of submitting a thread in LM Exchange containing those two datasources, but I didn't want to create duplicate threads (like I did a couple of days ago 😟).

Let me know whether we should submit a post in LM Exchange containing those two datasources for future reference, or whether this thread is enough for sharing purposes.

Regards,

  • 0
  • Administrators

This one is enough. The Exchange area of the community is for asking questions about stuff already in the Exchange. Give it some time, though. Once the new Exchange is rolled out, you'll be able to publish it yourself and maintain it on our Exchange with version control and everything. I'm cleaning it up a bit before I put it into my repo. Once I do, I'll post here and you can test it to make sure it works the same. There is some grooviness I can add to your code to make it simpler.

  • 0
  • Administrators

No worries at all. I thought Groovy was a made up language when I got my LM interview exam. 😉

I've groovified your services DS. It's here. I'll work on the process one tomorrow. You should be able to see the ad.groovy and collect.groovy scripts right there in the repo, so you can take a look at how I simplified it. Unfortunately, with Groovy I can write condensed, simpler code that is harder to read (Python, for example, won't really let you make something so condensed that it's unreadable).

  • 0

As a thought, if you add a new datapoint "ProcessId" which outputs the process id of the Windows Service (which is a number), you can then use the delta threshold to cause an alert if the service restarts between checks. Just as an option if it's ever needed at the instance level.
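A minimal Python sketch of the logic behind that idea, assuming the datapoint simply records the service's PID at each poll: any non-zero difference between consecutive polls means the PID changed, which is what a delta threshold would alert on. The function name and the sample PIDs are hypothetical.

```python
def restart_polls(pids):
    """Return the poll indexes at which the PID differs from the previous poll."""
    return [i for i in range(1, len(pids)) if pids[i] != pids[i - 1]]


# Hypothetical ProcessId values for one service across four polls.
samples = [2104, 2104, 5320, 5320]

print(restart_polls(samples))
```

A restart that happens between two polls is still caught, because the new process gets a different PID even if the service is running at both poll times.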

  • 0
  • Administrators

I'm thinking some of the datapoints on the process DS need to be adjusted. PercentProcessorTime, for example, is not an actual percentage. It's measured in 100s of nanoseconds since the process started, which means it's more like a counter than a gauge. I'm going to do some testing.

  • 0
1 minute ago, Stuart Weenig said:

I'm thinking some of the datapoints on the process DS needs to be adjusted. The PercentProcessorTime, for example, is not an actual percent. It's measured in 100s of nanoseconds since the process started. Which means it's more like a counter than a gauge. I'm going to do some testing.


Hey Stuart!

I noticed that this morning and was going to mention it shortly. Glad you noticed it too, because I was about to post in the community asking for help.
We're looking to get each process's actual CPU usage (as shown in Task Manager, for example, relative to the whole CPU).

It would be great if you could figure out what calculation is required to do that.

  • 0
  • Administrators

It's funny, but my lab Windows box is unlicensed, so I have about an hour to test stuff until it shuts off. Not a big deal to turn it back on, but it's making things slower. The first thing I did was convert PercentProcessorTime to a counter. That changed the values such that I'm now getting the delta between the current value and the previous poll's value, divided by the time between them. So, theoretically, the resulting number should be pretty close; it just needs to be adjusted to convert 100s of ns per second to a unitless value (%). That should mean just dividing PercentProcessorTime (as a counter) by 10^7. Got that in now and I'm going to let it bake to see if the results make sense.
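The arithmetic described above can be sketched in Python as follows. One second contains 10^7 units of 100 ns, so the per-second rate divided by 10^7 is a fraction where 1.0 means one fully busy core; multiplying by 100 turns that into a percentage. Dividing by the number of logical CPUs to get a Task Manager-style "share of the whole machine" figure is my assumption, not something confirmed in this thread, and the function and parameter names are hypothetical.

```python
TICKS_PER_SECOND = 10_000_000  # number of 100 ns units in one second


def cpu_percent(prev_ticks, curr_ticks, seconds, logical_cpus=1):
    """Convert two raw PercentProcessorTime samples into a CPU percentage.

    prev_ticks / curr_ticks: raw counter values in 100 ns units.
    seconds: time between the two polls.
    logical_cpus: divide by this to express usage relative to the whole
    machine (assumption; use 1 for "percent of one core").
    """
    rate = (curr_ticks - prev_ticks) / seconds       # 100 ns units per second
    fraction_of_one_core = rate / TICKS_PER_SECOND   # 1.0 == one busy core
    return fraction_of_one_core * 100 / logical_cpus


# A process that accumulated 30 s of CPU time over a 60 s polling interval.
print(cpu_percent(0, 300_000_000, 60))                  # percent of one core
print(cpu_percent(0, 300_000_000, 60, logical_cpus=4))  # percent of a 4-vCPU box
```

In LogicMonitor terms, the counter datapoint type already produces the per-second rate, so only the division (and optional scaling) would be left for a complex datapoint.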

