Everything posted by mnagel

  1. I really struggled with the title there since it is hard to explain in a few words. Here is a scenario I hope explains the problem: * Windows server in a locked down segment, local collector used on-server * Same server presents various protocols to the outside world (e.g., web, ftp, sftp, etc.). We want to monitor any/all of these from outside the server to assure clients are able to reach those protocols. Achieving this today is virtually impossible. You can set up service checks for just ping and web, but not other protocols. Even then, services applied to an externa
  2. This would be a fairly simple embedded Groovy datasource. This is a good starting point for calculating directory size: My recommendation is to define a custom property with the target path and use Groovy to extract that at the script initialization, for example: def hostname = hostProps.get("system.hostname"); def filepath = hostProps.get("dirsizecheck.filepath"); You can then use "dirsizecheck.filepath" as your "Applies To" to automatically run this DS for any hosts that have that defined. Not appropria
  3. We have the same challenge -- what I have been told is to clone the underlying DS and effectively partition the "Applies To" across the two DSes. On the one that needs to be limited, more filters can be added. I do understand this concept, but I continue to be concerned this methodology will lead to the equivalent of spaghetti programming -- many clones with no relation and no way to ensure that common features are maintained across the clone group. I proposed a "linked clone"
  4. I am not quite sure how to express this in the UI, but there needs to be a way to indicate that for certain device types, a minimum set of datasources is required to be associated. We just had a case where the WinExchangeMBGeneral datasource (32 datapoints, some critical) just didn't get associated and we only found out after a major event. The active discovery stuff is very nice, but it is a "fail open" sort of thing where if AD does not successfully operate, you just have no way to know without heavy investigation. My idea is that once a device has been identified as a particular type (l
  5. Yep, agreed! And reporting on alerts that were sent and to whom. Clients for some crazy reason want to know stuff like that during incident post-mortems :). Mark
  6. In Nagios, there is a concept of an event handler that can run to try to fix problems (e.g., restart a service, remove old files, etc.). I see no similar capability in LM and it is of course something customers want to see happen. For example, I just deployed a custom DS for someone to check for too many files in a share, indicating a service problem. Once I implemented that, the next question was "Can you restart the service when that goes into warning?" I see no facility for this in LM, but perhaps a custom alert could be used to trigger the behavior. If I used that approach, I would in
  7. Currently it is possible to control discovery of instances for a datasource with filters. It is not, however, possible to avoid data collection in a similar manner. Adding a collection filter (bypass?) capability could greatly improve performance, especially considering cases like interfaces, where 25+ datapoints are collected. In that case, a collection filter might indicate "primary" datapoints always collected, and then depending on their state, skip the rest (e.g., ifAdminStatus == down or ifOperStatus == down). I know this makes it more complex, which is not good, but a generic facili
  8. Please make it possible to search by fields other than name and keywords. For example, I would like to search for all PowerShell datasources so I can have a few working examples within easy reach. Not sure which fields should be in scope, but if in doubt, I would say all of them! Thanks, Mark
  9. I personally do not see how you could do this within the LM framework directly. What I would try is to set up the constant traceroutes on the collectors (or nearby) via external tools, and publish that data in a machine-readable format (e.g., JSON). You can then use a BatchScript to acquire that data and display it, but even then, the intermediate hops are likely to change over time and I don't know how that would end up looking as instance counts increase and decrease (and change descriptions) -- maybe someone more familiar with that has some ideas. Related to this, we have a tool we d
  10. I have noticed that a common pattern with LM is to clone a datasource and tweak it, mainly for the Applies To. Each time this happens, if an underlying change is found to be needed, this must now be replicated into every clone, offering many chances for semantic divergence over time. How about adding the idea of a linked clone, so that all clone fields are tied to the original DS until overridden? The applies to can be overridden by default, as is done now, but then you can benefit from changes to the original DS in the clones except for any specific changes made to them via override.
  11. Please allow reports to be scheduled down to an hourly level (e.g., checkboxes for different hours in a day) -- daily is not sufficient for some applications. I can work around this by creating N copies of the same report scheduled daily at different times, but that leaves a lot of opportunity for definitions to slip out of sync. Thanks, Mark
  12. Sadly I have no answer, but I agree we need more in-depth documentation on LMS. I am just trying to figure out how parameters sent to a function can be interpolated into a string, and I am forced to guess since no examples exist and the docs are silent on this. Is it '+' or '_' to concatenate strings? A full syntax guide would be welcome!
  13. Regarding this statement: "3) We intentionally removed the option to set a Level filter to Informational". You may or may not be aware that Windows logs some important issues as Informational events. See event ID Microsoft-Windows-Security-Auditing/4740 for one example, and this (to me) is sufficient to counter this design decision. OTOH, that one example may be better handled with a script eventsource, but the point is that it should be possible to do what is needed in the normal eventsource since Windows is not smart about how it handles some events. It should be hard to enable
  14. The current DS is nice, but is limited. It would be very nice to support SNI and alternate IPs (as mentioned above). This requires that we provide a list of certificates to check for, of course, but it should be possible. I suppose this can be done via properties; I just need to play with it a bit in Groovy. Multiline properties would make that easier to manage, I imagine. I have also noted that for graphs used in dashboards, we want to be able to select the "Bottom 10", not the "Top 10". Worked around it by limiting the display to a max of 90 days, but still awkward. Being able to specify th
  15. I agree -- presence of explicit WMI credentials should mean those are used instead of the collector service account. Worst case, there should be a property to indicate that, but I think that should be the default behavior.
  16. I searched for this today because I found the above method is not sufficient. Why? Because the collector is required to be in the domain (generally) and with Windows, time flows through the domain from the PDC emulator. Does this check work? Yes, it tells you if the collector is skewed from the monitored server. Does it tell you if the time is correct? Nope. If the PDC emulator is not synced, this check will happily tell you your offset is low/zero. What is really needed is a way to check time against an independent source. We ran into this yesterday when we were asked why a server s
  17. Correct -- I was surprised by the warning on the instance group instructions that basically said this was not the normal behavior. I could potentially see a need to go the other way, but then overriding the default group threshold at the instance level would take care of those exceptions. Please please please fix this! It is impossible to maintain standards for thresholds without some way to bind them to groups. This leads to clients becoming angry when they add new volumes that should be handled in a uniform way and a lot of manual effort. It is not even possible to see how the thres
  18. I found yesterday that LM explicitly does not support defining a threshold for an instance group so all instances within automatically inherit from that unless overridden at the instance level. Compared to the way it actually works, inheriting would be far preferable. As it stands, you must remember to update the thresholds every time you add an item and you must remember what threshold you used for the group. I assume this is simply because there is no storage associated with that setting on the group level. I have seen other topics related to using instance groups for alert routing, but
  19. There are quite a few datasources that require direct access to ESXi hosts when the data is readily available in vCenter. This is painful since establishing permissions to hosts in most environments is tricky at best (confirmed after extended testing done yesterday) and the documentation provided is not complete. The problem is if you just point current DS definitions to vCenter (e.g., hardware), data is acquired, but the DS has no per-host data, it just grabs the data for one host and displays it in that subtree.
  20. Yes, please allow for filtering of report elements as described above (using same sort of filter as in Active Discovery perhaps?). Another related idea is to be able to take a dashboard and send it as a report -- if you do all the hard layout work there, sure is a shame not to be able to leverage that in the email rollup reports clients request! Thanks, Mark
  21. Please enhance reports that can include details on datapoints to permit filtering, for example, based on expressions (e.g., WinVolumeUsage- > 70%) or any other method that might make sense there (something akin to the way filtering is done in Active Discovery). Top 10 is a decent placeholder for now, but general filtering would be very helpful.
  22. Please provide a method to browse what alerts have been sent and by which alert rule if possible. When a client asks why they received an alert (or if one was sent) it is critical to be able to look up this information.
  23. I was a bit surprised yesterday to find that there is no support (other than basic SNMP interfaces, etc.) for HPE Comware switches, like the HP 5900AF datacenter switch. This seems like a big gap in coverage for one of the more common switch vendors on the platform that is by all accounts the future (Procurve is still a thing, but HPE is more and more moving toward Comware). Please add support for HPE Comware switches similar to what we get on the Procurve side! This would include FRU, sensors, clustering (IRF in the 5900AF, for example) and so on.
  24. Please add support for Riverbed Steelhead appliances. Pointer @ wrt available SNMP OIDs. Thanks, Mark
  25. Several related ideas: * please make it so Web Service checks allow detection of expiring SSL certificates, preferably via a parameter in the alert tuning, at least 30 days by default. * please adjust the SSL_Certs datasource to also check validity. The current script is in a JAR file, so it is hard to see how to adjust on my end -- it may already have what is needed. We had a cert loaded yesterday with a broken chain, which made it invalid, but the DS happily reported it was expiring in 1110 days during that period. * same as above, for F5 VIP certificates, if possible (not clear t
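
To make the request in the last item concrete, here is a rough sketch of a check that reports chain validity and days to expiry together, so a broken chain can never show up as "expiring in 1110 days". This is plain Python using only the standard library, not the SSL_Certs datasource and not LM Groovy; the function names `days_until_expiry` and `check_certificate` and the `server_name` parameter are made up for illustration.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date (negative if already past)."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def check_certificate(host, port=443, server_name=None, timeout=10.0):
    """Return (chain_ok, days_left, error) for the certificate served at host:port.

    server_name is the name sent via SNI and checked against the certificate;
    it defaults to host. Both names are hypothetical, chosen for this sketch.
    """
    # The default context verifies the chain against the system trust store
    # and checks the hostname, so an invalid chain fails the handshake.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=server_name or host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError as exc:
        # Validation failed: report the reason instead of an expiry countdown.
        return False, None, exc.verify_message
    return True, days_until_expiry(cert["notAfter"]), None
```

Calling `check_certificate("www.example.com")` (any reachable HTTPS host) would return the validity flag first, and an alert threshold such as "fewer than 30 days left" could then be applied to the second value. Since verification also rejects expired certificates, the days-left value is only returned for certificates that validated; connection errors other than validation failures are left to propagate.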