Everything posted by mnagel

  1. mnagel

    SDT "groups"

    We have clients who have planned maintenance at specific locations requiring SDT. This should be easy, but in practice it is very error-prone because of the manual effort required to think through everything impacted by a "site" outage. Each type of element requires a different approach to setting downtime. For example, we recently had a location maintenance notice and had to think through all of the following:

    * resources at the location (reasonably easy due to using location-based device groups, even though using those completely breaks the RBAC security model)
    * websites used to monitor that location (internal and external)
    * collectors at the location
    * services (we are not using the new service feature yet at this client, but that would be yet another SDT requirement)

    It would be far simpler and less error-prone if all of these could be scheduled for downtime at one time based on the fact that they are impacted by the location maintenance. I was thinking one option would be an SDT group that could contain all element types, then that group could easily be scheduled in one fell swoop.
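To make the SDT group idea concrete, here is a minimal sketch of what such a container might look like. It is not an existing LogicMonitor feature or API: the `SdtGroup` class, the element type labels, and the request dictionaries are all hypothetical, and the member names are made-up examples.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Element types named in the post. These labels are illustrative only,
# not LogicMonitor API identifiers.
ELEMENT_TYPES = ("resource_group", "website", "collector", "service")


@dataclass
class SdtGroup:
    """Hypothetical container for everything affected by one site outage."""
    name: str
    members: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, element_type: str, element_name: str) -> None:
        # Reject types the group does not know how to schedule.
        if element_type not in ELEMENT_TYPES:
            raise ValueError(f"unknown element type: {element_type}")
        self.members.setdefault(element_type, []).append(element_name)

    def sdt_requests(self, start: str, end: str, comment: str) -> List[dict]:
        """Expand one maintenance window into one request per member."""
        return [
            {"type": etype, "target": target,
             "start": start, "end": end, "comment": comment}
            for etype in ELEMENT_TYPES
            for target in self.members.get(etype, [])
        ]


# Made-up members for one logical site.
site = SdtGroup("Example Site")
site.add("resource_group", "Locations/Example Site")
site.add("website", "Example Site firewall external check")
site.add("website", "Example Site cross-site ping")
site.add("collector", "example-collector-01")

planned = site.sdt_requests("2024-01-06T22:00", "2024-01-07T02:00",
                            "planned maintenance")
```

The point of the design is that the person scheduling downtime names the site once, and the expansion into per-type SDT entries is done by the computer rather than from memory.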
  2. mnagel

    SDT "groups"

    Any thoughts on this? I really hate telling clients stuff like... "ok, here's what you do -- first navigate to resources and add SDT for all your hosts at the site. now, also go to the firewall external check and maybe website external checks and add SDT there. and also go to the internal cross-site ping checks and add SDT there. oh yeah and you also need to add SDT to the collector" and such. Very clunky and error-prone. I should be able to set up an SDT group spanning all monitoring types representing a logical unit (like a site) and tell them to just set SDT for that logical unit. Again, constantly having to perform manual tasks like this creates pain and opportunities for error -- please make the computer do the job it excels at for us.
  3. mnagel

    resource timezones

    When will the resource timezones be implemented? Per-user timezones definitely helped, but then development just seems to have stopped. There are many reasons this is needed, for example, if an alert threshold or an alert rule has a time range, it should apply to the timezone of the resource (or at least be able to). Chart display relative to the resource timezones is another use case.
  4. mnagel

    Dashboard Templates

    This same problem exists in much of LM -- encouraging cloning with lack of an inheritance feature is the root cause. I agree this is needed as it is needed for LogicModules and pretty much anything in the system.
  5. If a step fails in a website check, the step description should be produced in the alert. I am very tired of fighting with the system to get it to do the correct/obvious thing and my clients find it ridiculous to have to dig around to know what is actually happening. Please make the computer do the work so we don't have to.
  6. The real problem with all of this is lack of full template support (with conditionals and other logic structures found in Jinja2 for Python or any of the various Groovy engines). Since everything is static with value substitution, you get stuck very fast trying to generate useful results. The per-datapoint alert template is a decent workaround for datasources, but there is nothing similar for any other logicmodule, unfortunately. We solved a lot of it by sending all the tokens via an email integration where they are unwrapped and processed before generating a ticket with the result, but you need tokens to make it work. And we get stuck due to some odd design decisions (e.g., acks are not sent via the custom email integration...because). I once was asked to include a documentation note for a client in their alerts (only theirs), but there is no way to do this short of editing every template, including datapoint templates. This is what support told me to do -- for real. You can't use a custom email integration because you lose the ability to respond to ACK/SDT and you don't get certain messages, like ACKs.
  7. I am not sure exactly how to describe this other than by example. We created an API-based method a while back to control alerting on interfaces based on the interface description. This arose because LM discovered interfaces that would come and go (e.g., laptop ports), and then would alarm about the port being down. With our change, those ports are labeled with a string that we examine to enable or disable alerting. The fly in the ointment is that if an up and monitored port went down due to some change, our clients think they should be able to change the description to influence behavior. Which they should. Unfortunately, because LM will not update the instance description due to the AD filter, the down condition is stuck until either the description is manually changed in LM or until the instance is manually removed in LM. Manual either way, very annoying. My proposal is that there should be a way to update the instance description even if the AD filter triggers. Or a second AD filter for updates to existing instances. I am sure there are gotchas here and perhaps a better way exists. I considered using a propertysource, but I don't think that applies here. The only other option is a fake DS using the API to refresh the descriptions, but then you have to replicate the behavior of many different datasources for interfaces.
  8. The more I think about it the more I like the idea of a checkbox in the datasource to bypass filtering for existing instances -- this would cleanly solve this and other problems.
  9. mnagel

    Ookla Speedtest

    After a few failed attempts to get this working on Windows via Powershell (works, but too inaccurate), I punted and used speedtest-cli. If I can replicate that into Groovy, then perhaps it could be universal, but Linux-only for now: 4CN9AA. This will be held for code review, but it is very simple code :).
  10. mnagel

    Ookla Speedtest

    @Cole McDonald Sure, no problem -- see below. The Ookla one is not much better in reality because the web method is not as accurate as the application socket method, but at least it tried to do bidirectional testing. I can't claim credit for this; I found it and modified it a bit to work in LM. I set the mirror site as a property since auto-selection does not work with

    Mark

        $size = 50;
        $mirror = "##speedtest.testmy.mirror##";
        if ($site -eq "") {
            $url = "${size}MB"
        } else {
            $url = "https://${mirror}${size}MB"
        }
        $path = "Out-Null"
        $WebClient = New-Object System.Net.WebClient
        $mbps = "{0:N2}" -f (($size/(Measure-Command {$Request=Get-Date; $WebClient.DownloadFile( $url, $path )}).TotalSeconds) * 8)
        Write-Host "Mbps="$mbps
  11. It has recently become clear to me that when you have multiple devices with the same instances (due to clustering or otherwise shared resources), there is no way to collapse these into a single alerting instance without a lot of manual and maintenance-intensive effort. The most recent example I have run into relates to vSphere with shared datastores (i.e., any implementation with vCenter, vMotion, etc.). In those cases, LM fails to distinguish the datastores presented at the cluster level from host-local datastores, and all are presented in parallel. With a vCenter and 8 hosts, this generates 9 alerts per clustered DS. I have been told by support I am out of luck on this for now, which is unfortunate, but I had an idea that could be used for this case along with other similar cases. Sadly, the name "Cluster Alerts" has already been taken, but I would call this idea that as well, just with a different methodology. So either an extension to Cluster Alerts or some new name TBD. The idea is to collapse specific instances within a group into a single instance at the group level. This way we can pick which instances are clustered (still manual, but manageable) and ensure only a single alert and view of those instances. Ideally, LM could be smart enough to identify the clustered elements and do this automatically, but this method would apply to any similar situation even if the API does not reveal the cluster binding details like ESX does. Please consider adding this enhancement soon! Thanks, Mark
  12. mnagel

    event handlers

    In Nagios, there is a concept of an event handler that can run to try to fix problems (e.g., restart a service, remove old files, etc.). I see no similar capability in LM and it is of course something customers want to see happen. For example, I just deployed a custom DS for someone to check for too many files in a share, indicating a service problem. Once I implemented that, the next question was "Can you restart the service when that goes into warning?" I see no facility for this in LM, but perhaps a custom alert could be used to trigger the behavior. If I used that approach, I would insert a custom HTTP alert into the escalation chain earlier on to give the problem a chance to be corrected, then I will have to create a secure REST API server to accept those and trigger the correct behavior. So in theory it could be done (if I am not missing something), but it feels like using a screwdriver to hammer in a nail. Thanks, Mark
  13. mnagel

    Cisco Warranty Status

    We used to have something like this for our Nagios notification script to include detail for Dell, HP and others that provide this data publicly, but we never were able to get Cisco data. It is available now via API, but not publicly. If LM is set up as a partner at the appropriate level, it should at least be technically feasible -- see and!serial-number-to-information/cisco-serial-number-to-information-api-reference in particular.
  14. Yikes! I recall discussing this issue with the dev team during beta review, and they did make sure the roles have the correct defaults. I just checked and the Map stuff is disabled for all our pre-existing roles (it had not been at one point during beta). I just spotchecked a client login we use to emulate their view, and there is no Map tab visible. You might still need to adjust one or more roles.
  15. mnagel

    RDP Sessions

    @WillFulmer I have not tried again since then -- did not seem to be a collector issue, more that the Powershell involved was too intense. Worked fine in a few cases, but when the server count was in the 80+ range (which is no problem normally), the collector bogged down and lost data until we disabled it. I have not had a chance to go over the code too much to see how it could be made more efficient. It may not be possible...
  16. mnagel

    more cluster alert improvement requests

    There actually is a solution for this now. See . Unfortunately, though this use case reflects a missing core feature, the solution is a premium license addon. I have had words with our CSM about this stance. They need a core version and an enhanced version, IMO. The other problem with it is that it is "write only". If you decide to change an instance after it is created, the wizard that creates them is not available (last time I checked anyway). But it should fix this problem.
  17. There is not, unless there has been a change recently. There are some datapoints within the collectors you can use (which we do for widgets displaying what we are able). There is not full coverage for all the various things that can incur fees (or usage against commit). I had to write a script against the API to capture this, and even then it is not complete. I had to define a custom property to group the results by client as well (with a warning if I find an element without the custom property). My script currently does this:

      * scans all resources, counts each regular and LMCloud type (deviceType==0 or deviceType==2)
      * scans all configsource instances to count LMConfig usage (dataSourceType == "CS")
      * scans all websites, counts each, noting whether it is internal or external

      I would definitely prefer something standard to get a complete picture of license consumption, especially since new separately licensed features are added from time to time.
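The counting logic described above can be sketched roughly as follows. This is not the actual script: it assumes the three lists were already fetched from the API and flattened into plain dictionaries, that the value of the grouping custom property has been copied into a hypothetical `client` key, that `deviceType` 0 means regular and 2 means LMCloud (following the order in the post), and that websites carry an `isInternal` flag.

```python
from collections import Counter, defaultdict


def tally_usage(resources, instances, websites):
    """Count license-relevant elements per client from pre-fetched records."""
    usage = defaultdict(Counter)
    warnings = []

    def client_of(record):
        # "client" stands in for the custom grouping property (assumed key).
        client = record.get("client")
        if not client:
            warnings.append(
                f"no client property on {record.get('name', '<unnamed>')}")
            return "UNASSIGNED"
        return client

    # Resources: count regular and LMCloud device types only.
    for res in resources:
        if res.get("deviceType") == 0:
            usage[client_of(res)]["regular"] += 1
        elif res.get("deviceType") == 2:
            usage[client_of(res)]["lmcloud"] += 1

    # ConfigSource instances count toward LMConfig usage.
    for inst in instances:
        if inst.get("dataSourceType") == "CS":
            usage[client_of(inst)]["lmconfig"] += 1

    # Websites: count each, split by internal vs. external.
    for site in websites:
        kind = "website_internal" if site.get("isInternal") else "website_external"
        usage[client_of(site)][kind] += 1

    return {client: dict(counts) for client, counts in usage.items()}, warnings


# Made-up sample records for illustration.
sample_usage, sample_warnings = tally_usage(
    resources=[
        {"name": "sw1", "deviceType": 0, "client": "acme"},
        {"name": "ec2-1", "deviceType": 2, "client": "acme"},
        {"name": "orphan", "deviceType": 0},
    ],
    instances=[
        {"name": "sw1-running-config", "dataSourceType": "CS", "client": "acme"},
        {"name": "sw1-cpu", "dataSourceType": "DS", "client": "acme"},
    ],
    websites=[{"name": "acme-www", "isInternal": False, "client": "acme"}],
)
```

Elements without the property are still counted, under `UNASSIGNED`, so the totals stay complete while the warning list shows what needs tagging.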
  18. mnagel

    datasource migration function

    It is nice to hear this is getting some attention, but there are definitely more items needed. There are at least two different issues with LogicModule maintenance.

    The one in this F/R relates to upgrades (e.g., new version of VMware modules). The hope was that you could retain previous data by matching DS/DP in the old to the new in a merge operation when importing the new datasource. I get it is complex, but it is also frustrating to see loss of historical data treated so casually.

    The one you are referencing is about module parameter override preservation. I like what you have above, but really, the correct solution for this is to allow linked cloning with inheritance and individual override of any or all elements in the module. There are already many examples of almost the same but not quite modules that could benefit from this. A great example of how that leads to problems is the excellent changes made by Steve Francis to support ActualSpeed ILPs for interfaces. Because that applies to only one (albeit common) case, all other interface DSes fail to get the same behavior. If that was done in a base interface module and the rest inherited from there with overrides/additions for their specific needs, that could be a general improvement for all interfaces.

    Lacking that, I would ask that in addition to the above, the following also be preserved:

    * additional datapoints (including scripts, etc.)
    * changes to datapoint settings (perhaps not scripts, but alert settings including templates)

    I am dealing with a specific use case right now where a fairly complicated NetApp DS needs to be updated to alarm when space available drops below a threshold. We can add a custom threshold, but then the alert template is default and the units are bytes. I can fix this, but this creates a maintenance headache regardless of whether I update in place (new versions will kill my changes) or clone (new versions require manual review to sync into the clone(s)).

    That said, the filter preservation would at least avoid issues with my changes to BGP- (filters admin down sessions) and changes to the collection interval for HP network devices (default is set to 1 day instead of 1 hour), so I will take what I can get!
  19. I have run into too many cases now where a new but slightly different DS is set up due to LM support actions, upgrades, etc. and the result is lost data or noncontinuous data. A good example I recently encountered is with NTP. The standard DS was not working in all cases. I was given a new DS that uses Groovy, and it works (which I appreciate!). But the datapoint list and names have changed, and even if they had not, there is no way to maintain data history from the old DS to the new DS. My recommendation is to add a migrate function so you can indicate how to map old to new datapoints in such a situation and thus avoid data loss. Building in a default migration ruleset into a new DS would be a bonus -- this could allow for zero-touch data migrations in at least some cases. Thanks, Mark
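As an illustration of what a migration ruleset might look like, here is a small sketch. It is purely hypothetical: LM exposes no such function, the `migrate_history` helper and the ruleset format are invented for this example, and the NTP datapoint names are made up rather than taken from either datasource.

```python
def migrate_history(old_series, ruleset):
    """Re-key historical samples from old datapoint names to new ones.

    old_series: {old_datapoint: [(timestamp, value), ...]}
    ruleset:    {old_datapoint: new_datapoint, or None to drop it}
    Returns the migrated series plus any old datapoints the ruleset missed.
    """
    migrated = {}
    unmapped = []
    for old_name, samples in old_series.items():
        if old_name not in ruleset:
            unmapped.append(old_name)   # needs a human decision
            continue
        new_name = ruleset[old_name]
        if new_name is None:
            continue                    # explicitly retired datapoint
        migrated.setdefault(new_name, []).extend(samples)
    for samples in migrated.values():
        samples.sort()                  # keep each series in time order
    return migrated, sorted(unmapped)


# Made-up datapoint names standing in for an old and a new NTP datasource.
old_ntp = {
    "offset": [(100, 0.4), (160, 0.5)],
    "jitter": [(100, 0.1)],
    "stratum": [(100, 2)],
}
default_ruleset = {"offset": "offsetMs", "jitter": None}

new_ntp, needs_review = migrate_history(old_ntp, default_ruleset)
```

A default ruleset shipped with the new DS would cover the mapped and retired datapoints automatically, leaving only the unmapped ones for manual review.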
  20. We run into situations where widgets display an error of some sort due to any number of reasons, usually when datasources are changed or resource group structure is updated. The point is, there is an error displayed in the widget instead of the data, and the fact that this is happening is currently invisible, just waiting for an embarrassing moment with clients. I propose that this meta-issue be reflected in a way it can be detected in advance. If the API can pull that information, then a check could be set up within LM or via external script. I just checked our JSON dump of all widgets and nothing like this appears to be in place currently.
  21. The newer filter capability is appreciated, but would be even better if more complex logic could be applied (AND/OR/NOT for multiple filters) to really focus on specific types of traffic while excluding others. For interfaces, glob matches would be very helpful. For src/dst address match, please allow for prefix matching as well as host matching. Thanks, Mark
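The kind of filter logic requested above might be expressed along these lines. This is only a sketch of the semantics (AND/OR/NOT, interface globs, prefix matching), not how LM implements or would implement flow filters; the helper names and the flow record keys (`interface`, `src`, `dst`) are assumptions.

```python
import ipaddress
from fnmatch import fnmatchcase


def iface(pattern):
    """Glob match on the interface name, e.g. 'Gi1/0/*'."""
    return lambda flow: fnmatchcase(flow["interface"], pattern)


def src_in(prefix):
    """Prefix match on source address; a bare host address acts as a /32."""
    net = ipaddress.ip_network(prefix)
    return lambda flow: ipaddress.ip_address(flow["src"]) in net


def dst_in(prefix):
    """Prefix match on destination address."""
    net = ipaddress.ip_network(prefix)
    return lambda flow: ipaddress.ip_address(flow["dst"]) in net


def all_of(*preds):   # AND
    return lambda flow: all(p(flow) for p in preds)


def any_of(*preds):   # OR
    return lambda flow: any(p(flow) for p in preds)


def negate(pred):     # NOT
    return lambda flow: not pred(flow)


# Made-up flows: traffic on Gi1/0/* from 10.20.0.0/16 that leaves 10.0.0.0/8.
flows = [
    {"interface": "Gi1/0/1", "src": "10.20.3.4", "dst": "8.8.8.8"},
    {"interface": "Gi1/0/2", "src": "10.20.3.4", "dst": "10.1.1.1"},
    {"interface": "Te2/0/1", "src": "10.20.9.9", "dst": "1.1.1.1"},
    {"interface": "Gi1/0/7", "src": "192.168.1.5", "dst": "8.8.4.4"},
]
rule = all_of(iface("Gi1/0/*"), src_in("10.20.0.0/16"),
              negate(dst_in("10.0.0.0/8")))
matched = [f for f in flows if rule(f)]
```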
  22. mnagel

    Search In Alert Tuning

    Please fix this! I am looking at the top level of one client's Alert Tuning page and it has 267 items. Not being able to search is very frustrating. Search box presence is extremely hit or miss in general and needs UX attention, please!
  23. mnagel

    VMware Host Network Interface Status

    This is not exactly what you want, but it works (we also had unhappy surprises on this). You just have to disable alerting on vmnics expected to be down. And you will have to wait for LM code review: 2EADXJ
  24. You can do this with propertysources that add a category to devices. The only problem with those is you have no control over how often they run, but at least now you can trigger them on demand (manually).
  25. mnagel

    Palo Alto Improvements

    Here are some datasources we added to get better information on Palo Alto firewalls:

    * Certificate Status: KFWLJ9
    * High Availability Detail: EMXWRR (this one includes a bunch of HA info, including HA link status, compat status and so forth. Many auto properties for reference on the local and peer units. All datapoints currently use the default alert templates, but I am hoping to extend that and leverage the auto properties for those messages)
    * Support Status: 3YJJCZ
    * License Status: DXEAP4

    All use the XML API, so will require security review (no idea how long that takes).