mnagel
  1. More Granular Role Access

    We found this past week that it is far worse than we originally thought. Even within a portal for a single company, it is not possible to give different teams role-based access to delete a device based on group membership, even though that group is marked for Manage within the user's role. Delete is blocked if the device also exists in some other group they do not have Manage on, which is very common since devices are placed in groups for alert routing, location binding and so forth -- and granting Manage on those is tantamount to full admin access. When we brought this behavior to support, we got the canned "it is supposed to work that way, not a bug" response I am so fond of. As a result, we end up either making everyone an admin or handling those tasks for them. IMO, this is a bug, not a desirable feature, but here we are...
  2. There is a MIB, so it should be reasonably straightforward -- I am surprised it is not already there, but I see no matches for HSRP or VRRP in the DS list :(. http://www.oidview.com/mibs/9/CISCO-HSRP-MIB.html Now I feel like I gotta try to get that working....
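     As a starting point, here is a minimal Groovy sketch of what the active discovery could look like, walking cHsrpGrpStandbyState from that MIB; collection would then poll the same OID per instance. It assumes the collector's embedded Snmp helper behaves as in LM's published examples, and the parsing is illustrative only:

         import com.santaba.agent.groovyapi.snmp.Snmp

         // cHsrpGrpStandbyState: 1=initial 2=learn 3=listen 4=speak 5=standby 6=active
         def baseOid = "1.3.6.1.4.1.9.9.106.1.2.1.1.15"
         def host    = hostProps.get("system.hostname")

         // walk returns one "oid = value" line per HSRP group, indexed by ifIndex.group
         Snmp.walk(host, baseOid).eachLine { line ->
             def parts     = line.tokenize("=")*.trim()
             def wildvalue = parts[0].replaceAll(/^\.?${baseOid}\./, "")
             println "${wildvalue}##HSRP group ${wildvalue}"   // wildvalue##wildalias
         }
         return 0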
  3. I am not quite sure how this would end up working, but I know the current behavior is not working out well. I am having some trouble articulating the issue to LM support and in general, but here goes... If an instance goes into alert and is later removed from the underlying device (in reality, not in LM), the instance datapoints all generate "No Data" and the threshold alert will never clear until the instance is removed from LM. It seems to me that after a long enough period of "No Data" the alert should clear on its own, though I am not sure how long that ought to be. Certainly true after months of no data, but probably less than months.

     What I am sure about is that in this specific case, which was NetApp LUN volume usage, none of the datapoints alarm when the LUN is removed, and THAT seems wrong, as I should darn well have a way to know if a LUN goes away unexpectedly. IMO, there should be a canary datapoint for anything critical like that, marked to alarm on no data (sketched at the end of this post). If there are no such datapoints, I do not understand why DS developers would not just set instances to auto-delete.

     My main goals here are twofold -- we should not have alerts forever for stuff that is gone, and we should know when stuff that matters is gone and not be bothered when it does not matter. And when I say "we", I mean my clients, who get frustrated by nonsense alerts, as they should.
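     To make the canary concrete in DS terms: it could be any datapoint that must exist while the LUN does (capacity, say), with its "If there is no data for this datapoint" setting switched from "Do nothing" to triggering an alert at a suitable severity. The disappearance then alarms on its own instead of wedging the usage threshold alert forever. (The setting names here are from the datapoint definition dialog as I recall it -- the exact wording may differ.)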
  4. Time Zone setting at the User level

    Pretty sure not a soul from LM has chimed in on this or the other related threads. I get questions on this from clients who, using common sense, assume it should just be possible, and I have to give them very sad answers. I escalated to my CSM, but no response as yet.
  5. Ping only devices

    I was also going to note that internal service checks are the solution, with the caveat that you have zero ability to control the alert format for specific instances -- there is only one global service alert template. With Ping_Multi-, you can at least clone it and generate a custom alert (except you can't reference the WILDVALUE token, which is super annoying). Since LM is on this thread, I will also point out that the Service functionality is only a small subset of what other vendors provide (e.g., Pingdom) at the same or higher per-device cost. I would love to see improvements in alert handling as well as in check scope (more than just port 80/443 for HTTP checks, plus general port checks, to name just a couple).
  6. add key/value store (redis or similar)

    One obvious missing token I have raised to support several times is WILDVALUE -- it is not possible to reference it in alerts, which means you cannot say in an alert which input value triggered it. And of course, that is "not a bug, but a feature request", which I hear frequently. You also cannot pass more than a few specific tokens into PowerShell scripts, and the limitations are not well documented. This specific issue is related to recent security monitoring we have been asked to implement. LM is not necessarily the correct tool for that, but then the whole point of using LM is to avoid a plethora of tools. When trying to encode the expected security settings for a Windows folder into a field, we found there are size limits, so we had to use fields to index values hardcoded in the script. A key/value store would help. It would also help with monitoring for changes in values, like group membership, etc. I could roll my own and figure out how to get it synced to all the collectors, or it could be a service provided by the collectors themselves (preferable).
  7. It is becoming very clear we cannot rely on parameters in LM to drive scripts, either because some tokens are mysteriously unavailable for use as parameters (discovered only after assuming they would work), or because tokens have length limitations that preclude using them to pass data to LogicModules. Please consider integrating a distributed key/value store like redis into LM, with data replicated among collectors. This would help with access to configuration data as well as with cross-run results within or across datasources. Ideally this would work natively with Groovy and PowerShell -- a rough sketch of the Groovy side follows.
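     To make the request concrete, here is a minimal Groovy sketch of what a collector script could do today with a self-managed store. Everything here is an assumption: a reachable redis replica (kv.example.local), the Jedis client jar manually added to the collector, and a key naming scheme I made up:

         import redis.clients.jedis.Jedis

         def host = hostProps.get("system.hostname")
         def kv   = new Jedis("kv.example.local", 6379)  // hypothetical shared store
         try {
             // Expected ACL for a monitored folder -- too large to fit in a
             // device property today, trivial as a key/value entry.
             def expected = kv.get("acl:${host}:C:/Secure")
             def actual   = ""  // TODO: collect the live ACL here (e.g., via WMI)
             println(expected == actual ? 0 : 1)  // 0 = compliant, 1 = drifted
         } finally {
             kv.close()
         }

     If the collectors themselves offered this as a service, the connection boilerplate and the replication problem would disappear from every script.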
  8. custom speed for interfaces

    Excellent on the token thing -- looking forward to that! What I found when I removed the operStatus AD filter was that a bunch more interfaces went into alarm almost immediately. I think my script to deactivate alerts would have eventually caught up, but it was super noisy so I quickly reverted. I need to look at the new way, as my method was a necessary evil given the tools available. I did notice later, though, that the failure to update the description for down interfaces made my script less useful than intended.

    Nonunicast is a funny thing. Acceptable levels vary across environments (I have seen Nexus 7K cores handle 50000pps without breaking a sweat -- not good, but not deadly to the switch), but there are levels that are absolutely bad in typical environments. I normally do not set thresholds on percentage, as that could trigger for ports on otherwise inactive hosts that see little other than nonunicast traffic. A rule of thumb is that for access ports, under 200pps can be safely ignored (though it is still high). Trunk ports will tend to run higher, as you see the combined levels for all VLANs on the trunk. When we see "freak out" levels, they are in the 1000pps-or-higher range. Translating to LM-speak, I would start with "> 200 1000 2000" (but again, it is hard to set just one good threshold) -- see the breakdown at the end of this post.

    Thanks! Is the format documented, or is it literally the name within the instance, simply in scope for the datapoint instance at alert time? How clashes with device property names are avoided is my real question, I guess...
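    For anyone translating that, LM static thresholds pack all three severities into one expression, so "> 200 1000 2000" reads as:

        > 200 1000 2000
          warning   nonunicast > 200pps
          error     nonunicast > 1000pps
          critical  nonunicast > 2000pps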
  9. I have run across cases where we need a replica datapoint to drive different behavior than the standard datapoint. When I need this, LM support normally tells me to create a new virtual datapoint equal to the original value and then use that to achieve my goals. This sounds simple, but the truth is it leads to divergence from shipped datasources, which creates maintenance headaches. If we had a way to cross-reference datapoints in another datasource, we could achieve this in a much more manageable way, with insulation from future DS updates. This actually relates to other features I have requested (multiple thresholds for datapoints and linked clones, to name a couple), but is perhaps a simpler way to achieve those goals.

     A simple example of this is the reboot alert. I would like to know the following:

     * system rebooted (check, this is present)
     * system has NOT rebooted in at least time period X (nope, would need a replica DP to create the threshold, diverging from shipped code)
     * system uptime is approaching counter-wrap territory (similar to the above, but now another replica DP is needed)

     If I could create a new DS populated with references to other DS/DP values, then I could make this happen in a way that survives DS updates (in the general case); a sketch of the workaround as it stands today follows below. The next level of this would address alerts over longer periods by allowing the cross-referenced DP values to be functions, including "average over last N seconds" and so forth.
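     For reference, the replica-DP workaround looks roughly like this, assuming a normal datapoint named UptimeSeconds in the same datasource (the names and the 30-day margin are my own, and the comparison function is LM's complex-datapoint expression syntax as I understand it -- verify against the current docs):

         // complex DP "UptimeThresholdCopy": same value, separate threshold,
         // but now the DS has diverged from the shipped module
         UptimeSeconds

         // complex DP "UptimeNearWrap": 1 when a 32-bit TimeTicks uptime
         // (wraps at 2^32/100 seconds, about 497 days) is within ~30 days of wrapping
         if(gt(UptimeSeconds, 40350000), 1, 0)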
  10. More Granular Role Access

    The entire RBAC mechanism is way too coarse. I had a client ask yesterday why they can't disable alerts for a device group. As far as I can see, that ability comes along only with Manage, and I see no reason why that should be true -- I don't want them to have that level of control, but it is all or none. RBAC granularity improvements are sorely needed.
  11. Do not run autodiscovery on devices in SDT

    Yes, please! This is similar to my recent request as well.
  12. Complex Datasources

    Adding to this (since it was recently referenced and I am in a similar situation). There are several related areas where improvements could be massively leveraged:

    * Allow easy introspection within scripting. Currently I can get information about other instances from other LogicModules, but I have to use the REST API, and it is from scratch in each case. I am trying to put together an eventsource that reacts to an alarm in a service -- this is basically not possible without the REST API, and that is a lot of code to deal with each time (see the sketch below for a sense of the boilerplate).
    * Allow library management for code so we don't have to cut and paste virtually identical code into umpteen places. You might say "build your own libraries and push them out", but there is no real facility for file distribution across collectors in LM. Imagine how much better LMConfig modules could be written and maintained if libraries existed in the framework. As it stands, with each weighing in at 1000+ lines (most of which is duplicated), I would not dream of editing them myself. If they were done more like Oxidized, that would be a very different story.

    With library support, I could even deal with the pain of introspection, since then at least the REST API portion would be maintained only once.
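    To give a sense of that boilerplate, below is a minimal Groovy sketch of one LMv1-signed GET -- roughly what every such script has to carry before it does anything useful. The property names (api.id, api.key, api.account) are placeholders I chose, not built-in properties:

        import javax.crypto.Mac
        import javax.crypto.spec.SecretKeySpec

        def accessId  = hostProps.get("api.id")       // hypothetical custom props
        def accessKey = hostProps.get("api.key")
        def account   = hostProps.get("api.account")

        def path  = "/device/devices"
        def epoch = System.currentTimeMillis().toString()

        // LMv1 signature: base64(hex(HMAC-SHA256(verb + epoch + body + path)))
        def hmac = Mac.getInstance("HmacSHA256")
        hmac.init(new SecretKeySpec(accessKey.bytes, "HmacSHA256"))
        def hex = hmac.doFinal(("GET" + epoch + path).bytes).encodeHex().toString()
        def signature = hex.bytes.encodeBase64().toString()

        def conn = new URL("https://${account}.logicmonitor.com/santaba/rest${path}?size=5").openConnection()
        conn.setRequestProperty("Authorization", "LMv1 ${accessId}:${signature}:${epoch}")
        println conn.inputStream.text  // JSON payload; parsing omitted here

    Multiply that by every module that needs introspection and the case for shipped libraries makes itself.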
  13. Dynamic Tables

    @Ali Holmes Thank you! One more desirable thing would be the ability to reference properties (device properties and ILPs), not just datapoints. This would enable presentation of asset information, for example, assuming PropertySources are producing results.
  14. custom speed for interfaces

    I can live with a default of 1000, as it would definitely catch storms, which is the main goal. We have settled (in our other tools) on 100-200pps as a default, with higher levels for trunks. We also try to watch for such things over extended periods, but average-over-time is a different FR I am awaiting action on :). When I listed "> 200 1000 2000", I was trying to keep to the "LM way" :).
  15. Disable/Enable Global DataSource

    The Applies To method is a non-starter -- it will destroy all historical data. This is definitely needed! I recently submitted a similar FR related to disabling all activity on a device (AD and polling), since the only other option is to delete the device (with similar downsides).