mnagel

Members
  • Content count

    169
  • Days Won

    36

Community Reputation

51 Excellent

2 Followers

About mnagel

  • Rank
    Community All Star
  • Birthday July 17
  1. We regularly encounter situations with clustered resources where an alert will always be active on a standby device. For example, the default for Palo Alto firewall interfaces is to be operDown on the standby firewall. This leads to similar alerts on the connected switches. What I really care about is the status on the active member, but we get tons of alerts on the standby. You can't just disable them, as the standby may be active at some point. What is really needed (and again, this is a general issue -- Palo Alto is just one example) is the ability to group equivalent instances. I hoped the Cluster Alerts feature might help, but it is not even close to fine-grained enough. I want to group (in this example) interface pairs so that the alarm triggers only when both instances are down. This applies to many similar situations in real-life monitoring, and it is very painful to have to explain to our customers why this basic feature is missing. It is similar to the previously discussed device dependency issue, but different enough that I think it deserves its own focus. Thanks, Mark
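To make the request concrete, here is a minimal sketch of the grouping rule described above: an alarm is raised for a group only when every member instance is down. The record layout, device names, and group labels are hypothetical and exist only for illustration; nothing here is an existing LM feature.

```python
from collections import defaultdict


def grouped_alerts(instances):
    """Return the names of groups in which every member instance is down."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["group"]].append(inst["up"])
    # A group alarms only if none of its members is up.
    return sorted(name for name, states in groups.items() if not any(states))


# Hypothetical active/standby firewall pair with two grouped interfaces.
instances = [
    {"device": "fw-a", "instance": "ethernet1/1", "group": "fw-uplink-1", "up": True},
    {"device": "fw-b", "instance": "ethernet1/1", "group": "fw-uplink-1", "up": False},
    {"device": "fw-a", "instance": "ethernet1/2", "group": "fw-uplink-2", "up": False},
    {"device": "fw-b", "instance": "ethernet1/2", "group": "fw-uplink-2", "up": False},
]

print(grouped_alerts(instances))  # ['fw-uplink-2']
```

The standby member of `fw-uplink-1` being down does not alarm because its partner is up; only `fw-uplink-2`, where both members are down, is reported.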
  2. I have a longstanding ticket in progress on simply handling Cisco router and switch logs (which do not work due to undocumented message format requirements), but yep, I gave up. It is just not an area LM is focused on. I moved my log handling to SumoLogic and place collectors on the same boxes I have LM collectors on, hence my question here. I still want to relay information from SumoLogic into LM, but it will be strictly metric-based data based on API queries to SumoLogic. Still working on that...
  3. Timezone per user account

    @Ali Holmes I saw a timezone field added to the user object in the v104 update, but nothing in the profile editor. Is that just preparation for this change, or can this be managed via the API and work now?
  4. I would like to see capabilities added to logicmodules so that the custom properties they leverage (device or instance) are not only listed in the corresponding property definition sections, but ideally presented in a meaningful way. A very good example of this is the awesome work @Steve Francis did recently for the interfaces module. Unfortunately, if someone wants to leverage ActualSpeed and such, an ironclad memory or a dive into the LM technical notes and/or source code is necessary. How cool would it be to show the additional custom properties as fields in the UI, along with the type of data (fill-in field, checkbox, radio buttons, etc.) and some hover text or similar explaining the purpose of each? If nothing else, additional custom properties available in a scope should be included in the autocompletion -- right now autocompletion for properties is a hardcoded subset of what is actually possible.
  5. I solved this with the REST API -- I have a script that pulls all the config instances, then commits them into a git repo (one per device group in my case). I use GitLab with an email push trigger, so I get reports on updates more or less as they occur, but obviously that could be handled other ways. I find this beneficial for many reasons, including:

    * keep a separate copy out of LM so some errant action in LM does not destroy config data and history
    * enable searching for configuration elements across multiple files (impossible within the UI)
    * build tools to validate configs against baseline templates or other sorts of analysis
    * generate way better diff results (the current UI diff viewer REALLY needs a rewrite)
    * I get to find out all the times LMConfig flakes out, like when it zeroes out config instances if NVRAM is temporarily unavailable

    I would like to post the scripts here, but you can't post text files it seems. Will see if I can get those into GitHub....
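As a rough illustration of the "pull, then commit" flow (not the actual script), the sketch below covers only the local half: writing already-fetched config text into a repository tree and committing when something changed. The API retrieval step is deliberately left out, and the directory layout, `.cfg` suffix, and function names are assumptions made for this example.

```python
import subprocess
from pathlib import Path


def write_configs(repo_dir, configs):
    """Write each config to <repo_dir>/<device>/<instance>.cfg.

    `configs` maps (device, instance) to config text.
    Returns the repo-relative paths that were created or changed.
    """
    changed = []
    for (device, instance), text in sorted(configs.items()):
        path = Path(repo_dir) / device / f"{instance}.cfg"
        if path.exists() and path.read_text() == text:
            continue  # identical to what is already on disk
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)
        changed.append(path.relative_to(repo_dir).as_posix())
    return changed


def commit(repo_dir, message):
    """Stage everything and commit only if something is staged (needs git)."""
    subprocess.run(["git", "-C", str(repo_dir), "add", "-A"], check=True)
    # `git diff --cached --quiet` exits non-zero when staged changes exist.
    staged = subprocess.run(["git", "-C", str(repo_dir), "diff", "--cached", "--quiet"])
    if staged.returncode != 0:
        subprocess.run(["git", "-C", str(repo_dir), "commit", "-m", message], check=True)
```

A periodic job would call `write_configs` with the latest pulled configs and then `commit`; a push trigger on the hosting side can then send the change reports.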
  6. add property groups

    Currently, the only way to define standard properties for a bunch of devices is to assign those properties to a group, then assign devices to that group (we do this most often with location, but it applies as well to other common settings like SNMP parameters, pointers to support information, etc.). This works for devices, but leads to numerous problems (e.g., it breaks RBAC), and it does not apply to other objects like Services (I refuse to start calling those Websites, as a ping check to a firewall is not a Website). A solution for this would be to allow creation of property groups that can then be bound to any object. The main goal is to avoid repeatedly defining the same thing across many objects so changes later can be made efficiently and without error. Thanks, Mark
  7. Currently, the SLA widget will report the amount of time that all device/datapoint pairs in a group meet the SLA. It also is desirable to show the average for each device/datapoint within the group as the result. The current method can be explained to folks after some time, but it is a case of "interesting but not helpful" (in most cases). I would like to see an option to calculate the result either way, please...
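The difference between the two calculations is easier to see with numbers. This is a sketch of my reading of the two methods, using made-up samples where each list holds one pass/fail result per time slot for a device/datapoint pair; the pair names are hypothetical.

```python
def sla_all(samples):
    """Percent of time slots in which every pair met the SLA (current behavior)."""
    slots = list(zip(*samples.values()))
    return 100.0 * sum(all(slot) for slot in slots) / len(slots)


def sla_average(samples):
    """Average of each pair's own percent of time within the SLA (requested option)."""
    per_pair = [100.0 * sum(s) / len(s) for s in samples.values()]
    return sum(per_pair) / len(per_pair)


# Hypothetical pairs, four time slots each (True = within SLA).
samples = {
    "sw1/ping": [True, True, True, False],
    "sw2/ping": [True, False, True, True],
    "sw3/ping": [True, True, True, True],
}

print(sla_all(samples))                # 50.0
print(round(sla_average(samples), 1))  # 83.3
```

Two short, non-overlapping outages on different devices pull the "all pairs" figure down to 50%, while the per-pair average stays above 83%.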
  8. As of this point, the syslog feature in LM is not working very well, so it would be appreciated if the on-by-default listener were easier to disable. I know it can be done with manual edits of the collector config, but I would prefer either off-by-default or a simple checkbox to enable/disable.
  9. widget status via API

    One of the most disappointing things encountered during a system review meeting with clients is to find widgets reporting "no data" because something changed or silently stopped producing results. In many other API endpoints, I have found there is a ton of state data embedded, but widgets have really nothing other than the lastUpdated* fields. I would like to see at least a field that indicates there is no data for some or all datapoints, just like you would see in the UI. Then we could generate reports on faulty widgets and be able to remain proactive. Thanks, Mark
  10. object versioning

    There are currently far too many opportunities to commit errors in LM from which it is difficult to recover, since there is no version tracking. Ideally, it would be possible to revert to a previous version of any object, but especially very sensitive objects like logicmodules, alert policies, etc. I have created my own method of dealing with this, which leverages the API to regularly store JSON streams of all critical elements, with changes committed via git (certain adjustments to the original results are needed to avoid a constant update stream). Recovery would be very manual, but at least possible. This would be far more useful within the system itself. Thanks, Mark
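The "adjustments to avoid a constant update stream" could look something like the sketch below: drop fields that change on every fetch and serialize with sorted keys, so git only sees real changes. The key names in `VOLATILE_KEYS` are hypothetical placeholders, not a list of actual API fields.

```python
import json

# Hypothetical names of fields that change on every fetch; adjust per object type.
VOLATILE_KEYS = {"lastUpdated", "updatedOn"}


def normalize(obj):
    """Recursively remove volatile keys from dicts nested anywhere in obj."""
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items() if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    return obj


def to_stable_json(obj):
    """Serialize with sorted keys and fixed indentation for clean git diffs."""
    return json.dumps(normalize(obj), indent=2, sort_keys=True) + "\n"
```

Two snapshots that differ only in timestamps or key order then produce byte-identical files, so no commit is generated for them.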
  11. Better BGP

    GE9YGG This modification of "BGP-" accounts for sessions that are intentionally shut down by checking both the PeerAdminState OID and the PeerState OID. It required a hack to find if the PeerState is != 6, but it works.
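The intended logic, written out in Python rather than in the datasource itself, is roughly the following. The numeric values are assumed from the standard BGP4-MIB (admin status 1 = stop, 2 = start; peer state 6 = established); the function name is made up for this sketch and the module's actual expression is not reproduced here.

```python
# Values assumed from the standard BGP4-MIB.
ADMIN_STOP, ADMIN_START = 1, 2  # bgpPeerAdminStatus
ESTABLISHED = 6                 # bgpPeerState


def session_alert(peer_admin_state, peer_state):
    """Alert only when the session is administratively enabled but not established."""
    return peer_admin_state == ADMIN_START and peer_state != ESTABLISHED


print(session_alert(ADMIN_START, 1))           # True: enabled but idle
print(session_alert(ADMIN_STOP, 1))            # False: intentionally shut down
print(session_alert(ADMIN_START, ESTABLISHED)) # False: healthy session
```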
  12. Currently, it is impossible to grant access to see unmonitored devices in LM without granting full admin access. In shared portals, it is just one more of the monolithic elements (like alert rules and escalation chains) we can't share, but even in a dedicated portal it is hard to justify full admin for this function. Please add that to the Settings section in the Role permissions. In the meantime, we can use scripts to extract and report on those, but folks prefer to use mainly the UI if possible. Thanks, Mark
  13. Show "Bottom 10"

    Thank you!
  14. I have clients who rightly expect that if they ACK something at warning, they will get a new alert if the issue worsens to error or critical level. I have to explain to them that LM doesn't work this way, because once something is ACK'ed, it is ACK'ed forever (until the alert clears). At least, that certainly seems to be the case, and there is no documentation to the contrary. Having used other tools that do support more sophisticated methods, I recommend adding the following features:

    * option to reset an ack when the state becomes worse (this can admittedly open a can of worms when states transition rapidly, so perhaps this should still require a delay timer)
    * option to manually clear an ack that might have been set by mistake
    * option to set a maximum time for acks to remain active (hierarchical/inherited timer value, definition set in datapoint when desired). This is because some issues are so important that you don't want them to be ignored indefinitely in error.

    Some of these can be managed via API-based enhancements, but it would be preferred to have direct controls. Thanks, Mark
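A minimal sketch of the first and third options, assuming a simple three-level severity scale: an ack stops counting once the alert is worse than it was when acknowledged, or once an optional maximum age has passed. The class, function, and parameter names are hypothetical, and the delay timer for rapidly flapping states is left out for brevity.

```python
from dataclasses import dataclass

SEVERITY = {"warning": 1, "error": 2, "critical": 3}


@dataclass
class Ack:
    severity: str    # alert severity at the time of the ack
    acked_at: float  # time of the ack, in seconds


def ack_still_valid(ack, current_severity, now, max_age=None):
    """Return False if the alert has worsened or the ack has expired."""
    if SEVERITY[current_severity] > SEVERITY[ack.severity]:
        return False  # escalated past the acknowledged level
    if max_age is not None and now - ack.acked_at > max_age:
        return False  # ack older than the configured maximum
    return True


ack = Ack(severity="warning", acked_at=0.0)
print(ack_still_valid(ack, "warning", now=600.0))                 # True
print(ack_still_valid(ack, "critical", now=600.0))                # False
print(ack_still_valid(ack, "warning", now=7200.0, max_age=3600))  # False
```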
  15. custom speed for interfaces

    I had to solve this in one portal by creating a complex datapoint in the original DS (I still need to take a look at the revised DS in this thread). Getting that combined state is harder than you might expect because of a silly design flaw in the DP expression language -- there is neither a 'not' operator nor a 'neq' operator. So in mine, to know that something is admin up and oper "not up", I have 'ActualStatus' calculated as "and(eq(AdminStatus,1),eq(Status,2))" instead of the more correct "and(eq(AdminStatus,1),neq(Status,1))". Simple to fix in the language interpreter I am sure, but I am not holding my breath. The one I did works well enough, though, so I can get reliable alerts for unexpected ports down on core devices.
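To show why the two expressions are not equivalent, here is a small Python emulation. It assumes Status carries the standard ifOperStatus values (1 = up, 2 = down, 3 = testing, up to 7 = lowerLayerDown); `eq`, `and_`, and `neq` are stand-ins written for this sketch and say nothing about how the DP expression language is implemented.

```python
def eq(a, b):
    return 1 if a == b else 0


def and_(a, b):
    return 1 if a and b else 0


def neq(a, b):
    # The missing operator, emulated here as the complement of eq.
    return 1 - eq(a, b)


def actual_status_workaround(admin, oper):
    """and(eq(AdminStatus,1),eq(Status,2)): admin up and oper exactly 'down'."""
    return and_(eq(admin, 1), eq(oper, 2))


def actual_status_intended(admin, oper):
    """and(eq(AdminStatus,1),neq(Status,1)): admin up and oper anything but 'up'."""
    return and_(eq(admin, 1), neq(oper, 1))


# Compare both forms for an admin-up port across all ifOperStatus values.
for oper in range(1, 8):
    print(oper, actual_status_workaround(1, oper), actual_status_intended(1, oper))
```

The two agree for oper values 1 and 2, but the workaround misses states such as 7 (lowerLayerDown), which the `neq` form would catch.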