mnagel

Members
  • Content Count

    501
  • Days Won

    92

Posts posted by mnagel

  1. Sure, you can use Service Insight for this, but it is a premium feature, which is using an expensive mallet to handle something that should be available without that extra cost.  Or, there should be a Service Insight light for this stuff, leaving the costly part for the intended enhanced features of Service Insight (like Kubernetes).

    My recommendation on this was to extend cluster alerts so you could at least match up instances.  My use case at the time was to detect an AP offline on a controller cluster.  There is no way to do this without SI, which as you say is complex, and it is an extra cost.  We need stuff like this in the base product.

    • Like 1
  2. There is a way to do this, but it is not well-documented and there is no UI exposure for "No Data" alerts; you have to dig around the module sources to find them (because it is very hard to put an indicator in the alert tuning thresholds, I guess).

    We have standard alerts on 2 datapoints that have No Data alerts and no other alerts.  The first is for "Host Uptime" -> SNMP_HostUptime_Singleton -> Uptime and the second is for Uptime- -> * -> UpTime.  If a host stops responding to SNMP, those will trigger.  We keep them near the end of our alert policy to generally report to our team across all clients.

    • Like 2
  3. What you want is a dynamic template processor, but all we have is simple token substitution and no indication that will ever change (I have asked repeatedly for years).  You can route alerts to an external integration, which is how we handle transformation of tokens, but you lose some stuff when you do that depending on how you integrate.  For example, we use external email integration into our ticket system with a filter that handles the transformation, but custom email integrations do not get the same handling as builtin email for certain things (e.g., you do not get ACK or SDT notices).

  4. My two cents -- I gave up on using syslog and most other eventsources a long time ago due to lack of basic correlation features. At the time, Cisco logs weren't even parsed correctly in our client environments and it took forever to get that dealt with.  We have used SumoLogic for log processing since then, because we can run queries on the data over time and get meaningful results (and if needed, tie to LM via the SumoLogic API).  LM also realized the existing stuff was a bit limited, so it bought a company and added LMLogs as a premium addon.  That is fine, but adding some basic ability to correlate "regular" events (even just counts over time based on custom cross-event ID extraction) should be included in the base license.  We still use eventsources for Windows event logs to have the extra info visible, but we always have to warn folks not to bother ACK'ing any that generate email since that does nothing meaningful.  I have asked that the ACK functionality be removed for eventsources as well (SDT still is helpful).

  5. 1 minute ago, Todd Theoret said:

    Thank you Stuart. Any chance there is a potential solution being developed which would allow....manually tagging....devices?

    So this has been an issue for us a lot -- everything was tossed into the topology umbrella for alert suppression with no easy way to manually create dependencies.  There are many topologies that are simply not discoverable, like multipoint/mesh WAN topologies and really anything not handled by topology sources. 

    The good news is that some kind support tech provided me a Manual_Topology module that linked various devices manually that eluded auto-discovery.  The bad news is it is awkward and leverages hardcoded device names and MAC addresses.  But, it is possible.  IMO the UI and/or API should be extended to support manual links.  It is a last resort of course, but there are common cases where it is the only resort.

  6. Check Mike Suding's blog page -- lots of cool stuff, including this. A bit old, but probably still works :).

    http://blog.mikesuding.com/2016/09/20/restart-a-service-alert-if-restart-fails/

    As far as the debugger, yeah -- that stuff freaks me out a lot given that LM more or less requires Domain Admins on collectors (really should be Performance Monitoring Users, especially after the recent SolarWinds incident).  You can run those debugger commands from the API as well, even more scary.

  7. I 100% agree this is needed -- we have to hack around this all the time with escalation chains that have one or more empty stages, and still that does not prevent alerts from registering in the system.  But this is just one case that would be trivial to solve with DS inheritance, something I have been pushing for well over four years now. The issue with creating new DSes is they are then freestanding clones, meaning each must now be maintained independently (and this is commonly pushed by support as a solution, sadly).  If we could just get inheritance done (not just for DSes, but that would be the highest impact) it would be easy to make a copy that does what you want with changes only to parameters you desire while still getting the benefit of updates on the parent module and minimal maintenance requirements.  It would be important that child module applies-to expressions are automatically excluded from the parent chain, too.

    A related change for alerts that would not be solved by inheritance, but that I benefited from in our previous tool, is threshold calculation over time.  For example, I don't care if CPU is high on a Windows server for a few minutes, but I do care if it is high for an hour. I also need to know if the average is high over a period of time when the actual level may be oscillating during that period (and LM would not generate any alerts otherwise).  With Nagios we did this by calling back to the pnp4nagios RRD data to calculate averages, slopes, etc.  This could be done in LM if using the API from within modules was supported properly, but I refuse to go there until there is library support within the module system.
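    To make the distinction concrete, here is a minimal Python sketch of the two checks described above: one on the average over a window, one requiring every sample to be high. It is only an illustration of the calculation, not something LM or the old Nagios setup provides; the function names and the input shape (a plain list of per-minute readings) are assumptions.

```python
from statistics import mean

def sustained_breach(samples, threshold, window):
    """Return True if the mean of the last `window` samples exceeds `threshold`.

    `samples` is a list of numeric datapoint values in time order,
    e.g. one CPU reading per minute (hypothetical input shape).
    """
    if len(samples) < window:
        return False  # not enough history to judge yet
    return mean(samples[-window:]) > threshold

def always_above(samples, threshold, window):
    """Return True only if every one of the last `window` samples is above `threshold`."""
    if len(samples) < window:
        return False
    return all(v > threshold for v in samples[-window:])

# An oscillating hour of per-minute CPU readings: alternating 99 and 85.
cpu = [99, 85] * 30
print(sustained_breach(cpu, 90, 60))  # average is 92 -> True
print(always_above(cpu, 90, 60))      # dips to 85 every other minute -> False
```

    The oscillating series never stays above 90, so a plain "N consecutive polls" rule stays silent, while the averaged check fires.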

    • Like 1
  8. 22 hours ago, mnagel said:

    Thank you!  I have been asking for this via "proper" channels for some time with no results -- will try it out ASAP as I have an 8320 cluster waiting.

    FWIW, I recommend using a standard property name alongside the ssh.user/ssh.pass (e.g., lmconfig.enabled) to allow disabling this premium feature at the group (client) level when it has not been subscribed to.  I know it is an uphill battle to get those all fixed, but I sure wish it could be done.  We still cannot use the new AD and DHCP modules due to lack of ability to disable LMConfig per client.

    I guess not ASAP:

    This LogicModule is currently undergoing security review. It will be available for import only after our engineers have validated the scripted elements.

    I guess I will check back at some indeterminate date in the future :(.

  9. Thank you!  I have been asking for this via "proper" channels for some time with no results -- will try it out ASAP as I have an 8320 cluster waiting.

    FWIW, I recommend using a standard property name alongside the ssh.user/ssh.pass (e.g., lmconfig.enabled) to allow disabling this premium feature at the group (client) level when it has not been subscribed to.  I know it is an uphill battle to get those all fixed, but I sure wish it could be done.  We still cannot use the new AD and DHCP modules due to lack of ability to disable LMConfig per client.

  10. 20 minutes ago, Stuart Weenig said:

    The author needs to verify that the module has been published to the public repository. In simple cases, it's automatically made public at that point. If there is code in the module, it will undergo manual security review by the LM staff before it is made publicly available.

    Almost certainly there is code as Palo Alto checks virtually always require API access.  Review has seemed in most cases I have been involved with to be a mostly ad hoc process (or if not, definitely opaque). I suggested in one of our UI/UX meetings that there be a "Request Review" button or similar to create or escalate a request for security review.  As a bonus, use a ticketing system (this would be welcome for feedback as well, which as I understand generates internal-only tickets).  A unified customer visible ticket system for feedback and module review would be very helpful.

    • Like 1
  11. Been there, done that -- you can't reference those in widgets, sadly.  You have to just create your own datasource that sets values equal to the properties and then reference those.  The first time I ran into this, I wanted to chart device usage against subscription levels, the latter of which was a property.  In ours, the collector is a Groovy script that does nothing (not sure why that was how we did it, but it works).  The CDP is just equal to ##property## in each case.  It is Groovy mode, but the code is literally just that.

    [attached image: image.png]

    • Like 2
    • Upvote 1
    • Thanks 1
  12. FWIW, having also come originally from Nagios, I miss the ability to transmit arbitrary string data back via alerts. Some of this can be emulated with auto properties, but those can be set only during discovery, not collection.  I posted a feature request previously to allow definition of enums that can be bound to datapoints (global values and overridden values within specific datasources/datapoints). These could then be used to avoid the current awkward legend method and actually show the intended purpose of DP values where needed via tokens. Imagine a line that showed the actual meaning of the current value instead of a long truncated legend line that makes you dig around for what it means.

    I also think it should be possible to improve the property menus to leverage more advanced typing and UI.  For example, a property might be just a string as now (preferably with better input box control), or it might be a radio button, selection menu, etc. so that folks using properties can easily find what is supported and what values/ranges are allowed.  This also would be something where those hints would be defined within logicmodules primarily, but it should be possible to define them more generally (at least the typing/UI definitions, which could then be bound to properties that are used within modules).  This is not strictly related to the topic, but is about readability and usability so I tossed it in there, too :).

    • Like 1
  13. On 3/6/2020 at 6:35 AM, Stuart Weenig said:

    You would then need to look into changing strategies over to an EventSource. The EventSource would output words (like a log file) and you'd write a check to look for particular words to open a specific kind of alarm.  

    Eventsources don't support embedded Powershell, though they certainly should.  You can upload a script though.  That said, eventsources are also almost entirely unsuited for monitoring, more like additional information to see along with monitoring. Among other things, you cannot ACK them in a meaningful way due to lack of correlation across eventsource results.  I'm sure the yet-another-premium-module LMLogs will fix all those problems, though.

  14. I would not hold your breath -- I have had to fight just to get and keep SPF enabled on our email.  Regardless, even if you could use the builtin alerts with a distinct From address, you would still have portal links embedded in the message that reveal it is LogicMonitor.  You could do what we do and submit everything via a custom email integration (or a web integration via an API handler), then handle the data any way you like.  In our case, we feed the tokens into an actual template system to format messages using conditional logic and all that stuff missing in the LM blind token substitution method available normally.  We feed that transformed result into a ticket, but obviously it could be handled many different ways at that point, including re-routing via email with proper headers.  The downside of a custom integration is that there is a bug -- certain things LM simply does not send to those that are sent with the builtin integration (e.g., ACK and SDT notices). I have asked about this and in theory it might be fixed one day, but it has not been in well over a year since reported.
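    For anyone wondering what "conditional logic" buys you over blind token substitution, here is a small Python sketch of the kind of handler that could sit behind a custom integration. It is not our actual template system; the key names (host, level, status, datapoint, value) are illustrative stand-ins for whatever tokens the integration payload is configured to send.

```python
def render_alert(tokens):
    """Build a ticket subject and body from a dict of alert tokens.

    Key names (host, level, status, datapoint, value) are illustrative,
    standing in for whatever the integration payload is configured to send.
    """
    status = tokens.get("status", "active")
    prefix = "CLEARED" if status == "clear" else tokens.get("level", "unknown").upper()
    subject = f"[{prefix}] {tokens.get('host', 'unknown host')}"

    lines = []
    # Conditional sections: only mention fields that actually arrived.
    if tokens.get("datapoint"):
        line = f"Datapoint: {tokens['datapoint']}"
        if tokens.get("value") not in (None, ""):
            line += f" = {tokens['value']}"
        lines.append(line)
    if status != "clear" and tokens.get("level") == "critical":
        lines.append("Action: page the on-call engineer.")
    return subject, "\n".join(lines)

subject, body = render_alert(
    {"host": "edge-rtr-1", "level": "critical", "status": "active",
     "datapoint": "CPUBusyPercent", "value": "97"}
)
print(subject)
print(body)
```

    The same function produces a different subject and omits the action line when the status is "clear", which is exactly what plain substitution cannot do.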

    • Upvote 2
  15. Here are at least two items that need to be added to make the dashboard token feature more useful:

    • adjust widgets that cannot use tokens so they can (e.g., Alerts, Netflow, etc.)
    • allow arbitrary tokens to be inserted as needed within widget fields (e.g., device patterns, instance patterns, etc.)

    A concrete example of the latter came up this morning.  We have multiple locations with similar equipment for which we want to display Internet usage details, one set per dashboard (cgraph and netflow widgets).  The edge device names vary, as do the uplink ports to the ISPs in each location.  Cloning this dashboard solves virtually nothing as every single widget still requires editing.  If the tokens could be used, these dashboards could be cloned without the manual editing other than filling in the necessary tokens.  In some cases the tokens are insertable, but most fields do not allow them.  In this case, I defined various tokens like isp_1_name, isp_1_edge_device, isp_1_edge_port, etc. but could use them only in very limited ways, ultimately making the exercise pointless.

    As with many things, we can at least work around this with the API (at least I believe I could with some effort), but it would be much more accessible to folks if handled within the UI.
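    A rough Python sketch of what such an API-side workaround might look like: take a widget definition exported from a template dashboard and fill in per-site values before pushing it back. The ##name## marker syntax is assumed here, and the widget field names are made up for illustration; the real export format would need to be checked.

```python
import re

TOKEN = re.compile(r"##(\w+)##")

def fill_tokens(obj, values):
    """Recursively replace ##name## markers in every string of a nested structure.

    Unknown tokens are left untouched so they are easy to spot afterwards.
    """
    if isinstance(obj, str):
        return TOKEN.sub(lambda m: values.get(m.group(1), m.group(0)), obj)
    if isinstance(obj, list):
        return [fill_tokens(item, values) for item in obj]
    if isinstance(obj, dict):
        return {key: fill_tokens(val, values) for key, val in obj.items()}
    return obj

# Hypothetical widget definition; the field names are made up for illustration.
widget = {
    "name": "##isp_1_name## usage",
    "device": "##isp_1_edge_device##",
    "instances": ["##isp_1_edge_port##"],
    "height": 4,
}
site = {"isp_1_name": "ExampleNet", "isp_1_edge_device": "edge-fw-01",
        "isp_1_edge_port": "ge-0/0/1"}
filled = fill_tokens(widget, site)
print(filled)
```

    Because the substitution walks every string field, it does not matter which fields the UI happens to allow tokens in.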

    • Like 3
    • Upvote 1
  16. 5 minutes ago, mnagel said:

    The normal way I monitor services is via AD, but you would end up with a new instance wildvalue each time it was changed if you use the normal option (WMI-based datasource).  If you use Groovy script DS instead, you could strip the PID portion to build the wildvalue so that the data is stable.  There should be some examples of that in the existing datasource repo, need to dig around....

    Many examples of using WMI from Groovy, none that select from Win32_Service, but should be simple enough to adjust the query.  See Microsoft_LyncServer_StorageService as one example.
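    The stripping step itself is tiny. Here it is sketched in Python rather than Groovy just to show the logic; it assumes the unstable part is a trailing "_<digits>" suffix on the service name, which is an assumption to adjust to whatever the real names look like.

```python
import re

# Assumption: the unstable part is a trailing "_<digits>" suffix.
PID_SUFFIX = re.compile(r"_\d+$")

def stable_wildvalue(service_name):
    """Drop the changing suffix so the instance keeps the same wildvalue."""
    return PID_SUFFIX.sub("", service_name)

names = ["ExampleSvc_4312", "ExampleSvc_9120", "Spooler"]
print(sorted({stable_wildvalue(n) for n in names}))  # ['ExampleSvc', 'Spooler']
```

    Both ExampleSvc entries collapse to one instance, so history is kept across restarts.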

    • Like 1
    • Upvote 1
  17. 2 minutes ago, Stuart Weenig said:

    AppliesTo Functions is a misnomer. They should be AppliesTo Aliases.

    Yeah, brings back horrible memories of me requesting repeatedly the documentation on how to pass parameters and getting the most insane response from support :).

  18. 6 minutes ago, dcyriac said:


    Wish for a feature to capture the LogicMonitor settings and state of all the resources, resource credentials, websites, mapping, reports, exchanges and settings.

    Perhaps an offline backup so that we can restore monitoring for environments that were deleted without having to go through the scans or process of adding devices individually.

    Or to revert LogicMonitor to a state before or after an account impacting change.

    Best we have been able to do here is a script leveraging the API to download as many endpoints as we are able to access with checkin to a git repo.  Works, but needs frequent tweaking as things change on the backend. Having a way to revert to a previous snapshot or similar would be very handy.  My script came about originally after I implemented alert rule resequencing with an error and lost some rules.  My latest incarnation of this script has an option to check the items as well for problems (e.g., broken widgets).
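    The git-friendly part of such a backup can be sketched in a few lines of Python. This is not the actual script: it skips the API calls entirely and assumes each downloaded item is a dict with an "id" key, which is a guess about the data shape. The function name and directory layout are likewise made up.

```python
import json
import tempfile
from pathlib import Path

def dump_snapshot(root, endpoint, items):
    """Write one JSON file per item under root/endpoint, named by the item's id."""
    target = Path(root) / endpoint
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for item in items:
        path = target / f"{item['id']}.json"
        # sort_keys and fixed indentation keep diffs limited to real changes.
        path.write_text(json.dumps(item, indent=2, sort_keys=True) + "\n")
        written.append(path)
    return written

# Demo against a throwaway directory instead of a real repo checkout.
with tempfile.TemporaryDirectory() as tmp:
    paths = dump_snapshot(tmp, "dashboards", [{"name": "WAN", "id": 7}])
    snapshot_name = paths[0].name
    snapshot_text = paths[0].read_text()
print(snapshot_name)
print(snapshot_text)
```

    One file per object with stable key order means a commit after each run shows exactly which objects changed, which is what makes reverting a bad change practical.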

  19. Since I can't wait for this, we now have code to grab widget data for all supported widget types incorporated into our existing backup script (pulls virtually anything I can from the API into a Git repo regularly).  Most issues can be detected via exception (non-200 status code), some require a bit more analysis (no data in any line in a cgraph, for example). Working reasonably well now for the first phase, which is to be aware of busted widgets before we are embarrassed during client review. Next phase will be to analyze data more specifically to the context (once I figure out how to represent the widget check requirements).
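    The two detection cases mentioned (non-200 status, and a cgraph with no data in any line) could be expressed roughly as below. This is a sketch, not the real check: the "lines"/"data" layout is a hypothetical shape for a graph widget's response, not a documented API contract.

```python
def widget_problems(status_code, payload):
    """Return a list of problem strings for one widget's data response.

    `payload` is the decoded JSON body. The "lines"/"data" layout used here is
    a hypothetical shape for a graph widget, not a documented API contract.
    """
    if status_code != 200:
        return [f"HTTP status {status_code}"]
    problems = []
    lines = (payload or {}).get("lines")
    if lines is not None:
        # A graph where no line has any real value is treated as broken.
        has_data = any(
            value is not None
            for line in lines
            for value in line.get("data", [])
        )
        if not has_data:
            problems.append("no data in any line")
    return problems

print(widget_problems(404, None))
print(widget_problems(200, {"lines": [{"data": [None, None]}, {"data": []}]}))
print(widget_problems(200, {"lines": [{"data": [None, 3.2]}]}))
```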

  20. I am incorporating this into my resource check script.  For now, the first item I am testing is Windows_DHCP, which requires that the DHCP Server role is installed.  We have an auto.winfeatures property that is populated by a PropertySource.  I don't recall where we got it, somewhere in these forums IIRC :).  The PS code is simple:

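    import com.santaba.agent.groovyapi.win32.WMI

    // Query the installed Windows server features on this host via WMI
    def hostname = hostProps.get("system.hostname")
    def my_query = "Select NAME from Win32_serverfeature"
    def session = WMI.open(hostname)
    def result = session.queryAll("CIMv2", my_query, 15)
    // Emit the feature names as the WinFeatures auto property
    println "WinFeatures=" + result.NAME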

     

    If this list does not include DHCP Server and we have Windows_DHCP assigned, it will trigger a warning.  I plan to extend this to catch more stale categories.
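    The comparison itself is simple enough to sketch in Python. This is not the actual check script; the mapping of category to required feature is a hypothetical table with a single entry taken from the example above, and the inputs are assumed to be plain lists of strings.

```python
# Hypothetical mapping of category -> Windows feature name that must be present.
REQUIRED_FEATURE = {"Windows_DHCP": "DHCP Server"}

def stale_categories(categories, features, required=REQUIRED_FEATURE):
    """Return categories whose required feature is absent from `features`."""
    installed = {f.strip().lower() for f in features}
    return [
        cat for cat in categories
        if cat in required and required[cat].lower() not in installed
    ]

print(stale_categories(["Windows_DHCP", "snmp"], ["DNS Server", "File Server"]))
print(stale_categories(["Windows_DHCP"], ["DHCP Server", "DNS Server"]))
```

    Extending the check to more stale categories then only means adding rows to the mapping.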

    My check script tests a bunch of things, including lack of any FQDN or expected FQDNs, lack of NetFlow data (the new heartbeat datasource is not helpful there as it does not care if valid data arrives), and other stuff that can go wooorng.