
Posts posted by mnagel

  1. 1 minute ago, Stuart Weenig said:

    Agreed, a DataSource would fix the multiple alerts that result from an EventSource running.

    Although I think the larger question of alert correlation (multiple alerts being statically or dynamically grouped into incidents) is something you should be requesting from your CSM. Even something like occurrence counts on alerts would be good. The same problem happens with SNMP traps; traps can come in every minute about the same thing still being in an unwanted state. Each one should just increment a counter on the alert. Counter thresholds should be something we can add to alert rules. Even regular datapoints could benefit from this, counting the number of poll cycles/minutes that a particular metric has been over threshold.

    Submitting to the CSM sounds like it should work, but here is what usually happens: "You should submit a feature request or feedback item." To me, those have a pretty small chance of success, so I have stopped trying except in a few cases. I was once able to peer into the feedback tickets via export and discuss them with our CSM, but those are normally complete black holes. Feature requests rarely result in any constructive activity, and they lack basic support for escalation, voting, etc. Really we need one ticket system able to track all of these things with suitable categories (which I have also suggested multiple times).

    And yes, every event source should have the ability to correlate new events with open events. I have been pushing for this for a long time, but I suspect now the answer is "get LMLogs" and this will never get any traction.

    Being able to get averages of datapoints over time has also been a long-standing open request. This is important for finding issues where the status might oscillate but overall levels are high (e.g., resource usage like CPU, bandwidth, etc.).

  2. 2 minutes ago, Shack said:

    This is doing exactly what we want but with one problem.  How do you stop a scripted event source from creating duplicate alerts every time it connects and runs?  Hmmm I wonder if I can do something with my Escalation Chain.

    It would be awesome to be able to suppress these IF it detected the same port was disabled and an existing alert was already active based on message matching or something.  I need a checkbox similar to the checkbox on the Windows Event Logging type Event Source.

    Event sources are a poor solution for generating alerts, though it is very desirable that they could be. I have long requested a way to correlate events via a key extracted from the event, so you know it is the same event (this is trivial with many event solutions, including the incredibly awesome FOSS SEC tool). Among other things, you cannot even effectively ACK an event, since the next run is a brand-new result, but the email instructions still list ACK as an option and our clients believe it works.

    I think the only reasonable solution is to redo the code into a datasource, like originally discussed in this thread.
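
    To illustrate the correlation-by-key idea above, here is a minimal sketch -- not LogicMonitor functionality, just the generic technique -- of keying events so repeats increment a counter instead of opening duplicate alerts. The regex and message format are made up for the example:

```python
# Hypothetical sketch: correlate repeated events by an extracted key so
# duplicates increment a counter instead of opening a new alert.
import re

open_alerts = {}  # correlation key -> occurrence count

def handle_event(message, open_alerts):
    # Extract a correlation key, e.g. the port name from an errdisable event.
    # The regex here is illustrative, not any vendor's actual log format.
    m = re.search(r"port (\S+) err-disabled", message)
    key = m.group(1) if m else message
    if key in open_alerts:
        open_alerts[key] += 1   # same condition: bump the counter
        return None             # suppress the duplicate alert
    open_alerts[key] = 1
    return f"NEW ALERT: {message}"  # first occurrence: raise an alert

print(handle_event("port Gi1/0/7 err-disabled", open_alerts))
print(handle_event("port Gi1/0/7 err-disabled", open_alerts))  # suppressed
print(open_alerts)
```

    With counter thresholds layered on top, an alert rule could then fire only after N occurrences -- exactly the behavior missing from scripted EventSources today.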

  3. OK, cool.  It has only been there for years now, so I am sure it will be reviewed very soon :).  More seriously, I have previously asked the LM Exchange developers for an "escalate" button to get more attention on these, but so far I think there is no reliable process for getting code approved.

    • Haha 1
  4. 2 hours ago, Shack said:

    Mind sharing yours?  I think I may have a need for this and I am licensed for LMConfig

    It was published to the Exchange as H4T9GH, but it is basically what LM support provided, with some tweaks.  As an Event Source, it has the same poor behavior as all Event Sources: you cannot practically ACK them, only add SDT.  It is also not universal, since there are different ways to get this info on different platforms.

    I like the idea of converting to a DS version with instances like the first post mentioned, and of course we are all still waiting for that promised core LM release real soon now :).

    • Like 1
  5. Please! I created a facility for this years ago with Nagios via callbacks in our notification template processor (actual templates with conditionals, etc.), but that would be tricky here. You basically need a trigger that runs callbacks or similar when alerts fire, with the results placed into a token.  My guess is it won't happen unless it can be monetized somehow :(.

    • Like 1
  6. Something that strikes me as I delve into this further is that the filter syntax is largely undocumented.  There are examples in the legacy docs, but nothing really in Swagger.  I have figured it out mostly via trial and error and the occasional support ticket. It is clear you can use AND logic with comma-separated components, but it is not clear whether you can reference the same LHS multiple times.  There is really no indication you can implement OR via a filter short of glob expression matching.  The only documentation I can find on filters specifically relates to limitations added for embedded special characters in the v2 API.  Perhaps the API team could document the various common parameters in Swagger or elsewhere?
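
    For reference, here is a minimal sketch of how I build filters today -- comma-joined conditions for AND, with glob matching as the closest thing to OR. The helper and the specific field names are illustrative, pieced together from trial and error rather than official documentation:

```python
# Sketch of building a comma-joined (AND) filter string for an LM REST query.
# The exact filter grammar is the underdocumented part discussed above.
import urllib.parse

def build_filter(conditions):
    # Each condition is (field, operator, value); commas AND them together.
    return ",".join(f"{field}{op}{value}" for field, op, value in conditions)

flt = build_filter([
    ("name", ":", "prod*"),                    # glob match on device name
    ("customProperties.name", ":", "location"),
])
query = urllib.parse.urlencode({"filter": flt})
print(query)
```

    Whether the same left-hand side can appear twice in one filter, or whether any OR form exists, is exactly what I have been unable to confirm.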

  7. What I was thinking of was that the filter is not valid -- you cannot match only on values.  Well, you can, but the value is then detached from the property name and could match many properties.  You need to match on name and value together, e.g., customProperties.name:NAME,customProperties.value:PROPVALUE

    The /device/groups idea is a good one if you are not matching on wildcards, like in this case (though you could use two passes to get an ID list, then iterate).  We have found that sometimes that is necessary due to a lack of endpoints (e.g., there is no direct way to map a device datasource instance ID back to the datasource ID), but if you can do your work with one query, you should.
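
    The two-pass pattern looks roughly like this sketch; `api_get` here is a stub standing in for a real authenticated REST call (e.g. a signed requests.Session GET), and the endpoints and filters are illustrative:

```python
# Two-pass pattern: first resolve matching group IDs, then iterate per-ID
# queries for data a single filtered request cannot reach directly.
def api_get(path, params=None):
    # Stub: in real use, perform an authenticated GET against the LM API.
    fake = {"/device/groups": {"items": [{"id": 12, "name": "NYC"},
                                         {"id": 34, "name": "NYC-DMZ"}]}}
    return fake.get(path, {"items": []})

# Pass 1: get the IDs that match the (possibly wildcard) criteria.
group_ids = [g["id"] for g in api_get("/device/groups",
                                      {"filter": "name:NYC*"})["items"]]

# Pass 2: iterate per-ID requests.
for gid in group_ids:
    devices = api_get(f"/device/groups/{gid}/devices")["items"]
    print(gid, len(devices))
```

    Two passes doubles (or worse) the request count, which is why a single filtered query is preferable whenever the endpoint supports it.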

  8. The good news is that it seems dev has finally released new modules that are more configurable, but I have not looked at how much complexity was shifted to PropertySources and how maintainable that will end up being.  They are still tied to ssh.user/ssh.pass, though, so you still run the risk of incurring costs unexpectedly if you use those for non-LMConfig reasons (like errdisabled port detection).  I think it is possible to disable LMConfig modules via subtree alert tuning, though, so that may mitigate the risk.

    OSS tools do a much better job than LM did previously; I am hoping this brings some parity (and fixes for the non-change thrashing we see all the time).  I would never dream of editing a 1200+ line module and then having to merge changes into updates later.

  9. 8 minutes ago, Brandon said:

    That being said - if I'm understanding you correctly, it sounds like the solution I posted above might negate the need for the API accounts you mentioned.  So if you migrate your primary API functions to use the native java library, you can shut off the API keys entirely and just let the collector do its thing.  I'm no security expert so please correct me if my assertion is incorrect.  

    In any case - please reply to this post if you start playing around with the library.  I'm really curious if it works well for others.  So far it's been a lifesaver for me.  



    Excellent point -- the other functions we need could likely be satisfied by an API user with a manager rather than admin role.  I will see if we can leverage the library to avoid needing an admin API user -- thanks!  That said, it should still be possible to bind an allowlist to any API user to limit the attack surface.  I as well can dream...

    • Like 1
  10. I have been aware of the debugger method for some time -- I was not familiar with the secret debugger library, but you can access the debugger similarly via the API.  So... sleep well knowing that any leaked set of admin API keys could expose your entire network to remote attack via arbitrary PowerShell scripts executed through the debugger API.  I was forced at the time to use that method to set the system.ips list to fix NetFlow ingestion for Palo Alto firewalls at the 5000 series or higher.  No alternate method of binding device NetFlow export has yet been provided.

    Recognizing how dangerous this was, I asked about having certain API calls like this locked to an allow list, but that went nowhere.  I have also tried changing Windows collector service accounts to use the Performance Monitoring group rather than Domain Admins (especially after the SolarWinds hack), but I found too many things broke, so I had to move back.  Even today, well after the damage done during the SolarWinds hack via lateral movement from compromised servers, LM collector installation instructions still include "If this Collector is monitoring other Windows systems in the same domain, run the service as a domain account with local administrator permissions."

    Tick, tick, tick...

    • Like 1
  11. No, you are correct -- datasources store time-series numerical data only. Various datasources tie themselves into knots trying to work around this limitation via datapoint legends.  I recommended a while back adding a per-datapoint enum facility so those values could be properly displayed in charts as meaningful strings, especially since legends sometimes get so long (and don't wrap) that you literally have to open the DS source code to find out what a value means. I never saw even a peep from LM on that sensible fix, sadly.

  12. Typically this is done via autodiscovery, but if you add manual instances you can manually define properties for them. For the AD method, each instance is generated with the normal fields followed by an optional list of property/value strings.  Assuming AD runs often enough, those strings should be current (more or less) for reference in custom alert messages via unconditional token substitution.

    You can also use PropertySources to add auto properties if you want to do that without editing an existing datasource.

    If you need examples, Arista_Sensor_Fans is one of many datasources that generates auto properties.  Or, look at almost any PropertySource module.

    I would not add any manual properties to automatic instances as those would likely vanish at some point.
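
    As a sketch of what AD output with instance-level auto properties looks like -- the `##`/`####` separator layout here is from my recollection of the discovery output format, so double-check it against current LM docs and a module like Arista_Sensor_Fans before relying on it:

```python
# Sketch of emitting active-discovery instances with instance-level auto
# properties appended after the standard wildvalue/name/description fields.
fans = [
    {"wildvalue": "1", "name": "Fan1", "rpm_max": "9000", "slot": "A"},
]
for f in fans:
    props = "&".join([f"auto.rpm_max={f['rpm_max']}", f"auto.slot={f['slot']}"])
    print(f"{f['wildvalue']}##{f['name']}##fan sensor####{props}")
    # -> 1##Fan1##fan sensor####auto.rpm_max=9000&auto.slot=A
```

    The resulting auto properties then show up on the instance and can be referenced in alert message tokens, as described above.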

    • Like 1
  13. Sure, you can use Service Insight for this, but it is a premium feature -- an expensive mallet for something that should be available without that extra cost.  Or there should be a Service Insight Light for this stuff, leaving the costly part for the intended enhanced features of Service Insight (like Kubernetes).

    My recommendation on this was to extend cluster alerts so you could at least match up instances.  My use case at the time was to detect an AP offline on a controller cluster.  There is no way to do this without SI, which as you say is complex, and it is an extra cost.  We need stuff like this in the base product.

    • Like 1
  14. There is a way to do this, but it is not well documented and there is no UI exposure for "No Data" alerts; you have to dig around the module sources to find them (because it is very hard to put an indicator in the alert tuning thresholds, I guess).

    We have standard alerts on 2 datapoints that have No Data alerts and no other alerts.  The first is for "Host Uptime" -> SNMP_HostUptime_Singleton -> Uptime and the second is for Uptime- -> * -> UpTime.  If a host stops responding to SNMP, those will trigger.  We keep them near the end of our alert policy to generally report to our team across all clients.

    • Like 2
  15. What you want is a dynamic template processor, but all we have is simple token substitution, with no indication that will ever change (I have asked repeatedly for years).  You can route alerts to an external integration, which is how we handle transformation of tokens, but you lose some things that way depending on how you integrate.  For example, we use an external email integration into our ticket system with a filter that handles the transformation, but custom email integrations do not get the same handling as builtin email for certain things (e.g., you do not get ACK or SDT notices).

  16. My two cents -- I gave up on using syslog and most other eventsources a long time ago due to the lack of basic correlation features. At the time, Cisco logs weren't even parsed correctly in our client environments and it took forever to get that dealt with.  We have since moved to SumoLogic for log processing, since we can run queries on the data over time and get meaningful results (and, if needed, tie into LM via the SumoLogic API).  LM also realized the existing functionality was limited, so it bought a company and added LMLogs as a premium addon.  That is fine, but some basic ability to correlate "regular" events (even just counts over time based on custom cross-event ID extraction) should be included in the base license.  We still use eventsources for Windows event logs to have the extra info visible, but we always have to warn folks not to bother ACK'ing any that generate email, since that does nothing meaningful.  I have asked that the ACK functionality be removed for eventsources as well (SDT is still helpful).

  17. 1 minute ago, Todd Theoret said:

    Thank you Stuart. Any chance there is a potential solution being developed which would allow....manually tagging....devices?

    So this has been a big issue for us -- everything was tossed under the topology umbrella for alert suppression, with no easy way to manually create dependencies.  There are many topologies that are simply not discoverable, like multipoint/mesh WAN topologies and really anything not handled by topology sources.

    The good news is that a kind support tech provided me a Manual_Topology module that manually linked various devices that eluded auto-discovery.  The bad news is that it is awkward and leverages hardcoded device names and MAC addresses.  But it is possible.  IMO the UI and/or API should be extended to support manual links.  It is a last resort, of course, but there are common cases where it is the only resort.

  18. Check Mike Suding's blog page -- lots of cool stuff, including this. A bit old, but probably still works :).

    As far as the debugger goes, yeah -- that stuff freaks me out a lot given that LM more or less requires Domain Admins on collectors (it really should be Performance Monitoring Users, especially after the recent SolarWinds incident).  You can run those debugger commands from the API as well, which is even scarier.

  19. I 100% agree this is needed -- we have to hack around it all the time with escalation chains that have one or more empty stages, and even then that does not prevent alerts from registering in the system.  But this is just one case that would be trivial to solve with DS inheritance, something I have been pushing for well over four years now. The issue with creating new DSes is that they become freestanding clones, meaning each must then be maintained independently (and this is commonly pushed by support as a solution, sadly).  If we could just get inheritance done (not just for DSes, but that would be the highest impact), it would be easy to make a copy that does what you want, changing only the parameters you care about, while still getting the benefit of updates on the parent module with minimal maintenance.  It would be important that child modules' applies-to expressions are automatically excluded from the parent chain, too.

    A related change for alerts that would not be solved by inheritance, but which I also benefited from in our previous tool, is threshold calculation over time.  For example, I don't care if CPU is high on a Windows server for a few minutes, but I do care if it is high for an hour. I also need to know if the average is high over a period when the actual level may be oscillating (LM would not generate any alerts otherwise).  With Nagios we did this by calling back to the pnp4nagios RRD data to calculate averages, slopes, etc.  This could be done in LM if using the API from within modules were properly supported, but I refuse to go there until there is library support within the module system.
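
    A rough sketch of the kind of average-over-time evaluation I mean (generic logic, not LM functionality): alert only when the mean of the last N polls breaches the threshold, even if individual polls oscillate:

```python
# Sketch of "threshold over time": alert when the average of the last N
# polls exceeds the threshold, rather than on any single poll.
from collections import deque

class RollingThreshold:
    def __init__(self, window, threshold):
        self.samples = deque(maxlen=window)  # keeps only the last N polls
        self.threshold = threshold

    def add(self, value):
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        avg = sum(self.samples) / len(self.samples)
        # Alert only once the window is full AND the average breaches.
        return full and avg > self.threshold

cpu = RollingThreshold(window=4, threshold=80)
for v in [90, 70, 95, 85, 60, 95]:   # oscillating, but high on average
    print(v, cpu.add(v))
```

    The same structure also covers the "high for an hour" case: just size the window to an hour's worth of poll cycles.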

    • Like 1