mnagel

Members
  • Content Count

    484
  • Days Won

    81

Everything posted by mnagel

  1. Yes, for Windows events you can do this -- we do as well. You lose the event detail, but it can alert only if N events in a window are seen (something customers ask for often). Even then, since the "collect every" value is not visible to the script, you have to take special care to ensure your event scan window and the collect every value are in sync. And this does nothing for any other type of event -- we have to use Sumo Logic (or other similar tools, like Graylog, etc.) to solve this problem in general.
  2. We usually bind this check to a DC, so I don't think I have run into that before. I definitely have for other stuff, like DHCP. Can't hurt to warn folks in the technical note section!
  3. I am a big believer in having the computer do the work for me, but for website checks humans are forced to do a lot of unnecessary work to find the cause. Generally, we set overall to critical and individuals to warning. My request is that there be a method to also send the individual alerts when the overall triggers. In the absence of a dynamic templating system, the best option I can think of here is to have a way to elevate individual contributors to critical when the overall status triggers. Or, change the behavior so that the individual details are added to the generic overall status
  4. I was not actually thinking that, but sure, I really would like one! Seems like a way to possibly do some of the cross-device correlation I have been wishing I could do. Just would not dream of touching without docs....
  5. So you know what I am going to ask next, right? What is CollectorDb and where is that documented? Feels a bit unfair to us poor mortal developers :).
  6. Are you sure about the batchscript limitation? Because the new SNMP_Network_Interfaces DS supports subrate ifSpeed designation as ILPs and is batchscript (we know this because it blew out several collectors due to default batchscript thread counts). I have not looked at the guts yet, so perhaps I am off track there.
  7. I found only one instance of WMI.open() in the datasource set we have loaded -- in Citrix_XenApp_UserExperience: def session = WMI.open(hostname); def active_apps = session.queryAll(namespace, "SELECT * from Citrix_Euem_ClientStartup", 15); The implication is it would have to handle wmi.user/wmi.pass behind the scenes. All other references are for WMI.queryAll and WMI.queryFirst with the same implication. If none of the modules properly handle those properties, then there are a lot of broken modules: [mnagel@colby datasources]$ egrep -l 'WMI\.(query|open)' * Citrix_XenApp_
  8. I would say if the instance can be uniquely identified with data on hand (as described above), then the datasource should be using that as the instance wildvalue, not some arbitrary other thing that could cause excess instances due to customer action or anything similar. As far as data retention, I have found that decisions are often made that lead to loss of data and it is distressing. I just had a case where I pointed out that a datapoint label had a typo. Fixed, but the fix kills all old data for that datapoint. Why must the label be the index rather than a label tied to a persisten
  9. To add to this -- there is in fact a pair of datapoints in the new SNMP_Network_Interfaces datasource that could be used to detect aggregate speed problems, a proxy for channel member loss (inInterfaceSpeed and outInterfaceSpeed, set to the reported speed unless overridden by ILPs). To use this, you would either have to set a threshold on the specific datapoints needed and deal with generic alerts, or add a virtual datapoint with a proper alert message, making future module updates painful. I really wish that the F/R I posted years ago to allow for LogicModule inheritance was implemented...
  10. Yes, I also pull data from ConfigSources to check into Git repos with post-commit triggers to send email. That is how we can find out what actually has changed (and we have had to add a decent chunk of post-processing in some cases as the ConfigSources often suck in ephemeral updates, thrashing the changelog). In this case, I really don't want to pull so much of something that should be handled within LM into external script processing. My method works fine now, just unfortunately you cannot send ILP tokens into PowerShell collection scripts so the list of groups must be hardcoded.
  11. Yeah, we ended up having to pay extra for Sumo Logic, but it could be anything. Still would be nice to have the barest level of correlation so you could effectively ACK events.
  12. There is no Cisco MIB for EtherChannel that I have found (in general -- may be something for some specific device types). Same for other aggregates, like PPP multilink. In our experience, the only safe way to detect aggregate issues is to monitor the reported speed of the bundle, as the underlying link does NOT necessarily need to be down for an aggregate to lose members (seen it happen when carriers do testing and fail to put the member back into the bundle). Unfortunately, despite previous requests, the interface datasource still does not report ifSpeed so you cannot set a threshold on tha
  13. Oh yes, now I recall why I did not consider LMConfig. Changes cannot be sent (unless you use an external API script), so you just get the red light alert ("something changed"). My current alert message includes the current and expected lists. The alert communication channel in general is fairly small and I hate to constrain it further.
  14. Writing to a file is more or less how we used to do it with Nagios (technically, it was a File::Cache object), but that was with a central Nagios server + Gearman and those checks always ran on the central server. With LM, this would be suboptimal in the face of collector pools (coming, one day) or collector failover. IMO, the right tool for that is a distributed key/value store. As for LMConfig -- it is a premium feature, and by default we don't push our clients to add it to their cost structure to achieve things that should be possible without it. In this case, you could get away with
  15. In our previous life, we had written a Nagios plugin to check whether a sensitive Windows group had changed (e.g., Domain Admins). I created a replacement for this within LM, but since we can't really keep track of deltas without a key/value store, we use a property for each group that specifies the expected members, which should be updated when membership changes intentionally. We also use a property to list the groups for AD so we can store useful ILPs, but since those ILPs are not passed to the collection script (they could be, just are not currently passed for PowerShell), the list of gr
  16. Posted new version with datapoint messages and revised out of the box thresholds. Did not change AD or collection code, but it still shows pending review. PE9KPD
  17. Thanks! I need to make one more pass on it to enable custom alert messages. I added two different virtual datapoints so messages can say "expired XX ago" versus "will expire in XX". Looking forward to one day being able to just check stuff when alert messages are actually handled by template processors :).
  18. Got it -- that is subtle! New version posted -- P3GXE7
  19. Thanks! And, we shall see :). I stepped away for now on the whole data collection timeout thing to clear my head. Feels like it is LM causing it, but can't see how. I based the general structure on the "_Windows patches needed" DS Mike Suding wrote. Also tried the PSSession avenue other DSes use, but made no difference. Same exact code run from the collector PS shell returns data quickly. Hopefully we can figure it out -- this information is otherwise hard to get from external tests.
  20. One thing I have noticed over time is how often I find that there is a datapoint somewhere that really deserves to be included in an alert rule, but you just don't know this until after you get bitten. This issue is orthogonal to threshold severity, at least as far as some of the modules I have seen. An example fresh from today was loss of power in an environment with no indication it happened. After some checking, found the Cisco FRU Power DS had an update and afterward showed when power loss (and other related issues) happened. Whoever wrote this one decided each class of issues would be
  21. Please add org.tinyradius.util.RadiusClient or similar so we can create RADIUS protocol checks. Yes, I know I can work around this by deploying JAR files, but since that is not a process managed by LM, it is problematic to deal with (we cannot use Puppet or similar on _every_ collector system).
  22. I have written a DS that uses PowerShell to discover any SSL Certificate within the Windows certificate stores and generates alerts for those expiring soon and for those that have already expired. The alert messages are still generic as I am fighting a weird timeout issue with the data collection code against remote devices. The AD code works fine and the data collection code is virtually identical, simpler in fact as we have the serial number on hand. If I run it from the collector itself in a PS console, it also works fine. Just seems to go to lunch when run from within LM itself. If an
  23. OK, thanks. These all were shared here, so not sure why they became private. Will check...
  24. Sounds great, will check it out! I wrote something similar for interfaces way back for similar reasons -- LM limitations and no clear focus on nuts and bolts stuff like this. My largest remaining annoyance on that is we use patterns via properties that examine the interface description so we can disable monitoring and/or alerting for interfaces based on description patterns (e.g., "Workstations"). However, because AD will skip interfaces that are operDown, you cannot change the description after it is down and have the script make the necessary change. You could change the DS to discover operDown i
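
Item 1 in the list above describes alerting only when N events are seen in a window, and notes that the scan window has to stay in sync with the "collect every" value. A minimal sketch of that counting logic in plain Python, not LogicMonitor code; the function names, the 300-second interval, and the sample timestamps are all assumptions made up for illustration:

```python
from datetime import datetime, timedelta

def count_recent_events(event_times, now, window_seconds):
    """Count events whose timestamp falls inside the trailing scan window."""
    cutoff = now - timedelta(seconds=window_seconds)
    return sum(1 for t in event_times if cutoff < t <= now)

def should_alert(event_times, now, window_seconds, min_events):
    """Alert only when at least min_events landed inside the window."""
    return count_recent_events(event_times, now, window_seconds) >= min_events

# Hypothetical values: the script cannot see the collection interval,
# so it is hardcoded here and must be kept equal to the scan window by hand.
COLLECT_EVERY_SECONDS = 300
now = datetime(2020, 1, 1, 12, 0, 0)
events = [now - timedelta(seconds=s) for s in (10, 40, 200, 400, 900)]

print(should_alert(events, now, COLLECT_EVERY_SECONDS, min_events=3))
```

If the window were longer than the collection interval, the same event would be counted in two consecutive polls; if shorter, some events would never be counted at all.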
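
Items 13 and 15 above describe comparing a group's current members against an expected list held in a property, with an alert message that carries both lists. A rough sketch of that comparison in plain Python; the comma-separated property format, the function names, and the sample members are assumptions, and the original check runs in PowerShell rather than anything like this:

```python
def parse_members(prop_value):
    """Split a comma-separated value into a set of trimmed, lower-cased names."""
    return {name.strip().lower() for name in prop_value.split(",") if name.strip()}

def membership_delta(expected, current):
    """Return (unexpected, missing) member names as sorted lists."""
    return sorted(current - expected), sorted(expected - current)

def alert_message(group, expected, current):
    """Build a message with both lists, or None when membership matches."""
    unexpected, missing = membership_delta(expected, current)
    if not unexpected and not missing:
        return None
    return (f"{group} membership changed; "
            f"expected: {', '.join(sorted(expected))}; "
            f"current: {', '.join(sorted(current))}")

# Hypothetical data: the expected list would come from a per-group property.
expected = parse_members("alice, bob")
current = parse_members("Alice,bob,mallory")
print(alert_message("Domain Admins", expected, current))
```

Because the expected list is stored rather than computed from a previous run, no key/value store is needed, but the property has to be updated by hand after every intentional change.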
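
Item 17 above mentions two virtual datapoints so that alert messages can say "expired XX ago" versus "will expire in XX". One way to picture the split, as a sketch in plain Python; the function names, the day-based units, and the dates are illustrative assumptions, not the datasource's actual datapoints:

```python
from datetime import datetime

def days_until_expiry(not_after, now):
    """Positive while the certificate is valid, negative once it has expired."""
    return (not_after - now).total_seconds() / 86400

def expiry_message(not_after, now):
    """Pick the wording based on which side of the expiry date we are on."""
    days = days_until_expiry(not_after, now)
    if days < 0:
        return f"expired {abs(days):.0f} days ago"
    return f"will expire in {days:.0f} days"

# Hypothetical dates for illustration only.
now = datetime(2020, 1, 10)
print(expiry_message(datetime(2020, 1, 20), now))
print(expiry_message(datetime(2020, 1, 5), now))
```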
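
Items 9 and 12 above suggest watching the reported speed of a bundle as a proxy for lost channel members. A small sketch of that arithmetic in plain Python, assuming equal-speed members and that the expected member count and per-member speed are supplied by hand (for example via a property); every name and number here is hypothetical:

```python
def missing_members(reported_mbps, member_mbps, expected_members):
    """Estimate how many members dropped out, from the bundle's reported speed."""
    expected_mbps = member_mbps * expected_members
    if reported_mbps >= expected_mbps:
        return 0
    active = reported_mbps // member_mbps  # whole members still contributing
    return expected_members - active

# Hypothetical 4 x 1 Gbps bundle currently reporting 3 Gbps.
print(missing_members(reported_mbps=3000, member_mbps=1000, expected_members=4))
```

This only works when the device lowers the aggregate's reported speed as members leave, which is the behavior the posts above rely on.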