Everything posted by mnagel

  1. The main metric you would care about is related to time since last TCN per VLAN. I have a DS that will get this for the default VLAN, but had a lot of trouble with any others due to lack of context support in LM. I have been told there is context support in the 28.500 collector and later, but have not yet had a chance to test. I published what I have for now as 969G49, but the new version will have to be done as a Groovy script to leverage the context feature.
  2. The strings are host properties, so set them on the collector you want to run this from; they have to be bound to a collector host. As written, it supports only a single remote SFTP test. If you wanted to do more, you would need to rewrite it to handle instances, either manually defined or via Active Discovery. I do the latter often, even with a manual property list, as it is the only way to define automatic instance properties. It may be possible to do this via an internal "website" check, but I have not tried going full Groovy on those yet :). Even then, each would be a separate copy of code, so better to use this and perhaps extend it to support multiple checks.
  3. This is simple enough, except there are two issues: the default alert subject is annoying and confusing to clients -- it is easily fixed, but.... recovery alerts don't use the custom alert subject so you still get the annoying version for recoveries (unless you need them to ensure ticket closure). Luckily, for reboots you don't usually want a recovery alert -- just one alert to let you know it happened. Our catchall is this, with specific versions for each client: Our custom version of the alert subject is "##HOST## Rebooted ##VALUE## seconds ago"
  4. For SNMP faults (assuming that is the issue here), we have some standard rules in all the clients we manage: This seems to do the trick most of the time, but I am sure there are cases we are still missing. Those datapoints have "No Data" enabled and are not used for thresholds otherwise, making them unambiguous.
  5. Right. "no data" should be on par with critical, error and warning so the threshold can be overridden properly if needed for specific devices/groups. It is very hard to know without digging into each DP definition when "no data" will even alert (no indication in the tuning page) and it is definitely unclear when you can unambiguously check (most times at least, the "no data" status applies to a datapoint that otherwise has no competing alert threshold). I also run into embarrassing situations where data acquisition has silently failed due to collection faults -- LM has been adding more troubleshooters to help there, which is appreciated. Still, I just ran into an issue with the "new/improved" Cisco WLC datasource group where APs that are disconnected don't even have a "no data" DP anymore, so you cannot know this happened. In fact, that DS now removes dead instances immediately. Realized this weekend as we moved from Cisco WLC to Meraki that not a single alert generated from WLC APs as they were disconnected -- they just vanished from existence. Technically true, but should be in alarm until confirmed intentional. The only way to fix this now is to edit the DS, which I loathe doing as it (for now) severs my tie to the original and creates a risk that an update will break changes. I am not a fan of cloning as it also severs the tie to the original and makes updates painful.
  6. I intend to extend this to include more track-worthy account attributes such as Expired, Locked, etc, but to start I wanted to enable expiration tracking for domain admin accounts as we can get caught off-guard on those when they happen unexpectedly. This involved creating a new propertysource that tags domain controllers with one or more categories tied to their FSMO roles, then for the PDCEmulator role (arbitrarily chosen, mainly wanted to pick just one), scans the Domain Admins group list and reports days until expiration. No graphs or thresholds yet, will be extending soon. May also genericize a bit and use an input property for the list of groups to include (with a default of Domain Admins). DataSource: YDYFXH PropertySource: 2JNTAL
  7. We export pretty much everything regularly via the API and check into a git repo. This allows us to track changes (always fun to find LM devs testing module changes) and, more importantly, allows us to grep for things in modules. Among other things, this simplifies finding code examples for starting points on new modules. Our export/backup module does have per endpoint filters since we have found a number of fields that are ephemeral (e.g., timestamps) and must be suppressed to avoid checkin thrashing. Our current export list includes the following. There are a few other features in this, like grouping by client using a client property, saving items one per file and special formatting options to allow the results to be accessible (default output for Groovy code is one long line, making diffs painful). I could be convinced to share this one, but if you just need one thing like configsources, a specific API script for that may be easier. Regards, Mark

      my %BACKUPLIST = (
          "/setting/alert/chains" => {
              file => "escalation-chains",
              filter => \&filter_escalation_chains,
          },
          "/setting/alert/rules" => {
              file => "alert-rules",
              filter => \&filter_alert_rules,
          },
          "/setting/alert/dependencyrules" => {
              file => "alert-dependencyrules",
              # filter => \&filter_alert_dependencyrules,
          },
          "/setting/admins" => {
              file => "users",
              filter => \&filter_users,
          },
          "/setting/admin/groups" => {
              file => "user-groups",
          },
          "/setting/datasources" => {
              file => "datasources",
              ## metadata => "/setting/registry/metadata/datasource", # no way to get this info currently
              saveoptions => { mode => "oneperfile", format => "dumper" },
          },
          "/setting/batchjobs" => {
              file => "jobmonitors",
          },
          "/setting/eventsources" => {
              file => "eventsources",
              saveoptions => { mode => "oneperfile", },
          },
          "/setting/configsources" => {
              file => "configsources",
              saveoptions => { mode => "oneperfile", format => "dumper" },
          },
          "/setting/propertyrules" => {
              file => "propertysources",
              saveoptions => { mode => "oneperfile", format => "dumper" },
          },
          "/setting/topologysources" => {
              file => "topologysources",
              saveoptions => { mode => "oneperfile", format => "dumper" },
          },
          "/setting/functions" => {
              file => "appliesto-functions",
          },
          "/setting/oids" => {
              file => "sysoid-maps",
          },
          "/setting/roles" => {
              file => "roles",
          },
          "/setting/role/groups" => {
              file => "role-groups",
          },
          "/setting/recipientgroups" => {
              file => "recipient-groups",
          },
          "/setting/integrations" => {
              file => "integrations",
          },
          "/setting/netscans/groups" => {
              file => "netscan-groups",
          },
          "backup_netscan_policies" => {
              file => "netscan-policies",
              backup => \&backup_netscan_policies,
              filter => \&filter_netscan_policies,
          },
          "/setting/collectors" => {
              file => "collectors",
              filter => \&filter_collectors,
          },
          "/dashboard/dashboards" => {
              file => "dashboards",
          },
          "/dashboard/widgets" => {
              file => "widgets",
          },
          "/dashboard/groups" => {
              file => "dashboard-groups",
          },
          "/device/devices" => {
              file => "devices",
              groupby => $CLIENTPROP,
              filter => \&filter_devices,
          },
          "/device/groups" => {
              file => "device-groups",
              backup => \&backup_device_groups,
              groupby => $CLIENTPROP,
              filter => \&filter_device_groups,
          },
          "/service/services" => {
              file => "services",
              groupby => $CLIENTPROP,
              filter => \&filter_services,
          },
          "/service/groups" => {
              file => "service-groups",
              groupby => $CLIENTPROP,
              filter => \&filter_service_groups,
          },
          "/report/reports" => {
              file => "reports",
              filter => \&filter_reports,
              saveoptions => { mode => "oneperfile", nameformat => '%name%.%groupId%', },
          },
          "/report/groups" => {
              file => "reports-groups",
          },
      );
  8. The datasource provided by LM does not handle superscopes (multiple subnets per scope). I wrote one that does handle superscopes and works properly. I don't have a way to monitor across split scopes (portions handled by different servers), but I think that if one of those independently was filling up you would still want to get an alert so it should not matter. With superscope monitoring, you will know only when all the subnet IPs are running out. I will see if I can get that one published in LM Exchange.
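To illustrate why superscope-level monitoring only fires when all member subnets are running out, here is a minimal Python sketch of the arithmetic. It is not the datasource itself; the dictionary keys `in_use` and `free` are hypothetical names chosen for the example, and the real module would read these counters from the DHCP server.

```python
def superscope_utilization(scopes):
    """Return percent of addresses in use across all subnets of a superscope.

    `scopes` is a list of dicts with hypothetical keys 'in_use' and 'free'.
    """
    total = sum(s["in_use"] + s["free"] for s in scopes)
    in_use = sum(s["in_use"] for s in scopes)
    # An empty superscope is reported as 0% rather than dividing by zero.
    return 100.0 * in_use / total if total else 0.0


# One subnet is completely full, but the superscope as a whole is not.
example = [
    {"in_use": 250, "free": 0},
    {"in_use": 50, "free": 200},
]
print(superscope_utilization(example))
```

Because the percentage is computed over the combined pool, a single full subnet does not trip a threshold as long as its siblings still have free addresses.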
  9. Yes, for Windows events you can do this -- we do as well. You lose the event detail, but it can alert only if N events in a window are seen (something customers ask for often). Even then, since the "collect every" value is not visible to the script, you have to take special care to ensure your event scan window and the collect every value are in sync. And this does nothing for any other type of event -- we have to use Sumo Logic (or other similar tools, like Graylog, etc.) to solve this problem in general.
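The "N events in a window" idea can be sketched in a few lines of Python. This is only an illustration of the counting logic, not the script we run; the function names are made up for the example, and the `window` argument has to be kept in step with the "collect every" value by hand, since (as noted above) the script cannot see that setting.

```python
from datetime import datetime, timedelta


def count_recent_events(event_times, now, window):
    """Count events with a timestamp in the half-open interval (now - window, now]."""
    cutoff = now - window
    return sum(1 for t in event_times if cutoff < t <= now)


def should_alert(event_times, now, window, threshold):
    """True when at least `threshold` events fall inside the scan window."""
    return count_recent_events(event_times, now, window) >= threshold


now = datetime(2020, 1, 1, 12, 0, 0)
times = [now - timedelta(minutes=m) for m in (1, 3, 4, 10)]
print(should_alert(times, now, timedelta(minutes=5), 3))
```

If the window is shorter than the collection interval, events can be missed between polls; if it is longer, the same event can be counted in more than one poll.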
  10. We usually bind this check to a DC, so I don't think I have run into that before. I definitely have for other stuff, like DHCP. Can't hurt to warn folks in the technical note section!
  11. I am a big believer in having the computer do the work for me, but for website checks humans are forced to do a lot of unnecessary work to find the cause. Generally, we set overall to critical and individuals to warning. My request is that there be a method to also send the individual alerts when the overall triggers. In the absence of a dynamic templating system, the best option I can think of here is to have a way to elevate individual contributors to critical when the overall status triggers. Or, change the behavior so that the individual details are added to the generic overall status alert. Or something else -- I just don't want to have to (or tell my clients they have to) dig through alerts in the UI to figure out what is wrong.
  12. I was not actually thinking that, but sure, I really would like one! Seems like a way to possibly do some of the cross-device correlation I have been wishing I could do. Just would not dream of touching without docs....
  13. So you know what I am going to ask next, right? What is CollectorDb and where is that documented? Feels a bit unfair to us poor mortal developers :).
  14. Are you sure about the batchscript limitation? Because the new SNMP_Network_Interfaces DS supports subrate ifSpeed designation as ILPs and is batchscript (we know this because it blew out several collectors due to default batchscript thread counts). I have not looked at the guts yet, so perhaps I am off track there.
  15. I found only one instance of in the datasource set we have loaded -- in Citrix_XenApp_UserExperience:

      def session =;
      def active_apps = session.queryAll(namespace, "SELECT * from Citrix_Euem_ClientStartup", 15);

      The implication is it would have to handle wmi.user/wmi.pass behind the scenes. All other references are for WMI.queryAll and WMI.queryFirst with the same implication. If none of the modules properly handle those properties, then there are a lot of broken modules:

      [mnagel@colby datasources]$ egrep -l 'WMI\.(query|open)' *
      Citrix_XenApp_UserExperience
      LogicMonitor_Collector_TotalCPUMemory
      Microsoft_Exchange_ActiveDirectoryDomainControllers_2016+
      Microsoft_Exchange_EdgeTransportDatabaseInstances_2016+
      Microsoft_Exchange_EdgeTransportDatabases_2016+
      Microsoft_Exchange_MailboxDatabaseInstances_2016+
      Microsoft_Exchange_MailboxDatabases_2016+
      Microsoft_Exchange_MailboxOverview_2016+
      Microsoft_Exchange_Replication_2016+
      Microsoft_Exchange_TransportQueueOverview_2016+
      Microsoft_Exchange_UnifiedMessaging_2016+
      Microsoft_LyncMediationServer_Stats
      Microsoft_LyncServer_AccessEdgeServerStats
      Microsoft_LyncServer_Authentication
      Microsoft_LyncServer_BackupCentralManagementModule
      Microsoft_LyncServer_ClusterManager
      Microsoft_LyncServer_ConferencingAttendant
      Microsoft_LyncServer_Datastores
      Microsoft_LyncServer_EmergencyCallRouting
      Microsoft_LyncServer_InstantMessaging
      Microsoft_LyncServer_MCU
      Microsoft_LyncServer_Messages
      Microsoft_LyncServer_Networking
      Microsoft_LyncServer_Protocol
      Microsoft_LyncServer_Routing
      Microsoft_LyncServer_RoutingApps
      Microsoft_LyncServer_StorageService
      Microsoft_LyncServer_WebServices
      Microsoft_LyncServer_XMPPProxy
      WinCPU
      Win_WMI_Access_Denied_ErrorCodes
      Win_WMI_UACTroubleshooter
  16. I would say if the instance can be uniquely identified with data on hand (as described above), then the datasource should be using that as the instance wildvalue, not some arbitrary other thing that could cause excess instances due to customer action or anything similar. As far as data retention, I have found that decisions are often made that lead to loss of data and it is distressing. I just had a case where I pointed out that a datapoint label had a typo. Fixed, but the fix kills all old data for that datapoint. Why must the label be the index rather than a label tied to a persistent index? I see similar problems for DS replacements. I suggested in a F/R long ago that it be possible at DS load time to upgrade from the previous DS version. I fully appreciate that new datasources with alternate structures should be created, but if there was a migration function you could select the datapoint mapping to avoid losing data (currently best option is to run both in parallel until you get enough new data to not look foolish to your clients). Preferably this would be builtin to the new datasource, so it would happen automatically or at least could provide guidance. That sort of mechanism could also handle my typo'ed datapoint issue. Nuts and bolts stuff like that is hard to market, though :(.
  17. To add to this -- there is in fact a pair of datapoints in the new SNMP_Network_Interfaces datasource that could be used to detect aggregate speed problems, a proxy for channel member loss (inInterfaceSpeed and outInterfaceSpeed, set to the reported speed unless overridden by ILPs). To use this, you would either have to set a threshold on the specific datapoints needed and deal with generic alerts, or add a virtual datapoint with a proper alert message, making future module updates painful. I really wish that the F/R I posted years ago to allow for LogicModule inheritance was implemented... If you use SNMP_Network_Interfaces in any real world environments, be sure your collectors have been tuned to allow more batchscript threads!
  18. Yes, I also pull data from ConfigSources to check into Git repos with post-commit triggers to send email. That is how we can find out what actually has changed (and we have had to add a decent chunk of post-processing in some cases as the ConfigSources often suck in ephemeral updates, thrashing the changelog). In this case, I really don't want to pull so much of something that should be handled within LM into external script processing. My method works fine now, just unfortunately you cannot send ILP tokens into PowerShell collection scripts so the list of groups must be hardcoded.
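The post-processing mentioned above (dropping ephemeral lines before check-in so the changelog does not thrash) might look something like this Python sketch. It is not our actual script; the two patterns are assumptions based on lines that commonly churn in Cisco IOS configs, and a real list would need to be built per device type.

```python
import re

# Assumed examples of lines that change on their own; extend per device type.
EPHEMERAL_PATTERNS = [
    re.compile(r"^! Last configuration change at "),
    re.compile(r"^ntp clock-period \d+"),
]


def scrub_config(text, patterns=EPHEMERAL_PATTERNS):
    """Return the config text with every line matching an ephemeral pattern removed."""
    kept = [
        line
        for line in text.splitlines()
        if not any(p.search(line) for p in patterns)
    ]
    return "\n".join(kept) + "\n"


raw = (
    "hostname r1\n"
    "! Last configuration change at 10:00:00 UTC Mon Jan 1 2020\n"
    "ntp clock-period 17180000\n"
    "interface Gi0/1\n"
)
print(scrub_config(raw), end="")
```

Running the scrub before `git add` means a commit (and its post-commit email) only happens when something other than the ephemeral lines has changed.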
  19. Yeah, we ended up having to pay extra for SumoLogic, but could be anything. Still would be nice to have the barest level of correlation so you could effectively ACK events.
  20. There is no Cisco MIB for etherchannel that I have found (in general -- may be something for some specific device types). Same for other aggregates, like PPP multilink. In our experience, the only safe way to detect aggregate issues is to monitor the reported speed of the bundle as the underlying link does NOT necessarily need to be down for an aggregate to lose members (seen it happen when carriers do testing and fail to put the member back into the bundle). Unfortunately, despite previous requests, the interface datasource still does not report ifSpeed so you cannot set a threshold on that datapoint to detect out-of-spec speeds.
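As a rough illustration of the "watch the bundle speed" approach, the Python sketch below compares a reported aggregate speed against what the bundle should deliver. It assumes all members run at the same speed, and the function and parameter names are invented for the example; the reported speed would have to come from whatever source actually exposes it for the bundle.

```python
def bundle_degraded(reported_bps, member_bps, expected_members):
    """True when the bundle reports less speed than all expected members provide."""
    return reported_bps < member_bps * expected_members


def missing_member_count(reported_bps, member_bps, expected_members):
    """Estimate how many equal-speed members have dropped out of the bundle."""
    active = reported_bps // member_bps
    return max(expected_members - active, 0)


# A 4 x 1 Gb/s bundle that currently reports 3 Gb/s.
GIG = 1_000_000_000
print(bundle_degraded(3 * GIG, GIG, 4), missing_member_count(3 * GIG, GIG, 4))
```

This catches the case described above where a member silently leaves the bundle while its underlying link stays up, since only the aggregate speed changes.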
  21. Oh yes, now I recall why I did not consider LMConfig. Changes cannot be sent (unless you use an external API script), so you just get the red light alert ("something changed"). My current alert message includes the current and expected list. The alert communication channel in general is fairly small and I hate to constrain it further.
  22. Writing to a file is more or less how we used to do it with Nagios (technically, it was a File::Cache object), but that was with a central Nagios server + gearman and those checks always ran on the central server. With LM, this would be suboptimal in the face of collector pools (coming, one day) or collector failover. IMO, the right tool for that is a distributed key/value store. As for LMConfig -- it is a premium feature that we don't by default push our clients to have to add to their cost structure to achieve things that should be possible without it. In this case, you could get away with one per domain, so the cost increase is defensible. Will see what I can do...
  23. In our previous life, we had written a Nagios plugin to check whether a sensitive Windows group had changed (e.g., Domain Admins). I created a replacement for this within LM, but since we can't really keep track of deltas without a key/value store, we use a property for each group that specifies the expected members, which should be updated when membership changes intentionally. We also use a property to list the groups for AD so we can store useful ILPs, but since those ILPs are not passed to the collection script (they could be, just are not currently passed for PowerShell), the list of groups that can be checked is restricted to what is built into the collection script. For one or more AD controllers, then, you would specify (for example):

      windows.groupcheck.list: Domain Admins
      windows.groupcheck.spec.Domain_Admins: administrator,alice,bob

      If the list diverges, the datapoint for that group will alert. There is also a total count of members that is tracked, and can be used to set an alert if needed (e.g., some groups like Schema Admins should normally be empty, but that can be handled by the spec). 2Y9FM6
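The comparison at the heart of this check is simple enough to sketch. The real collection script is PowerShell; the Python below is only an illustration of the logic, with invented function names, and it assumes member names should be compared case-insensitively.

```python
def parse_spec(spec):
    """Turn a comma-separated spec property into a set of lowercase member names."""
    return {m.strip().lower() for m in spec.split(",") if m.strip()}


def compare_membership(expected_spec, actual_members):
    """Compare the expected member list (from the property) to the live group."""
    expected = parse_spec(expected_spec)
    actual = {m.strip().lower() for m in actual_members if m.strip()}
    return {
        "unexpected": sorted(actual - expected),  # present but not in the spec
        "missing": sorted(expected - actual),     # in the spec but not present
        "total": len(actual),                     # member count datapoint
        "diverged": int(actual != expected),      # 1 means the group should alert
    }


result = compare_membership(
    "administrator,alice,bob",
    ["Administrator", "alice", "mallory"],
)
print(result)
```

An empty spec string yields an empty expected set, which is how a group that should normally have no members (such as Schema Admins) can be handled by the same comparison.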
  24. Posted new version with datapoint messages and revised out of the box thresholds. Did not change AD or collection code, but it still shows pending review. PE9KPD