mnagel

Posts posted by mnagel

  1. Thanks! I need to make one more pass on it to enable custom alert messages. I added two different virtual datapoints so messages can say "expired XX ago" versus "will expire in XX". Looking forward to one day being able to just check stuff when alert messages are actually handled by template processors :).
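
One way to picture the two virtual datapoints is as two non-negative values derived from a single signed time-to-expiry. This is only a minimal Python sketch; the names `days_remaining` and `days_expired` are hypothetical and are not taken from the DataSource itself:

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400.0

def expiry_datapoints(not_after, now):
    """Split a signed time-to-expiry into two non-negative day counts.

    The key names are hypothetical. At most one of the two values is
    non-zero, so each alert message can reference the one that applies.
    """
    days_left = (not_after - now).total_seconds() / SECONDS_PER_DAY
    return {
        "days_remaining": max(days_left, 0.0),  # feeds "will expire in XX"
        "days_expired": max(-days_left, 0.0),   # feeds "expired XX ago"
    }
```

Keeping both values non-negative means a message never has to print a negative number of days.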

  2. Thanks!  And, we shall see :).  I stepped away for now on the whole data collection timeout thing to clear my head.  Feels like it is LM causing it, but can't see how.  I based the general structure on the "_Windows patches needed" DS Mike Suding wrote.  Also tried the PSSession avenue other DSes use, but made no difference.  Same exact code run from the collector PS shell returns data quickly.  Hopefully we can figure it out -- this information is otherwise hard to get from external tests.

  3. One thing I have noticed over time is how often I find that there is a datapoint somewhere that really deserves to be included in an alert rule, but you just don't know this until after you get bitten.  This issue is orthogonal to threshold severity, at least in some of the modules I have seen.  An example fresh from today was loss of power in an environment with no indication it happened.  After some checking, found the Cisco FRU Power DS had an update and afterward showed when power loss (and other related issues) happened.  Whoever wrote this one decided each class of issues would be warning level only even though the DP classes themselves are warning, error and critical, grouping different conditions within each.  What I came away with was that LM itself should have a diagnostic capability to (among other things) recommend which datasources represent important things that ought to have alert rules but do not (or route to NoEscalation). I am not sure yet how this ought to be represented in the system, but some indication of "this one is important and should route to an alert!" in each datapoint would be a good start.  It may be there is more metadata that deserves to be included, but nothing else pops into my head right now.

  4. Please add org.tinyradius.util.RadiusClient or similar so we can create RADIUS protocol checks. Yes, I know I can work around this by deploying JAR files, but since that is not a process managed by LM, it is problematic to deal with (we cannot use puppet or similar on _every_ collector system).

  5. I have written a DS that uses PowerShell to discover any SSL Certificate within the Windows certificate stores and generates alerts for those expiring soon and for those that have already expired.  The alert messages are still generic as I am fighting a weird timeout issue with the data collection code against remote devices.  The AD code works fine and the data collection code is virtually identical, simpler in fact as we have the serial number on hand.  If I run it from the collector itself in a PS console, it also works fine.  Just seems to go to lunch when run from within LM itself.  If anyone wants to take a look and see if they can find the problem, that would be much appreciated -- my intent is to polish it up and release it publicly.  It is in code review, not clear how long that will take with the new LMExchange feature.

    2YPMLN

  6. Sounds great, will check it out! I wrote something similar for interfaces way back for similar reasons -- LM limitations and no clear focus on nuts and bolts stuff like this.  My largest remaining annoyance on that is we use patterns via properties that examine the interface description so we can disable monitoring and/or alerting for interfaces based on description patterns (e.g., "Workstations").  However, because AD will skip interfaces that are operDown, you cannot change the description after it is down and have the script make the necessary change. You could change the DS to discover operDown interfaces always, but that would be ridiculous in general.  I filed an F/R to allow updating some fields (like the description) for instances already discovered, but AFAIK it has received no attention.  Just added to our repo under GPL: https://github.com/willingminds/lmapi-scripts
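
The description-pattern idea can be sketched as below. This is not the code from the linked repository, just a minimal Python illustration of the matching step; the parameter names `no_monitor_pattern` and `no_alert_pattern` are hypothetical stand-ins for whatever properties hold the patterns:

```python
import re

def interface_actions(description, no_monitor_pattern=None, no_alert_pattern=None):
    """Decide what to disable for one interface based on its description.

    Both pattern arguments are hypothetical property values holding a
    regular expression; an unset or empty pattern never matches.
    """
    def matches(pattern):
        if not pattern:
            return False
        return re.search(pattern, description or "", re.IGNORECASE) is not None

    return {
        "disable_monitoring": matches(no_monitor_pattern),
        "disable_alerting": matches(no_alert_pattern),
    }
```

For example, `interface_actions("Workstations - floor 2", no_alert_pattern="workstations")` flags the interface for alert suppression only. Applying the result to the instance would still go through the API, which is where the operDown limitation described above comes in.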

  7. Sometimes it would be nice to create section labels within dashboards.  For example, since you cannot show multiple website ping statistics in one graph, I group them by columns and would like to add a column label.  You can add a text widget, but it takes way too much vertical space.  Similarly, it may be nice to have sections with boundaries and background color fill to visually group various widgets.

  8. I never really received any useful information, but there are two different things that are not documented:

        * MEMCACHE method in datasources
        * memcache JAR in the LM agent/lib directory

    For the latter, I did figure out it is this one by examining the ToC: com.danga.MemCached

    I have had more difficulty tracking down official documentation on that than I would expect (though I am sure it is out there).  I have found some useful examples, so will see if I can actually leverage the library.

    https://sites.google.com/site/networkprogrammingforjava/home/miscellaneous/others/memcached

  9. Yes, that would help.  Still not same as saved searches as we can never delegate creation of dynamic groups to users and dynamic groups can only be defined for resources, not for websites or collectors.  Allowing creation of taxonomic groups (not related to security) would help as well as long as RBAC could allow managing members without granting special resource access or allow manipulation of the group itself.  It is just one of the nuts and bolts issues that create walls for us all the time. Like UI lists randomly lacking search functions or not being able to avoid alerts for clustered resource instances (e.g., HA interface pairs) without coding each case in the DS (or in some cases use of API to auto-ACK partner instances leveraging properties).

  10. Thinking about this more, what is needed is a facility to define saved searches along with the ability to apply actions (like SDT) to those searches (or to multiple items in general).  OSS tools like Thruk do this very well (though I don't recall it supported saved searches last time I used it, but you could reference a search as a permalink IIRC).  In my experience with LM to date, nuts and bolts operational issues like this that impact everyone get too little attention as the more sexy things like Kubernetes and cloud monitoring seem to draw all the development resources.  I really would like to see some more focus on nuts and bolts issues.

  11. We thought we came up with a trick to deal with letting our clients manage maintenance on many different devices.  The idea was, create a group they can manage and let them add those devices to the group, then schedule maintenance and update as needed.  Alas, RBAC prevents this, primarily because it lacks the ability to distinguish using groups for grouping from using groups for security.  Because the users don't have manage on the devices (intentionally), they cannot add them to a group.  If we could allow them to add to a non-security group, it would potentially fix this.  I'm sure other options to be added by LM might work, perhaps better. This one was already somewhat concerning as granting manage to the maintenance group meant they could potentially delete the group by accident.  I understand why this is broken under today's semantics, but we need a group mechanism that works as intended for this or a suitable alternative.

  12. We found out the hard way this past weekend that the current NetApp DS suite is missing a crucial check for cluster member health.  You can deploy it from here (once the code is reviewed -- hopefully quickly as it is just a clone with a different query and datapoint set).

    FJTRGL

  13. Please add HtmlUnit (http://htmlunit.sourceforge.net/) to allow for more sophisticated web page interaction, including JavaScript evaluation. Right now any pages with JavaScript are impenetrable.  I know it is in theory possible to add this ourselves, but it is a painful process that must be done on each collector.  Much easier if included to begin with.  Or please add a facility to allow deployment of new libraries within Settings.  Perhaps not fully dynamic like Grapes, but something along those lines would be very helpful.

     

  14. Stuart Weenig said:

    Wouldn't this be similar to just running an external web check in LM, but with more metrics?

    The problem with most of the web-only options is the upload speed test is either nonexistent or very inaccurate. The speedtest-cli client is not supported and does not use the same protocols normally used by Ookla tests.  I have not tried the SpeedOfMe API yet, so not sure if it is better or worse, but would hope they account for running it similarly to interactive conditions.  I also just found Ookla now has a supported CLI option, which looks hopeful.  Seems like it needs a license for commercial use, though.  https://www.speedtest.net/apps/cli

  15. I don't think so, but perhaps.  global status should have current statistics and variables should have static settings.  Could be wrong, of course. 

    As far as the DP versus property issue, yeah, I have run into this before.  We wanted to be able to reference a property in widgets and you simply can't, so we had to create a datapoint from the property (this is to be able to display resource commit usage for our clients).  So, we have a DS applied to collectors that only has complex groovy datapoints with the value ##property## for each.  Rough edges :).

  16. Or, you could not add SSH overhead and just use a query :).

    MariaDB [(none)]> show variables like '%conn%';
    +-----------------------------------------------+-----------------+
    | Variable_name                                 | Value           |
    +-----------------------------------------------+-----------------+
    | character_set_connection                      | utf8            |
    | collation_connection                          | utf8_general_ci |
    | connect_timeout                               | 10              |
    | default_master_connection                     |                 |
    | extra_max_connections                         | 1               |
    | init_connect                                  |                 |
    | max_connect_errors                            | 100             |
    | max_connections                               | 151             |
    | max_user_connections                          | 0               |
    | performance_schema_session_connect_attrs_size | -1              |
    +-----------------------------------------------+-----------------+
    10 rows in set (0.00 sec)

    Confirmed via same user with GRANT USAGE as normally required for "show global status" results.
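
If the values are gathered by capturing client output like the table above (rather than reading rows through a database driver), the rows can be turned into key/value pairs. A minimal Python sketch, assuming the two-column table layout shown above; the function name `parse_mysql_table` is hypothetical:

```python
def parse_mysql_table(text):
    """Parse `Variable_name | Value` rows from mysql client table output.

    Assumes the two-column layout shown above. Values are kept as
    strings; empty values become empty strings.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, the prompt line, and the row-count footer
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] == "Variable_name":
            continue  # skip the header row and anything unexpected
        values[cells[0]] = cells[1]
    return values
```

A settings-style datapoint such as `max_connections` could then be read as `int(parse_mysql_table(output)["max_connections"])`.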