Keimond

Members
  • Content Count

    16
  • Joined

  • Last visited

  • Days Won

    2

Community Reputation

3 Neutral

About Keimond

  • Rank
    Community Whiz Kid


  1. Example: we have a graph showing the current number of connected users. I have another datapoint that derives those values so I can see the number of new users that connected between polls. I want to see a bar graph over x time (like 24 hours) showing the total number of new connections per x time (like 1 hour). Here's a decent example.. Or maybe I want to stack the graph with other datapoints showing connected users per protocol or something... Thanks! David Klein
  2. Sorry @GraphTheWorld, not ignoring you! My priorities in the company have been shifted elsewhere right now and D42 isn't one of them. I'll come back and elaborate more later. (I've also asked the rest of my team to give their input on what they'd like to see.) At most, our top goal was: whatever is in Device42 labeled as production should be in LogicMonitor (unless monitoring is set to off), and/or the opposite: whatever is in LogicMonitor should be in Device42.
  3. @Sarah Terry Thank you!!! That helps out a lot. That answer doesn't seem to be very straightforward for the common user, especially when an alert is cleared. The user has to know that if they want to see history on an instance, then when they go to the Alerts tab on that instance they also have to change Cleared to All to see past alerts, then click on History. -- which is why a tab right next to Alerts might be nice, or a visible "History" button under the Alerts tab, so that the common user has an easy link to see history. At least for now it's just a matter of training everyone.
  4. Hey @Sarah Terry, I was just searching for "history" and ran across this. I was just looking at a resource and thinking, boy, it would be nice if right next to the Alerts tab there was a History tab (or maybe it could just show on the same Alerts tab as old alerts)... basically a quick way to see: has this instance alerted before, and when? Which would be great for the normal user!
  5. @David Lee Thank you for this, I might have to play around with implementing this for a few of my DS's... Along with conditional formatting, I would love to see the ability to say:
-- if threshold is < X or > Y
-- if threshold is > X and < Y
-- if threshold for DP1 is < X AND threshold for DP2 is < Y
(Right now I use a complex datapoint and put my thresholds in there.. if dp1 is lower than X and dp2 is lower than Y, return a 0 or 1 based on true/false statements... it's ugly and confusing -- roughly the logic sketched below.)
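Just to illustrate what that complex datapoint boils down to, here is the logic in plain Python (not actual LM datapoint syntax; the names and thresholds are made-up examples):

    # Composite-threshold flag: 1 when BOTH datapoints are under their thresholds, else 0.
    # A normal static threshold on this flag then does the alerting.
    X = 80   # example threshold for dp1
    Y = 500  # example threshold for dp2

    def composite_flag(dp1, dp2):
        return 1 if (dp1 < X and dp2 < Y) else 0

    print(composite_flag(75, 450))  # 1 -> condition met
    print(composite_flag(90, 450))  # 0 -> dp1 is over its threshold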
  6. Server1:
uy1 - dns1.us1.blah.com
uy1 - dns2.us1.blah.com
ar1 - dns1.us1.blah.com
ar1 - dns2.us1.blah.com
my1 - dns3.us1.blah.com
my1 - dns4.us1.blah.com

Server2:
ab1 - dns3.us1.blah.com
ab1 - dns4.us1.blah.com
mn1 - dns3.us1.blah.com
mn1 - dns4.us1.blah.com
bg2 - dns1.us1.blah.com
bg2 - dns2.us1.blah.com

All instances return a value for a query time (from Server1/2 to dns1/2/3/4.us1.blah.com.. whichever is in the name), and all instances return a value of 0, 1, or 2 based on whether they were able to query both servers. So for example, if uy1 can't query either dns1 or dns2, that's a critical because that site can't query either server. So I see two possible groupings: the sites (uy1, ar1, my1, ab1, mn1, bg2) or the servers (dns1.us1.., dns2.us1.., dns3.us1.., dns4.us1..). So instead of each of my instances having to do its own scripted checking in the background to check both servers, a grouped/cluster alert could alert if both instances are down. And in the other case, if say dns1.us1.blah.com is down... I'd rather not get a page from uy1, ar1, and bg2... just one page with a custom alert saying that uy1, ar1, and bg2 are unable to contact dns1.us1.blah.com. (I envision this being possible by pulling instance properties; for example, uy1 - dns1.us1.blah.com would have instance properties of site = uy1 and dns = dns1.us1.blah.com.) Flexibility in being able to customize alert messages for each cluster/grouped alert is a big one too!! Right now it's a standard template across the board. (A rough sketch of the grouping logic is below.)
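To make the instance-property idea concrete, this is roughly the grouping logic I have in mind -- illustrative Python only, not an existing LM feature, and the sample data is made up:

    # Each instance carries properties (site, dns); alerts get grouped on either key.
    instances = [
        {"name": "uy1 - dns1.us1.blah.com", "site": "uy1", "dns": "dns1.us1.blah.com", "down": True},
        {"name": "uy1 - dns2.us1.blah.com", "site": "uy1", "dns": "dns2.us1.blah.com", "down": True},
        {"name": "ar1 - dns1.us1.blah.com", "site": "ar1", "dns": "dns1.us1.blah.com", "down": True},
        {"name": "ar1 - dns2.us1.blah.com", "site": "ar1", "dns": "dns2.us1.blah.com", "down": False},
    ]

    def grouped_alerts(instances, key):
        groups = {}
        for inst in instances:
            groups.setdefault(inst[key], []).append(inst)
        for value, members in groups.items():
            down = [m["name"] for m in members if m["down"]]
            if down and len(down) == len(members):
                # e.g. grouped by site: the site can't reach ANY server -> one critical page
                yield f"CRITICAL {key}={value}: all instances down ({', '.join(down)})"
            elif down:
                # partial outage -> one summary page listing the affected instances
                yield f"WARN {key}={value}: down instances: {', '.join(down)}"

    for msg in grouped_alerts(instances, "site"):
        print(msg)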
  7. Hello, we would love the ability to have the cluster alert give us a list of the instance names that have triggered the cluster alert. Another is filtering based off of an instance name. Example: let's say I have 3 devices that each have 3 instances named something like:

Device1: a1-dns1.blah, b6-dns2.blah, c1-dns3.blah
Device2: d1-dns1.blah, e2-dns2.blah, f1-dns3.blah
Device3: g2-dns1.blah, h1-dns2.blah, i2-dns3.blah

While all the instances have different names, the last part of each name is still common with some of the others (dns1.blah, dns2.blah, dns3.blah). I would like the ability to say: trigger the alert if 2 or more regex groups match.. ie.. [a-z][0-9](dns[0-9].blah)
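As a sketch of the regex-group idea (illustrative Python only; I've added the dash to the pattern so it matches the example names):

    import re
    from collections import Counter

    # Instances currently in alert across the devices in the cluster (made-up sample)
    triggered = ["a1-dns1.blah", "e2-dns2.blah", "g2-dns1.blah", "f1-dns3.blah"]

    # Capture the common tail of the instance name
    pattern = re.compile(r"[a-z][0-9]-(dns[0-9]\.blah)")

    counts = Counter()
    for name in triggered:
        m = pattern.search(name)
        if m:
            counts[m.group(1)] += 1

    for group, count in counts.items():
        if count >= 2:  # "trigger the alert if 2 or more ... match"
            print(f"cluster alert for {group}: {count} instances triggered")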
  8. Just another random note lol.. I read YAML better so I'm always using https://www.json2yaml.com/

name: test111
parentId: 111
appliesTo: startsWith("system.displayname","Prod")

translates to

{
  "name": "test111",
  "parentId": 111,
  "appliesTo": "startsWith(\"system.displayname\",\"Prod\")"
}

*shrug*
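If you have PyYAML around, the same translation is easy to do locally too -- a quick sketch (assumes the pyyaml package is installed; nothing LM-specific):

    import json
    import yaml  # pip install pyyaml

    payload = {
        "name": "test111",
        "parentId": 111,
        "appliesTo": 'startsWith("system.displayname","Prod")',
    }

    # JSON -> what the API wants
    print(json.dumps(payload, indent=2))

    # YAML -> what I'd rather read
    print(yaml.safe_dump(payload, default_flow_style=False))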
  9. Please include "Grouping" too. We just got flooded with a bunch of pages last night because a server went down, which caused instances from several datasources to go down.. OUCH! It would be great to group instances so that if one goes down, that group pages and says hey, something is wrong with this server.. then we can log in and go check.. rather than getting flooded by everything. Same idea with grouping of servers... I could call a group "HKG POP"... a server goes down that I have put in that group, or any instance alerts a critical.. just that one group pages and says hey, a critical event has happened in the HKG POP.... and if other servers go down, it's not going to page again if it already has and hasn't been resolved. (I can see a few gotchas in there.. but with enough configuration choices the company can decide how they want to handle situations.)
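The "page once per group until it's resolved" part, roughly, as plain Python pseudologic (the names are made up; this isn't an existing LM feature):

    # Remember which groups have already paged; stay quiet until they're resolved.
    paged_groups = set()

    def on_critical(group, resource, detail):
        if group in paged_groups:
            return  # the group already paged and isn't resolved yet
        paged_groups.add(group)
        send_page(f"A critical event has happened in {group}: {resource} - {detail}")

    def on_resolved(group):
        paged_groups.discard(group)  # the group may page again next time

    def send_page(message):
        print(message)  # stand-in for the real escalation chain

    on_critical("HKG POP", "server01", "host down")
    on_critical("HKG POP", "server02", "host down")  # suppressed, group already paged
    on_resolved("HKG POP")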
  10. Definitely a step in the right direction for LM.. now we just need to be able to use this for stuff that goes down but doesn't have an SDT lol... I assume the same logic can be applied, but that involves deeper changes than datasources. I may have missed it, but if it hasn't been requested: besides host dependencies, we should have service dependencies. There should also be multiple dependencies.
* Host03 depends on Host01
* Host04 depends on Host01 AND/OR Host02
-- (and) don't alert on Host04 (and its services) if Host01 and Host02 are both down.
-- (or) don't alert on Host04 if just Host01 or Host02 is down.
* Service07 depends on Service05
* Service10 depends on Service05
* Service08 depends on Service01 AND/OR Service02
-- (and) don't alert on Service08 if Service01 and Service02 are both down.
-- (or) don't alert on Service08 if just Service01 or Service02 is down.
And the list could go on with multiple scenarios...
* Service09 depends on Host01
* Service12 depends on Host01 AND/OR Host02
* Host03 depends on Service04
* Host03 depends on Service02 AND/OR Service06
(A rough sketch of the AND/OR evaluation is below.)
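Roughly how I picture the AND/OR evaluation, as an illustrative Python sketch (the rules and names are invented for the example):

    # "and" = suppress the child only if ALL listed parents are down;
    # "or"  = suppress the child if ANY listed parent is down.
    DEPENDENCIES = {
        "Host04":    {"mode": "and", "parents": ["Host01", "Host02"]},
        "Service08": {"mode": "or",  "parents": ["Service01", "Service02"]},
    }

    def suppress_alert(target, down):
        rule = DEPENDENCIES.get(target)
        if not rule:
            return False
        states = [p in down for p in rule["parents"]]
        return all(states) if rule["mode"] == "and" else any(states)

    print(suppress_alert("Host04", {"Host01"}))            # False: only one parent down under "and"
    print(suppress_alert("Host04", {"Host01", "Host02"}))  # True: both parents down
    print(suppress_alert("Service08", {"Service02"}))      # True: any parent down under "or"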
  11. Long post.. I'm posting this more as informational; this is what we did to collect everything through SNMP rather than open up any ports like ssh... If you have ideas on making it better, I'm all ears! Nothing fancy.. I modified the code from the real-world example that snmp_passpersist gives on their page... the main script loads stuff from a YAML config file, then calls each Python module, passing the snmp pass_persist object (pp) and the config object over to them. The modules run a quick test to see if they actually need to run... for example, if the machine doesn't have DNS, then dns_stats doesn't need to run. The File Monitor and DNS Stats modules are capable of monitoring multiple instances.

Things that I still want:
-- The file monitor OIDs will change if a file is added or removed.. so maybe we set LM to do autodiscovery based on a key or lookup instead?!
-- The above applies to software raids.. may need a state file to cache things in, or again.. key/lookups instead.
-- dns_stats calls an external bash script to do the parsing because my Python fu is very weak.
-- Clean up / insert comments everywhere!

snmpd.conf (add this line to the bottom):

pass_persist .1.3.6.1.4.1.6556 /usr/share/snmp/extensions/lm/logicmonitor_helper.py

config.yml: (/usr/share/snmp/extensions/lm/)

---
dns:
  instances:
    - name: 'Main DNS'
      config: '/etc/named.conf'
      stats: '/var/named/data/named_stats.txt'
    - name: 'Special voodoo DNS'
      config: '/etc/named.voodoo.conf'
      stats: '/var/named.voodoo/data/named_stats.txt'
file_monitor:
  directories:
    - '/var/opt/logs/servers'
    - '/var/opt/logs/network'

Main script: (/usr/share/snmp/extensions/lm/LogicMonitorHelper.py)

#!/usr/bin/python -u
# -*- coding:Utf-8 -*-
# Option -u is needed for communication with snmpd
import sys
sys.path.append('/usr/share/snmp/extensions/lm/libraries')
import snmp_passpersist as snmp
import os, re, socket, syslog, time, errno
import LM_Config as cfg
import LM_Software_Raid
import LM_FileMonitor
import LM_dns_stats

# Global vars
pp = None
config = cfg.loadConfig()

VER = "Logic Monitor SNMP Helper v0.1"
POLLING_INTERVAL = 60
MAX_RETRY = 5
OID_BASE = ".1.3.6.1.4.1.6556"

def update_data():
    pp.add_str('0.0', VER)
    ## This is needed because the variable isn't getting reset to 0 inside the modules
    LM_Software_Raid.doUpdate(pp)
    LM_FileMonitor.doUpdate(pp, config)
    LM_dns_stats.doUpdate(pp, config)

def main():
    syslog.openlog(sys.argv[0], syslog.LOG_PID)
    retry_timestamp = int(time.time())
    retry_counter = MAX_RETRY
    while retry_counter > 0:
        try:
            global pp
            syslog.syslog(syslog.LOG_INFO, "Starting Logic Monitor SNMP Helper...")
            # Load helpers
            pp = snmp.PassPersist(OID_BASE)
            pp.start(update_data, POLLING_INTERVAL)  # Shouldn't return (except if the updater thread has died)
        except KeyboardInterrupt:
            print "Exiting on user request."
            sys.exit(0)
        except IOError, e:
            if e.errno == errno.EPIPE:
                syslog.syslog(syslog.LOG_INFO, "snmpd has closed the pipe, exiting...")
                sys.exit(0)
            else:
                syslog.syslog(syslog.LOG_WARNING, "Updater thread has died: IOError: %s" % (e))
        except Exception, e:
            syslog.syslog(syslog.LOG_WARNING, "Main thread has died: %s: %s" % (e.__class__.__name__, e))
        else:
            syslog.syslog(syslog.LOG_WARNING, "Updater thread has died: %s" % (pp.error))
        syslog.syslog(syslog.LOG_WARNING, "Restarting monitoring in 15 sec...")
        time.sleep(15)
        # Error frequency detection
        now = int(time.time())
        if (now - 3600) > retry_timestamp:  # If the previous error is older than 1 hour
            retry_counter = MAX_RETRY       # reset the counter
        else:
            retry_counter -= 1              # else count down
        retry_timestamp = now
    syslog.syslog(syslog.LOG_ERR, "Too many retries, aborting... Please check that snmpd is running!")
    sys.exit(1)

if __name__ == "__main__":
    main()

LM_Config.py: (/usr/share/snmp/extensions/lm/libraries)

#!/usr/bin/python
import sys
sys.path.append('/usr/share/snmp/extensions/lm/libraries')
import yaml

file = '/usr/share/snmp/extensions/lm/config.yml'

def loadConfig():
    try:
        with open(file) as f:
            return yaml.load(f)
    except:
        print(file + " doesn't exist")

LM_Software_Raid.py: (/usr/share/snmp/extensions/lm/libraries)

#!/usr/bin/python
import sys
sys.path.append('/usr/share/snmp/extensions/lm/libraries')
import re
import mdstat
import json
import argparse
import LM_Config as cfg

md = mdstat.parse()

def testModule():
    config = cfg.loadConfig()
    if len(md['devices']) > 0:
        return True
    else:
        return False

def to_bool(*args):
    try:
        if args[1] == "rev":
            if args[0] == True or args[0] == None:
                return 1
            elif args[0] == False:
                return 0
    except:
        if args[0] == True or args[0] == None:
            return 0
        elif args[0] == False:
            return 1
        else:
            return args[0]

def doUpdate(pp):
    if testModule():
        # Headers for each section in the for loop(s)
        pp.add_str('1.0', "md name")
        pp.add_str('1.1', "md status")
        # 1.2 is in the next loop
        pp.add_str('1.3', "md raid type")
        pp.add_str('1.4', "Disks in software raids")
        pp.add_str('1.5', "Disks Faulty Status")
        pp.add_str('1.6', "DiskX belongs to mdX")
        i = 0
        for mdCounter, mdValue in enumerate(sorted(md['devices'])):
            # 1.0.x md names
            pp.add_str('1.0.' + str(mdCounter), mdValue)
            # 1.1.x md status
            pp.add_str('1.1.' + str(mdCounter), to_bool(md['devices'][mdValue]['active']))
            # 1.2.x.0.0 md disk count
            pp.add_str('1.2.' + str(mdCounter) + '.0.0', mdValue + " disk count")
            # 1.2.x.1.0 md non faulted disk count
            pp.add_str('1.2.' + str(mdCounter) + '.1.0', mdValue + " non faulted disk count")
            # 1.3.x md type
            pp.add_str('1.3.' + str(mdCounter), md['devices'][mdValue]['personality'])
            fault = 0; count = 0
            for dCounter, dValue in enumerate(md['devices'][mdValue]['disks']):
                pp.add_str('1.4.' + str(i), dValue)
                disk_fault = to_bool(md['devices'][mdValue]['disks'][dValue]['faulty'], "rev")
                fault = fault + to_bool(md['devices'][mdValue]['disks'][dValue]['faulty'])
                pp.add_str('1.5.' + str(i), disk_fault)
                pp.add_str('1.6.' + str(i), mdValue)
                i = i + 1
                count = count + 1
            # 1.2.x.0.x & 1.2.x.1.x
            pp.add_str('1.2.' + str(mdCounter) + '.0.1', count)
            pp.add_str('1.2.' + str(mdCounter) + '.1.1', fault)

LM_FileMonitor.py: (/usr/share/snmp/extensions/lm/libraries)

#!/usr/bin/python
import sys
sys.path.append('/usr/share/snmp/extensions/lm/libraries')
import os
import time
import glob
from stat import *

dirs = []

# Nice little function to convert True / None to 0 and False to 1
def to_bool(s):
    if s == True or s == None:
        return 0
    elif s == False:
        return 1
    else:
        return s

def testModule(config):
    for dir in config['file_monitor']['directories']:
        if os.path.isdir(dir):
            dirs.append(dir)
    if len(dirs) > 0:
        return True
    else:
        return False

def doUpdate(pp, config):
    # Test if we should run this module and return anything
    if testModule(config):
        x = 0
        pp.add_str('2.0.0', "Directories")
        pp.add_str('2.0.1', dirs)
        for dir in dirs:
            os.chdir(dir)
            pp.add_str('2.2.0', "Last Check Time (epoch)")
            pp.add_str('2.2.1', time.time())
            pp.add_str('2.2.3', "Files")
            pp.add_str('2.2.4', "Modification Time at last check (epoch)")
            pp.add_str('2.2.5', "Full Path")
            pp.add_str('2.2.6', "File size")
            pp.add_str('2.2.7', "Created (epoch)")
            pp.add_str('2.2.8', "Access Time (epoch)")
            for counter, value in enumerate(glob.glob("*")):
                statinfo = os.stat(value)
                pp.add_str('2.2.3.' + str(counter), value)
                pp.add_str('2.2.4.' + str(counter), str(statinfo[8]))
                pp.add_str('2.2.5.' + str(counter), dir + '/' + value)
                pp.add_str('2.2.6.' + str(counter), str(statinfo[6]))
                pp.add_str('2.2.7.' + str(counter), str(statinfo[9]))
                pp.add_str('2.2.8.' + str(counter), str(statinfo[7]))
            x = x + 1
    else:
        pp.add_str('2', "LM_FileMonitor module did not find any directories to monitor")

LM_dns_stats.py: (/usr/share/snmp/extensions/lm/libraries)

#!/usr/bin/python
import sys
sys.path.append('/usr/share/snmp/extensions/lm/libraries')
import os
import re
import subprocess

dns_stats = []
dns_desc = []

def testModule(config):
    for instance in config['dns']['instances']:
        stats_file = instance['stats']
        if os.path.exists(stats_file) and os.access(stats_file, os.R_OK):
            dns_stats.append(stats_file)
            dns_desc.append(instance['name'])
    if len(dns_stats) > 0:
        return True
    else:
        return False

def doUpdate(pp, config):
    if testModule(config):
        i, dnsx = 0, 0
        for stats_file in dns_stats:
            if dnsx == len(config['dns']['instances']):
                dnsx = 0
            pp.add_str('3.0.' + str(dnsx), str(dns_desc[dnsx]))
            pp.add_str('3.1.' + str(dnsx), "Stats from " + stats_file)
            script = "/usr/share/snmp/extensions/lm/helper_scripts/stats.sh"
            incoming = subprocess.check_output([script, stats_file, 'incoming'])
            outgoing = subprocess.check_output([script, stats_file, 'outgoing'])
            resolver = subprocess.check_output([script, stats_file, 'resolver'])
            socket = subprocess.check_output([script, stats_file, 'socket'])
            ## Incoming
            i = 0
            for s in re.split('\n', incoming):
                data = re.split(',', s)
                if len(data[0]) > 0 and data[1] > 0:
                    pp.add_str('3.2.0.0.' + str(i), data[0])
                    pp.add_str('3.2.1.' + str(dnsx) + '.' + str(i), data[1])
                i = i + 1
            ## Outgoing
            i = 0
            for s in re.split('\n', outgoing):
                data = re.split(',', s)
                if len(data[0]) > 0:
                    pp.add_str('3.3.0.0.' + str(i), data[0])
                    pp.add_str('3.3.1.' + str(dnsx) + '.' + str(i), data[1])
                i = i + 1
            ## Resolver
            i = 0
            for s in re.split('\n', resolver):
                data = re.split(',', s)
                if len(data[0]) > 0:
                    pp.add_str('3.4.0.0.' + str(i), data[0])
                    pp.add_str('3.4.1.' + str(dnsx) + '.' + str(i), data[1])
                i = i + 1
            ## Socket
            i = 0
            for s in re.split('\n', socket):
                data = re.split(',', s)
                if len(data[0]) > 0:
                    pp.add_str('3.5.0.0.' + str(i), data[0])
                    pp.add_str('3.5.1.' + str(dnsx) + '.' + str(i), data[1])
                i = i + 1
            dnsx = dnsx + 1
    else:
        pp.add_str('3', "LM_dns_stats module did not find any bind instances from the config file")

stats.sh: (/usr/share/snmp/extensions/lm/helper_scripts/)

#!/bin/bash
file=${1}

[ -x ${file} ] && exit 0
[ -f ${file} ] || exit 0

dnsNames=(A A6 AAAA ANY CNAME DNSKEY DS MX NAPTR NS PTR SOA SPF SRV TXT)
resNames=('mismatch responses received' 'IPv4 queries sent' 'IPv4 responses received' 'NXDOMAIN received' 'SERVFAIL received' 'FORMERR received' 'query retries' 'query timeouts' 'queries with RTT < 10ms' 'queries with RTT 10-100ms' 'queries with RTT 100-500ms' 'queries with RTT 500-800ms' 'queries with RTT 800-1600ms' 'queries with RTT > 1600ms')
sockNames=('UDP/IPv4 sockets opened' 'UDP/IPv4 sockets closed' 'UDP/IPv4 socket bind failures' 'UDP/IPv4 connections established' 'UDP/IPv4 recv errors' 'TCP/IPv4 sockets opened' 'TCP/IPv4 sockets closed' 'TCP/IPv4 socket bind failures' 'TCP/IPv4 connections established' 'TCP/IPv4 recv errors')

now_epoch=$(date +%s)
mtime_epoch=$(stat ${file} -c %W)

function updateStats {
    [ -f ${file} ] && [ $((now_epoch-mtime_epoch)) -gt 300 ] && rm -f $file && rndc stats
}

function getStats {
    start=$1
    end=$2
    regx1="sed -n '/${start}/,/${end}/p'"
    data=$(cat ${file} | \
        eval ${regx1} | \
        egrep '[0-9]' | \
        awk '{ print $1" "$2 }')
    while read value name; do
        names+=(${name})
        values+=("${value}")
    done <<< "${data}"
    for n in $(eval echo \${${3}[@]}); do
        regx2="${n}"
        if [[ ! "${names[@]}" =~ "$regx2" ]]; then
            names+=(${n})
            values+=("0")
        fi
    done
    x=0
    for n in ${names[@]}; do
        echo "${names[$x]},${values[$x]}"
        x=$((x+1))
    done
}

function inStats {
    getStats "Incoming Q" "Outgoing Q" "dnsNames"
}

function outStats {
    getStats "Outgoing Q" "^+" "dnsNames"
}

function resStats {
    for n in "${resNames[@]}"; do
        regexp=" ${n}$"
        name=$(echo ${n})
        value=$(egrep "${regexp}" ${file} | \
            sed -n 's/.* \([0-9]*\) \([A-Za-z].*\)/\1/p')
        [ ${#value} -eq 0 ] && value=0
        echo "${name},${value}"
    done
}

function resSocket {
    for n in "${sockNames[@]}"; do
        regexp=" ${n}$"
        value=$(egrep "${regexp}" ${file} | \
            sed -n 's/.* \([0-9]*\) \([A-Za-z].*\)/\1/p')
        [ ${#value} -eq 0 ] && value=0
        echo "${n},${value}"
    done
}

# All returns get the timestamps..
echo "stats_epoch,${mtime_epoch}"
echo "now_epoch,${now_epoch}"

case ${2} in
    incoming) inStats;;
    outgoing) outStats;;
    resolver) resStats;;
    socket)   resSocket;;
    *)        inStats;;
esac

# Call updateStats last.. if we destroy the file first and then query it, values will be small to 0,
# as the counters start at the time of stats creation. If we call it at the end, the next poll will
# have data covering the full poll-to-poll window.
updateStats

exit 0
  12. Hey guys, I just thought of something interesting that would be a nice-to-have. Rather than exporting the entire dashboard to XML, it would be neat to export a single widget. The same with importing... when adding a new widget, one of the options would be "widget from XML". That way people could post samples/templates of a single widget and others could manipulate the data / import it into their dashboard. The thought came to me today as I had just gotten done adding two graphs to a dashboard, but now I'm exporting to XML to add 20+ datapoints to each graph.
  13. Additional Note, it looks like this goes along with
  14. I'd like to second this... We need this functionality (like Nagios event handlers):
If ____ fails, restart the service.
If the service has been restarted _____ times and _____ is still failing, THEN send an alert.
Or a worded example.. let's say DNS resolution is failing... rare, but usually named just needs to be restarted and life goes on... why do I need to be paged to do that at 2 in the morning when LM can do it for me? "DNS resolution has failed" -> try to restart the service twice (user specified). If DNS resolution has not recovered, then page that the service is down and that we have tried to restart the service twice without resolution.
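Roughly what I mean, as a sketch in plain Python (the check and restart commands are placeholders, not anything LM ships):

    import subprocess
    import time

    MAX_RESTARTS = 2  # user specified

    def dns_ok():
        # Placeholder health check: a quick lookup against the local resolver
        return subprocess.call(
            ["dig", "+time=2", "+tries=1", "example.com", "@127.0.0.1"],
            stdout=subprocess.DEVNULL) == 0

    def handle_dns_failure():
        for attempt in range(1, MAX_RESTARTS + 1):
            subprocess.call(["systemctl", "restart", "named"])
            time.sleep(10)
            if dns_ok():
                return  # recovered after a restart -- nobody gets paged at 2am
        send_page(f"DNS resolution is still failing after {MAX_RESTARTS} restarts of named")

    def send_page(message):
        print(message)  # stand-in for the real alert/escalation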
  15. Thank you for this.. I was kicking around ideas to figure this out but you beat me to it!