This article provides information on troubleshooting high CPU usage on the Collector.

(1) General Best Practices

(a) First and foremost, we advise customers to run the latest General Release Collector (unless advised otherwise). Further information on Collector versions is available at the link below:
https://www.logicmonitor.com/support/settings/collectors/collector-versions/
The release notes for each new Collector version also indicate whether any known issues have been fixed:
https://www.logicmonitor.com/releasenotes/

(b) Please also review our Collector Capacity guide for a full overview of how to optimise Collector performance:
https://www.logicmonitor.com/support/settings/collectors/collector-capacity/

(c) When reporting high CPU usage, it is useful to state whether the usage is high all the time or only during a certain timeframe, and whether any environmental changes on the physical machine may have triggered the issue. Please also state whether the issue began after adding new devices to the Collector or after applying a particular Collector version.

(2) Common Issues

This section goes through some of the common issues that have been fixed or worked on by our development teams.

(A) Check whether the CPU is being used by the Collector (Java process), by SBProxy, or by other processes.

(i) To monitor the Collector Java process: use the datasource Collector JVM status to check the CPU usage of the Collector (Java process), as shown below.

(ii) To monitor SBProxy usage: use the datasource WinProcessStats.xml for Windows Collectors. The equivalent datasource for Linux is still being developed.
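As a quick host-side complement to the datasources in (A), the sketch below sums CPU usage per process name from `ps -eo pcpu,comm` style output on a Linux Collector host. It is a minimal illustration and not LogicMonitor tooling: the sample text is made up, and the process names `java` and `sbproxy` are assumptions about how the Collector processes appear on your host.

```python
from collections import defaultdict


def cpu_by_process(ps_output: str) -> dict[str, float]:
    """Sum %CPU per command name from `ps -eo pcpu,comm` style output."""
    totals: dict[str, float] = defaultdict(float)
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        pcpu, comm = parts
        try:
            totals[comm.strip()] += float(pcpu)
        except ValueError:
            continue  # ignore rows whose first column is not numeric
    return dict(totals)


# Hypothetical sample output. On a real host it could be captured with
# subprocess.run(["ps", "-eo", "pcpu,comm"], capture_output=True, text=True).stdout
SAMPLE = """\
%CPU COMMAND
72.5 java
 3.0 sbproxy
 0.5 sshd
 1.5 java
"""

usage = cpu_by_process(SAMPLE)
# Print the heaviest CPU consumers first.
for name, pct in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {pct:5.1f}%")
```

If the Java process dominates, continue with (B); if SBProxy dominates, see (C).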
(B) If the high CPU usage is caused by the Collector Java process, below are some of the common causes:

(i) Collector Java process using high CPU

How to confirm whether you are hitting this issue: the Collector wrapper.log contains many entries like the one below:

    DataQueueConsumers$DataQueueConsumer.run:338] Un-expected exception - Must be BUG, fix this, CONTEXT=, EXCEPTION=The third long is not valid version - 0
    java.lang.IllegalArgumentException: The third long is not valid version - 0
        at com.santaba.agent.reporter2.queue.QueueItem$Header.deserialize(QueueItem.java:66)
        at com.santaba.agent.reporter2.queue.impl.QueueItemSerializer.head(QueueItemSerializer.java:35)

This issue has been fixed in Collector version EA 23.200.

(ii) CPU load spikes on Linux Collectors

As shown in the image below, the CPU usage of the Collector Java process spikes periodically (on an hourly basis). This issue has been fixed in Collector version EA 23.026.

(iii) Excessive CPU usage despite the Collector not having any devices running on it

In the Collector wrapper.log, you can see entries similar to the one below:

    [04-11 10:32:20.653 EDT] [MSG] [WARN] [pool-20-thread-1::sse.scheduler:sse.scheduler] [SSEChunkConnector.getStreamData:87] Failed to get SSEStreamData, CONTEXT=current=1491921140649(ms), timeout=10000, timeUnit=MILLISECONDS, EXCEPTION=null
    java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at com.logicmonitor.common.sse.connector.sseconnector.SSEChunkConnector.getStreamData(SSEChunkConnector.java:84)
        at com.logicmonitor.common.sse.processor.ProcessWrapper.doHandshaking(ProcessWrapper.java:326)
        at com.logicmonitor.common.sse.processor.ProcessorDb._addProcessWrapper(ProcessorDb.java:177)
        at com.logicmonitor.common.sse.processor.ProcessorDb.nextReadyProcessor(ProcessorDb.java:110)
        at com.logicmonitor.common.sse.scheduler.TaskScheduler$ScheduleTask.run(TaskScheduler.java:181)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

This issue has been fixed in Collector version EA 24.085.

(iv) SSE process stdout and stderr streams not consumed on Windows

Please note that this issue occurs only on Windows Collectors. The CPU usage of the Windows operating system has a stair-step shape, as shown below. This issue has been fixed in Collector version EA 23.076.

(v) Collector goes down intermittently on a daily basis

In the Collector wrapper.log, you can see entries similar to the ones below:

    [12-21 13:10:48.661 PST] [MSG] [INFO] [pool-60-thread-1::heartbeat:check:4741] [Heartbeater._printStackTrace:265] Dumping HeartBeatTask stack, CONTEXT=startedAt=1482354646203, stack=
    Thread-40 BLOCKED
    java.io.PrintStream.println (PrintStream.java.805)
    com.santaba.common.logger.Logger2$1.print (Logger2.java.65)
    com.santaba.common.logger.Logger2._log (Logger2.java.380)
    com.santaba.common.logger.Logger2._mesg (Logger2.java.284)
    com.santaba.common.logger.LogMsg.info (LogMsg.java.15)
    com.santaba.agent.util.Heartbeater$HeartBeatTask._run (Heartbeater.java.333)
    com.santaba.agent.util.Heartbeater$HeartBeatTask.run (Heartbeater.java.311)
    java.util.concurrent.Executors$RunnableAdapter.call (Executors.java.511)
    java.util.concurrent.FutureTask.run (FutureTask.java.266)
    java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java.1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java.617)
    java.lang.Thread.run (Thread.java.745)
    [12-21 13:11:16.597 PST] [MSG] [INFO] [pool-60-thread-1::heartbeat:check:4742] [Heartbeater._printStackTrace:265] Dumping HeartBeatTask stack, CONTEXT=startedAt=1482354647068, stack=
    Thread-46 RUNNABLE
    java.io.PrintStream.println (PrintStream.java.805)
    com.santaba.common.logger.Logger2$1.print (Logger2.java.65)
    com.santaba.common.logger.Logger2._log (Logger2.java.380)
    com.santaba.common.logger.Logger2._mesg (Logger2.java.284)
    com.santaba.common.logger.LogMsg.info (LogMsg.java.15)
    com.santaba.agent.util.Heartbeater$HeartBeatTask._run (Heartbeater.java.320)
    com.santaba.agent.util.Heartbeater$HeartBeatTask.run (Heartbeater.java.311)
    java.util.concurrent.Executors$RunnableAdapter.call (Executors.java.511)
    java.util.concurrent.FutureTask.run (FutureTask.java.266)
    gobler terminated ERROR 5296
    java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java.1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java.617)
    java.lang.Thread.run (Thread.java.745)

This issue has been fixed in Collector version EA 22.228.

(C) High CPU usage caused by SBProxy

(i) Collector CPU spikes up to 99%

In some cases, poor WMI or PDH data collection performance causes too many retries, and these retries consume a lot of CPU. In the Collector sbproxy.log, search for the string shown below. When this issue occurs, the retry count is nearly 100 per request.

    ,retry:

This is being investigated by our development team at this time and will be fixed in the near future.
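The log checks described in (B) and (C) can be sketched as a small script. This is a minimal illustration and not LogicMonitor tooling: the signature strings are copied from the log excerpts in this article, while the assumption that an integer retry count directly follows `,retry:` and the sbproxy sample lines are hypothetical.

```python
import re
from collections import Counter

# Signature strings copied from the wrapper.log excerpts in this article.
SIGNATURES = {
    "EXCEPTION=The third long is not valid version": "(B)(i) queue item error",
    "Failed to get SSEStreamData": "(B)(iii) SSE handshake timeout",
    "Dumping HeartBeatTask stack": "(B)(v) heartbeat stack dump",
}

# Assumption: the retry count follows ",retry:" as an integer.
RETRY_RE = re.compile(r",retry:\s*(\d+)")


def scan_wrapper_log(lines):
    """Count how many lines contain each known wrapper.log signature."""
    hits = Counter()
    for line in lines:
        for needle, label in SIGNATURES.items():
            if needle in line:
                hits[label] += 1
    return hits


def max_sbproxy_retries(lines):
    """Return the highest retry count found in sbproxy.log lines, or None."""
    counts = [int(m.group(1)) for line in lines for m in RETRY_RE.finditer(line)]
    return max(counts) if counts else None


wrapper_sample = [
    "DataQueueConsumers$DataQueueConsumer.run:338] Un-expected exception - "
    "Must be BUG, fix this, CONTEXT=, EXCEPTION=The third long is not valid version - 0",
    "java.lang.IllegalArgumentException: The third long is not valid version - 0",
    "[SSEChunkConnector.getStreamData:87] Failed to get SSEStreamData, CONTEXT=...",
]
# Hypothetical sbproxy.log lines; the real line layout may differ.
sbproxy_sample = ["request=1,retry: 97", "request=2,retry: 3"]

print(dict(scan_wrapper_log(wrapper_sample)))
print(max_sbproxy_retries(sbproxy_sample))
```

Both functions accept any iterable of lines, so an open file handle for wrapper.log or sbproxy.log can be passed in directly.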
(3) Steps to take when facing high CPU usage on the Collector

(i) Ensure the Collector has been added as a device and enabled for monitoring:
https://www.logicmonitor.com/support/settings/collectors/monitoring-your-collector/
There is a set of new datasources for the Collector (LogicMonitor Collector Monitoring Suite - 24 DataSources), shown below. Please ensure they have been updated in your portal and applied to your Collectors, and that the Linux CPU or Windows CPU datasources have also been applied to the Collector.

(ii) Record a JFR (Java Flight Recorder) recording from the debug command window of the Collector, as follows:

    // unlock commercial features
    !jcmd unlockCommercialFeatures
    // start a JFR recording; in a real troubleshooting case, increase the duration to a reasonable value
    !jcmd duration=1m delay=5s filename=test.jfr name=testjfr jfrStart
    // stop the JFR recording
    !jcmd name=testjfr jfrStop
    // upload the JFR recording
    !uploadlog test.jfr

(iii) Upload the Collector logs: from the Manage dialog you can send your logs to LogicMonitor support. Select the Manage gear icon for the desired Collector, then select 'Send logs to LogicMonitor'.

Credits: LogicMonitor Collector development team for providing valuable input for this article.
Our Tech Support team often receives questions about the Collector, such as how to navigate it, how to configure it, and where to look for hints when it is not behaving as it should. This article shares some basic usage, tips, and tricks for troubleshooting Collectors. We will cover four topics in this article:

(1) Collector Settings
(2) Collector Event Logs
(3) Running Debug commands (simple commands which can help with basic troubleshooting)
(4) Making changes on Collector Configuration

(1) Collector Settings

(a) How to check the Collector settings

As shown in the above screenshot, Settings - Collector is where you can gather all your Collector information. Key information provided here includes: the Collector ID (an ID used to identify the Collector on your portal); the device name and description of the Collector; the Collector version in use; Collector management; Collector logging; the total number of devices and services managed by the Collector; the OS platform of the Collector; and whether the Collector is in a Scheduled Downtime state. In addition, a red symbol indicates that a Collector is down.

(b) Expand the Collector to view further Collector settings

Here you can gather more information, such as the last and next update times for the Collector, the escalation chain associated with the Collector, and the relevant Stage 1 recipients. You can also download a Collector or delete an existing Collector here. Before deleting a Collector, please make sure you have reassigned its devices to other Collectors.

(c) Manage Collector

The Manage Collector option is where you can configure Collector grouping and escalation chains, and choose the appropriate failover Collector.

(d) Checking the devices managed by the Collector

This option shows which devices are managed by the Collector.
You can also change the preferred Collector of a device here.

(e) Creating an SDT (scheduled downtime) on the Collector

Go to the Collector, choose SDT, then Add SDT. The first image shows the timeframe you can configure, and the next image shows the status of your Collector.

(2) Collector Event Logs

(a) How to get to Collector events

Go to Manage Collector as shown previously, click the Support tab, then choose Collector Events.

(b) Searching Collector events

Collector Events shows various information, as shown below. By default the Collector restarts every 8 hours, and each restart is recorded here. If the Collector restarts repeatedly outside the 8-hour schedule, or constantly shows a down state, we advise you to contact Technical Support for further investigation. Keywords that you can search for include down, restarted, timeout, and unable to execute. You can also search for the hostname or IP address of a device to see the latest Collector event written for it.

(3) Running Debug commands (simple commands which can help with basic troubleshooting)

(a) How to access the Debug command window

Go to Manage, then Support, and select Run Debug Command.

(b) Debug screen

The debug screen lists all available commands. All debug commands must be preceded by a '!'. In the list of built-in commands, angle brackets (i.e. < >) indicate a value that should be replaced, and square brackets (i.e. [ ]) indicate an optional argument that may be included. If you need an example of the syntax for a particular command, type 'help !command'.

(c) !TLIST command

Running !TLIST shows all the tasks being processed by the Collector on all devices. Running !TLIST h=hostname shows the task list only for that particular device.
As you can see here, the !TLIST output includes the task id, the collection method, the status, the hostname, the datasource name, and the last execution result. The status should normally show as scheduling; if you see waiting, the Collector is taking a long time to execute tasks or is in a busy state. The last execution result indicates the status of the collection and should typically show OK; if no data is being collected, you might see NaN, not executed yet, or other errors.

(d) !TDETAIL command

!TDETAIL provides more detailed information on a task. The command requires a task id. In the example shown, I retrieved a task id from the previous !TLIST output and used it for the !TDETAIL command.

(e) !ADLIST command

Use !ADLIST to check the status of Active Discovery on your Collectors and to see whether the appropriate datasources are being applied to your devices. You can also run !ADLIST h=hostname to get information on an individual device. The key information to check is the Status and Message shown, and the !ADLIST id.

(f) !ADETAIL command

The !ADETAIL command is used to get further information on an !ADLIST entry, identified by its !ADLIST id (for example 14904353941600001), and to check the full details when a datasource isn't applied.

(g) !UPTIME, !PING and !ACCOUNT

!UPTIME is used to determine the Collector uptime. !PING is used to ping a device to check whether it is available and responding. !ACCOUNT is used to determine which account is used to install and manage the Collector (Windows platform only).

(h) !LOGSURF command

!LOGSURF taskid= is used to get further information from the logs on the status of a task id (which can be retrieved from a !TLIST command). Sometimes !LOGSURF taskid= does not return any information at first; please try it a few times until it returns information.

(4) Making changes on Collector Configuration

LogicMonitor Collectors have configuration files that can be used to control their behaviour.
You can change these configuration files by going to Collectors | Manage | Support | Collector Configuration, as shown below. There are four tabs, but this article covers only the Collector Config and Wrapper Config tabs, as these are the most commonly used.

(a) Making changes in the Collector Config tab

To make any changes in the Collector Config tab, you first need to tick the checkbox for override agent.config. Only then can you edit the Collector configuration. Here, as you can see, we could enable or disable the snmp collector, increase or decrease the snmp timeout, increase or decrease the snmp threadpool, and change the snmp asynchronous status. The same can be done for other data collection methods. Please note that after making changes you need to restart the Collector from the LogicMonitor portal: you must click Save and Restart to ensure the configuration is saved.

(b) Making changes in the Wrapper Config tab

In the wrapper configuration we could change the Initial Java Heap Size (in MB) and the Maximum Java Heap Size (in MB).

Please consult the Technical Support team before making any changes in the Collector Configuration tabs, or if you are uncertain about the impact of the changes you are making.
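Before saving a change to the heap sizes mentioned in (b), a quick sanity check of the edited text can help. The sketch below is a minimal illustration, assuming the wrapper configuration uses the Java Service Wrapper property names `wrapper.java.initmemory` and `wrapper.java.maxmemory`; confirm the key names in your own Wrapper Config tab. The values in the sample are hypothetical.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, ignoring blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_heap(props,
               init_key="wrapper.java.initmemory",   # assumed key name
               max_key="wrapper.java.maxmemory"):    # assumed key name
    """Return a list of problems with the heap settings (empty if none found)."""
    try:
        init_mb = int(props[init_key])
        max_mb = int(props[max_key])
    except KeyError as exc:
        return [f"missing setting: {exc.args[0]}"]
    except ValueError:
        return ["heap sizes must be whole numbers of MB"]
    problems = []
    if init_mb <= 0 or max_mb <= 0:
        problems.append("heap sizes must be positive")
    if init_mb > max_mb:
        problems.append("initial heap size exceeds maximum heap size")
    return problems


# Hypothetical wrapper configuration fragment.
SAMPLE = """\
# Initial Java Heap Size (in MB)
wrapper.java.initmemory=128
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=1024
"""

print(check_heap(parse_properties(SAMPLE)) or "heap settings look consistent")
```

A check like this only catches obvious mistakes; it does not tell you which heap size is appropriate for your Collector's load.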