FYI: LM can trigger ESXi 6.5 hostd to crash



Hi,

I just got done working with VMware support on an issue where our ESXi 6.5 hostd process would crash during boot. We eventually traced it back to a bug in some vSAN code that LM monitoring polls; it doesn't matter whether you're running vSAN in your environment or not. Our workaround has been to disable host-level monitoring in LM for our ESXi hosts for now, and it's been stable ever since.

The expected fix is scheduled for release in Q3 2018 from VMware.

  • LogicMonitor Staff

@Eric Singer, @Ryan B, @PatrickATL,

I've got an update on this, I appreciate your patience.

First off, I want to note that neither the collector nor any DataSources are explicitly calling vSAN methods, whether you have it installed and enabled or not. We ship the vSAN SDK with newer collector versions, but there aren't any core DataSources using it just yet.

The vSAN queries that brought down @Eric Singer's device appear to be triggered on the server when we call methods to get hardware and version information about a given host. This is based on observing the same behavior with both the official VMware SDK and the open-source YAVIJAVA SDK. This means we can't avoid making at least some inadvertent vSAN calls unless VMware changes this, but we do have a mitigation route.

Specifically, these three things can trigger the calls:

  • Auto Properties identifying your ESX host and updating version info (this runs infrequently and doesn't generate many calls, likely not an issue)
  • VMware_vSphere_HostPerformance's AD script - This is the biggest offender, kicking off about 30 vSAN calls in our test environment. A fix is in the works, but it won't be backwards compatible with the current version, as the instance names will change. The fix currently triggers only 5 vSAN calls per AD run when applied directly to ESX.
  • VMware_vSphere_HardwareSensors' AD script - Only triggers one vSAN call per AD run, likely not an issue
     

The effect is larger when the modules are applied to ESX directly. When those modules are applied to vCenter, some vSAN calls are still made on the host, but not as many (1-4).

Based on the great info we got from @Eric Singer and VMware, we're confident that the changes to VMware_vSphere_HostPerformance will sufficiently mitigate this issue.

We haven't yet been able to reproduce the crash in our lab by rebooting and forcing AD repeatedly.

@PatrickATL, I appreciate the offer. You might check /var/log/hostd.log for floods of calls to vSAN. Luckily the conditions for a crash seem fairly difficult to come by.
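If you want a quick way to check, something like the sketch below can count vSAN-related entries in a hostd log. This is purely illustrative: the `vsan` substring match and the synthetic sample lines are assumptions, since the exact hostd.log message format varies by ESXi build.

```python
import re

# Hypothetical sketch: count vSAN-related entries in a hostd log.
# The "vsan" substring match is an assumption about the log format;
# adjust the pattern to match what your hostd.log actually contains.
def count_vsan_calls(log_path):
    pattern = re.compile(r"vsan", re.IGNORECASE)
    with open(log_path) as log:
        return sum(1 for line in log if pattern.search(line))

# Example with a synthetic log file (real hostd.log lines will differ):
with open("hostd-sample.log", "w") as f:
    f.write("info hostd[A] VsanSystemEx: query health\n")
    f.write("info hostd[B] HostSystem: GetRuntime\n")
    f.write("info hostd[C] vsan stub waiting on reply\n")

print(count_vsan_calls("hostd-sample.log"))  # 2
```

A flood of matches around boot time would be consistent with the behavior described above.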
 
I will update this ticket when the fix is released and make sure our Customer Success team gets this info to the other 6.5 users. In the meantime, you should consider disabling VMware_vSphere_HostPerformance on ESX hosts you expect to reboot; you can still safely monitor them through vCenter. Expect a fix early next week.

Thanks again for your help on this. Please reach out if you have additional questions or concerns.

  • LogicMonitor Staff

We've released an updated version of VMware_vSphere_HostPerformance. It breaks backwards compatibility with the version 1 series. It also only applies to vCenter by default, to further mitigate vSAN calls triggered directly on the host. When applied directly, AD now triggers 5 vSAN calls as opposed to 30.

If you want to keep the historical data before upgrading, you can rename version 1.x of the DataSource and then disable it.

The locator code for version 2 is 99EKKN.
 

On 6/5/2018 at 6:36 PM, Michael Rodrigues said:

We've released an updated version of VMware_vSphere_HostPerformance. It breaks backwards compatibility with the version 1 series. It also only applies to vCenter by default, to further mitigate vSAN calls triggered directly on the host. When applied directly, AD now triggers 5 vSAN calls as opposed to 30.

If you want to keep the historical data before upgrading, you can rename version 1.x of the DataSource and then disable it.

The locator code to get version 2 is 99EKKN
 

 


DO NOT comment out the AppliesTo field on the DataSource! This will remove all historical data, which I can only imagine most of us want to keep. You can disable the DataSource by creating a device group (if you don't have one already) and populating it with all of your ESX hosts. Then, at the group level, select the Alert Tuning tab and uncheck the box next to the DataSource. This disables polling and alerting but lets you keep the historical data.


No KB that I'm aware of. Their RCA was:

 

 

Good Morning!

Here is the root cause our Engineering team has identified.
Looking at the threads in hostd, we see that many threads are blocked on the lock of the host managed object.
11 threads (threads 12, 14, 15, 16, 17, 18, 19, 20, 21, 26, 27) were blocked trying to read-lock the host.

The thread that holds the read lock is thread 2. It is blocked in some vSAN code.

Code in the GetRuntime() property performs RPC operations and blocks waiting on a condition variable, which caused the deadlock.
The outcome depends on whether the event the vSAN stub was waiting for would be generated from an I/O thread (in which case the thread would eventually be unblocked) or needed a worker thread to be generated (in which case it becomes a deadlock by thread starvation).

As the root cause of the bug is a piece of vSAN code causing a deadlock, our Engineering team is working with the vSAN team to get insight into the respective property.
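For readers who want to see that failure mode in miniature, here is a hypothetical Python sketch (hostd is C++ and the real locking is far more involved) of a starvation deadlock of the kind VMware describes: a thread holds the host object's lock while waiting on an event that only a pool worker can signal, but every worker is queued behind that same lock. A timeout is used so the sketch terminates instead of hanging forever.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins: host_lock plays the host managed object's lock,
# vsan_reply plays the event the vSAN stub waits on.
host_lock = threading.Lock()
vsan_reply = threading.Event()

def signal_reply():
    # The worker needs the host lock before it can produce the reply...
    with host_lock:
        vsan_reply.set()

def read_host_property(pool):
    # ...but the reader already holds that lock while it waits, so the
    # worker can never run to completion: deadlock by thread starvation.
    with host_lock:
        pool.submit(signal_reply)
        return vsan_reply.wait(timeout=1.0)  # a real deadlock waits forever

with ThreadPoolExecutor(max_workers=1) as pool:
    got_reply = read_host_property(pool)

print(got_reply)  # False: the reply never arrives while the lock is held
```

In hostd's case there is no timeout, so the blocked threads pile up behind the host lock until the process is killed or crashes.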




 

Edited by Eric Singer
