Eric Singer

FYI: LM can trigger ESXi 6.5 hostd to crash


Hi,

I just got done working with VMware support on an issue where our ESXi 6.5 hostd process would crash during the boot phase.  We eventually traced it back to a bug in some vSAN code that LM monitoring polls.  It doesn't matter whether you're running vSAN in your environment or not.  Our workaround has been to disable host-level monitoring in LM for our ESXi hosts for now, and it's been stable ever since.

The expected fix is scheduled for release in Q3 2018 from VMware.



Hey @Eric Singer, thanks for bringing this to our attention. We've got our Collector Team looking into how to mitigate this now.

We're also working to identify customers monitoring ESXi 6.5 so we can notify them proactively.

I will update this thread as we learn more.


Does this only affect ESXi hosts added directly, or also ESXi hosts monitored underneath a vCenter added to LogicMonitor?

On 5/21/2018 at 1:13 PM, Ryan B said:

Does this only affect ESXi hosts added directly, or also ESXi hosts monitored underneath a vCenter added to LogicMonitor?

 

Only for hosts directly added.


Do you mind posting a screenshot or a list of the DataSources you have applied to your hosts? Also, what build of 6.5 are you running?


@Eric Singer, @Ryan B, @PatrickATL,

I've got an update on this; I appreciate your patience.

First off, I want to note that neither the collector nor any DataSources are explicitly calling vSAN methods, whether you have it installed and enabled or not. We ship the vSAN SDK with newer collector versions, but there aren't any core DataSources using it just yet.

The vSAN queries that brought down @Eric Singer's device appear to be triggered on the server when we call methods to get hardware and version information about a given host. This is based on seeing the same behavior with both the official VMware SDK and the open-source YAVIJAVA SDK. It means we can't avoid making at least some inadvertent vSAN calls unless VMware changes this, but we do have a mitigation route.
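
For anyone who wants to picture the mechanism, here's a minimal sketch (assuming the pyvmomi Python SDK and placeholder host credentials; this is not our collector code, which uses the Java SDKs mentioned above) of the kind of per-host version and hardware property reads involved. On the client side these are ordinary reads; based on the behavior above, it's hostd that ends up issuing vSAN RPCs while servicing them.

# Illustrative sketch only: ordinary per-host property reads (version, hardware,
# runtime). The host name and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only; validate certs in production
si = SmartConnect(host="esx01.example.com", user="root", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        # Reading these summary properties is all the client does; any vSAN
        # activity happens inside hostd while it services the request.
        print(host.name,
              host.summary.config.product.fullName,   # version info
              host.summary.hardware.model,            # hardware info
              host.summary.runtime.connectionState)   # runtime state
finally:
    Disconnect(si)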

Specifically, these three things can trigger the calls:

  • Auto Properties identifying your ESX host and updating version info (this runs infrequently and doesn't generate many calls, so it's likely not an issue)
  • VMware_vSphere_HostPerformance's AD script - This is the biggest offender, kicking off about 30 vSAN calls in our test environment. A fix is in the works, but it won't be backwards compatible with the current version, since the instance names will change. The fix currently triggers only 5 vSAN calls per AD run when applied directly to ESX.
  • VMware_vSphere_HardwareSensors AD script - Only triggers one call per run, so it's likely not an issue

The effect is larger when the modules are applied to ESX directly. When those modules are applied to vCenter, some vSAN calls are still made on the host, but not as many (1-4).

Based on the great info we got from @Eric Singer and VMware, we're confident that the changes to VMware_vSphere_HostPerformance will sufficiently mitigate this issue.

We haven't yet been able to reproduce the crash in our lab by rebooting and forcing AD repeatedly.

@PatrickATL, I appreciate the offer. You might check /var/log/hostd.log for floods of calls to vSAN. Luckily the conditions for a crash seem fairly difficult to come by.
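
If you want a quick way to eyeball that, here's a rough sketch (assuming the log has been copied somewhere with Python available; the path and the 13-character timestamp prefix are assumptions based on the usual hostd.log format) that counts vSAN-related lines per hour so a flood stands out:

# Rough check only: count vSAN-related lines in hostd.log, bucketed by hour.
from collections import Counter

hits = Counter()
with open("/var/log/hostd.log", errors="replace") as log:
    for line in log:
        if "vsan" in line.lower():
            hits[line[:13]] += 1   # bucket by "YYYY-MM-DDTHH"

for hour, count in sorted(hits.items()):
    print(hour, count)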
 
I will update this ticket when the fix is released and make sure our Customer Success team gets this info to the other 6.5 users. In the meantime, you should consider disabling VMware_vSphere_HostPerformance on ESX hosts you expect to reboot; you can still safely monitor them through vCenter. Expect a fix early next week.

Thanks again for your help on this. Please reach out if you have additional questions or concerns.


We've released an updated version of VMware_vSphere_HostPerformance. It breaks backwards compatibility with the version 1 series. It also only applies to vCenter by default, to further mitigate vSAN calls triggered directly on the host. When applied directly, AD now triggers 5 vSAN calls as opposed to 30.

If you want to keep the historical data before upgrading, you can rename version 1.x of the DataSource and then disable it.

The locator code to get version 2 is 99EKKN.


DO NOT comment out the AppliesTo field on the DataSource!  This will remove all historical data, which I can only imagine most of us want to keep.  You can disable the DataSource by creating a device group (if you don't have one already) and populating it with all of your ESX hosts. Then, at the group level, select the Alert Tuning tab and uncheck the box next to the DataSource.  This disables polling and alerting but lets you keep the historical data.


@Eric Singer - Any chance VMware provided you with a KB that documents this as a known issue/bug?  I'd like to provide as much context as possible to our ESX admins.

Thanks!


No KB that I'm aware of.  Their RCA was as follows:

 

 

Good Morning!

Here is the root cause our Engineering team has identified.
Looking at the threads in hostd, we see that there are lots of threads blocked on the lock of the host managed object.
11 threads (threads 12, 14, 15, 16, 17, 18, 19, 20, 21, 26, 27) were blocked trying to read-lock the host.

The thread that holds the read lock is thread 2; it is blocked in some vSAN code.

Code in the GetRuntime() property performed some RPC operations and blocked waiting on a condition variable, which caused the deadlock.
The outcome depends on whether the event the vSAN stub was waiting for would be generated from an I/O thread (in which case the thread would eventually be unblocked) or needed a worker thread to be generated (in which case it becomes a deadlock by thread starvation).

Since the root cause of the bug is a piece of vSAN code causing a deadlock, our Engineering team is working with the vSAN team to get insight into the respective property.
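
To make the failure mode above easier to picture, here's a small self-contained sketch (plain Python threads, not hostd code; all names are made up) of a deadlock by thread starvation: one pool thread holds the "host" lock and waits for a reply that can only be produced by another queued task, while the remaining workers are stuck waiting for that same lock. A timeout is included so the demo terminates instead of hanging:

# Illustration only: thread-starvation deadlock similar to the one in the RCA.
# All names are placeholders.
import threading
from concurrent.futures import ThreadPoolExecutor

host_lock = threading.Lock()     # stand-in for the host managed object's lock
vsan_reply = threading.Event()   # stand-in for the condition the vSAN stub waits on
lock_held = threading.Event()    # demo plumbing: ensures get_runtime grabs the lock first

def get_runtime():
    with host_lock:                              # "thread 2": holds the lock...
        lock_held.set()
        arrived = vsan_reply.wait(timeout=2)     # ...then waits for a reply that
        return "ok" if arrived else "starved"    # needs a free worker thread

def other_host_call(n):
    with host_lock:                              # "threads 12, 14, ...": blocked
        return f"call {n} done"                  # trying to take the same lock

def send_vsan_reply():
    vsan_reply.set()                             # queued, but no worker is free

with ThreadPoolExecutor(max_workers=3) as pool:
    runtime = pool.submit(get_runtime)
    lock_held.wait()                             # make the ordering deterministic
    blocked = [pool.submit(other_host_call, n) for n in range(2)]
    pool.submit(send_vsan_reply)                 # stuck in the queue behind them
    print(runtime.result())                      # prints "starved" after the timeout

Without the timeout, get_runtime would never return: the event can only be set by a task that needs a free worker thread, and every worker is blocked behind the lock get_runtime holds.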




 


