Auto Balanced Collector Groups need a rework


SeanC

Currently, if a collector goes down, the devices it was monitoring will only fail over to other collectors in the group if doing so does not put those collectors over the Rebalance Threshold defined on the Auto Balanced Collector Group (ABCG).

This means you can't specify a Rebalance Threshold that lets instances rebalance evenly across your collectors AND lets instances fail over. It's one or the other, which defeats the whole purpose and takes the Auto out of Auto Balance.

To run collectors in an n-1 highly available configuration, you have to specify a Rebalance Threshold that is higher than the total instance count divided by (the number of collectors minus one). Then, after a failover, you have to lower the Rebalance Threshold to a number close to the total instance count divided by the number of collectors, trigger a rebalance, wait, and then set the Rebalance Threshold back to the total instance count divided by (the number of collectors minus one).
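To put numbers on the two competing values, here is a rough sketch of the arithmetic. The function names and the 9,000-instance, 4-collector figures are made up for illustration, and nothing here calls a LogicMonitor API.

```python
import math


def balanced_threshold(total_instances: int, collectors: int) -> int:
    """Threshold close to an even split: total / n, rounded up."""
    if collectors < 1:
        raise ValueError("need at least one collector")
    return math.ceil(total_instances / collectors)


def failover_threshold(total_instances: int, collectors: int) -> int:
    """Smallest per-collector load that still fits everything on n-1 collectors."""
    if collectors < 2:
        raise ValueError("n-1 failover needs at least two collectors")
    return math.ceil(total_instances / (collectors - 1))


# Hypothetical group: 9000 instances spread over 4 collectors.
even = balanced_threshold(9000, 4)      # value needed for even rebalancing
survive = failover_threshold(9000, 4)   # value needed to absorb one failed collector
print(even, survive)
```

With those example figures the even-balance value is 2250 and the failover value is 3000, so a single Rebalance Threshold cannot serve both purposes if it also caps failover.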

Madness!

I propose keeping the Rebalance Threshold field but making it actually behave like its name: when a collector's instance count is over the value, the collector tries to offload instances to other collectors in the group.

BUT, allow instances from a failed collector to offload to collectors past that threshold value so that failover continues to work.

To address the concern of collectors being pushed past their capacity, add an additional field called Max Allowed Instances. Only refuse to offload instances to a collector if doing so would push its instance count past that value, and trigger an alert when that happens.

This will allow us to have HA configurations AND auto balancing of instances work at the same time, as well as alerting us to the fact that instances have not failed over when it happens so that steps can be taken to increase the sizing/number of collectors.
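The proposal above can be sketched roughly as follows. This is only an illustration of the requested behaviour, not how the collector actually works: the `Collector` class, its field names, and the `fail_over` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Collector:
    # Hypothetical model of one collector in an ABCG.
    name: str
    instances: int
    rebalance_threshold: int    # soft limit: above this, try to offload
    max_allowed_instances: int  # proposed hard limit: never exceed this


def should_offload(c: Collector) -> bool:
    """Rebalance Threshold behaves like its name: over it, start offloading."""
    return c.instances > c.rebalance_threshold


def can_accept(c: Collector, incoming: int, failover: bool) -> bool:
    """Normal rebalancing respects the soft limit; failover only the hard one."""
    limit = c.max_allowed_instances if failover else c.rebalance_threshold
    return c.instances + incoming <= limit


def fail_over(orphaned: int, survivors: List[Collector]) -> int:
    """Spread orphaned instances over survivors, least loaded first.

    Returns the number of instances that could not be placed; a non-zero
    result is the condition that should raise an alert.
    """
    for c in sorted(survivors, key=lambda s: s.instances):
        room = max(0, c.max_allowed_instances - c.instances)
        moved = min(room, orphaned)
        c.instances += moved
        orphaned -= moved
    return orphaned


# Hypothetical group: one of four collectors (2250 instances each) has failed.
group = [Collector(f"c{i}", 2250, 2250, 3000) for i in range(3)]
unplaced = fail_over(2250, group)
if unplaced:
    print(f"ALERT: {unplaced} instances did not fail over")
```

In this sketch the survivors end up above the Rebalance Threshold after the failover, so they would try to offload again once the failed collector returns, with no manual threshold changes needed.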

 


On 9/23/2021 at 2:04 AM, Michael Rodrigues said:

@SeanC ABCG should not prevent failover even when the target collector is/would be above the rebalance threshold. If you're seeing this behavior I suggest reaching out to support to see if they can help you out.

 

We had an incident where instances didn't fail over as expected, so I raised a ticket.

Support staff were the ones who advised me that collectors won't take instances over the threshold during a failover.

I questioned the advice at the time, as it was contrary to the way I interpreted the documentation, but apparently the advice was verified with another engineer and is accurate.

I guess I'll have to set up some tests and verify.


  • LogicMonitor Staff

@SeanC I've spoken with the Collector development team about this. They've confirmed that my answer above is accurate.

I'll speak to support about the confusion. Collector dev is also going to take a look at the ticket to see what went wrong in your portal with failover.

I hope you didn't have to spend too much time doing your own verification on this. Sorry for any hassle.
