K8sGateway Version
1.17.3
Kubernetes Version
1.30.6
Describe the bug
The user is trying out loadBalancingWeight for failover hosts. Currently, they are using loadBalancingWeight only for primary hosts.
While trying that out, they noticed that when localityWeightedLbConfig: {} and maglev: {} are both set under spec.loadBalancerConfig in the Upstream, we consistently get 503 errors when we invoke an API endpoint, even though all of the hosts, primary and failover, are correct and healthy. The respCodeDetails shows no_healthy_upstream. The error goes away if we deliberately make the primary host invalid, OR keep only one of localityWeightedLbConfig or maglev (a minimal sketch of the Upstream is shown below).
It seems that the failover plugin is affecting the health checks.
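For reference, here is a minimal sketch of the kind of Upstream being described, assuming Gloo's static + failover API. The hostnames come from this report; the resource name, namespace, ports, weights, and the exact field names under spec.failover are illustrative and may not match the user's actual resource:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: service-blue            # illustrative name
  namespace: gloo-system        # illustrative namespace
spec:
  static:
    hosts:
    # primary host, ends up at priority 0
    - addr: service-blue.default.svc.cluster.local
      port: 8080                # placeholder port
      loadBalancingWeight: 1    # weight currently used only on primary hosts
  failover:
    prioritizedLocalities:
    - localityEndpoints:        # these hosts end up at priority 1
      - loadBalancingWeight: 1  # the weight being tried for failover hosts
        lbEndpoints:
        - address: service-green.default.svc.cluster.local
          port: 8080            # placeholder port
        - address: service-yellow.default.svc.cluster.local
          port: 8080            # placeholder port
  loadBalancerConfig:
    localityWeightedLbConfig: {}   # the combination that triggers
    maglev: {}                     # 503 / no_healthy_upstream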
@andy-fong and I were troubleshooting this on Slack, and here is the latest we found:
service-blue.default.svc.cluster.local doesn't have a priority set, so it defaults to 0. The other two, service-green.default.svc.cluster.local and service-yellow.default.svc.cluster.local, are set to priority 1, and Andy sees this:
[2024-12-13 23:57:35.341][1][trace][upstream] [external/envoy/source/common/upstream/load_balancer_impl.cc:247] recalculated priority state: priority level 0, healthy weight 0, total weight 1, overprovision factor 140, healthy result 0, degraded result 0
[2024-12-13 23:57:35.341][1][trace][upstream] [external/envoy/source/common/upstream/load_balancer_impl.cc:247] recalculated priority state: priority level 1, healthy weight 1, total weight 3, overprovision factor 140, healthy result 46, degraded result 0
Priority 0 has no healthy weight, but then it should fall back to level 1, which seems to have a healthy weight of 1. What's strange is that I am seeing quite a few of these even when the health check gets a 200 response; it still marks the host as failed:
A good health check looks exactly the same:
Expected Behavior
When using the Maglev consistent hashing method, the failover plugin should not alter the health checks of healthy upstream hosts.
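For context, the active health check referred to above would be defined on the Upstream roughly as follows; this is a sketch assuming Gloo's Envoy-style healthChecks API, with a hypothetical path, interval, and thresholds rather than the user's real settings:

spec:
  healthChecks:
  - httpHealthCheck:
      path: /healthz          # hypothetical health endpoint
    interval: 5s
    timeout: 1s
    healthyThreshold: 1
    unhealthyThreshold: 3     # in this report, hosts are marked failed even after 200 responses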
Steps to reproduce the bug
Healthy
Unhealthy
503 with maglev:
healthy:
gateway-proxy-maglev-trace.log
config_dump.json
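For orientation, the relevant fragment of the attached config_dump.json for a setup like this would presumably look something like the Envoy cluster below (rendered as YAML; the cluster name, ports, and weights are illustrative, not copied from the actual dump):

name: default-service-blue-8080      # illustrative cluster name
lb_policy: MAGLEV                    # from maglev: {}
common_lb_config:
  locality_weighted_lb_config: {}    # from localityWeightedLbConfig: {}
  # the default overprovisioning factor is 140, matching the trace lines above;
  # Envoy assigns each priority a load of min(100, 140 * healthy_weight / total_weight),
  # so priority 1 with healthy weight 1 out of total weight 3 yields ~46 ("healthy result 46")
load_assignment:
  endpoints:
  - priority: 0                      # primary host
    load_balancing_weight: 1         # locality weight used by locality-weighted LB
    lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: service-blue.default.svc.cluster.local
            port_value: 8080         # placeholder
  - priority: 1                      # failover hosts
    load_balancing_weight: 1
    lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: service-green.default.svc.cluster.local
            port_value: 8080         # placeholder
    - endpoint:
        address:
          socket_address:
            address: service-yellow.default.svc.cluster.local
            port_value: 8080         # placeholder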
Additional Environment Detail
No response
Additional Context
No response