Fixes and improvements #1

Merged
merged 1 commit into from
Sep 20, 2024
Merged
63 changes: 42 additions & 21 deletions README.md
@@ -4,6 +4,14 @@ Bring Temporal Cloud Metrics into your Kubernetes cluster to inform autoscaling

![Metrics Dashboard Demo](./img/metrics-dashboard.jpg)

## How it works

This project is essentially a proxy server. Kubernetes makes an HTTP call to this service, which pulls metrics from Temporal Cloud, converts them into the format Kubernetes expects, and returns them to the cluster.

Kubernetes polls our service for metrics, which then become available to HPAs in the same Kubernetes namespace (a sketch of this exchange follows the diagram below).

![Architecture Diagram](./img/diagram.png)
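
To make the flow concrete, below is a rough sketch of the exchange: the HPA controller asks the External Metrics API for a metric, and the adapter answers with the latest value it computed from Temporal Cloud. The path, metric name, namespace, label, and values are illustrative assumptions, and the real payload is JSON; it is rendered here as YAML for readability.

```
# Illustrative request from the HPA controller (names are assumptions):
#   GET /apis/external.metrics.k8s.io/v1beta1/namespaces/default/temporal_cloud_sync_match_rate
# Sketch of the response body:
kind: ExternalMetricValueList
apiVersion: external.metrics.k8s.io/v1beta1
items:
  - metricName: temporal_cloud_sync_match_rate
    metricLabels:
      temporal_namespace: xyz.123   # label carried over from the PromQL result
    timestamp: "2024-09-20T00:00:00Z"
    value: "1250m"                  # 1.25, i.e. the Sync Match Rate is degrading
```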

## Setup

### Prerequisites
@@ -41,7 +49,7 @@ sum by(temporal_namespace) (
temporal_cloud_v0_poll_success_sync_count{}[1m]
)
)
-
/
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{}[1m]
@@ -51,27 +59,40 @@ sum by(temporal_namespace) (

__After__

We've made two important changes here: (1) we've swapped the places of the two underlying metrics to invert the resulting number so it will now be positive and increase as the Sync Match Rate falls, (2) use clamp_min to set a lower bound of zero, and (3) we default the resulting value to zero in the event no data points are available within the specified time window.
We've made two important changes here: (1) we've swapped the two underlying metrics to invert the resulting number, so it is now positive and increases as the Sync Match Rate falls, and (2) we default the resulting value to `1` when no data points are available within the specified time window.

The result is a decimal that starts at `1` when the Sync Match Rate is perfect and rises as the Sync Match Rate declines. For example, if only 8 of every 10 successful polls are sync matches, the metric evaluates to `10 / 8 = 1.25`.

```
sum(
clamp_min(
(
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{}[1m]
)
)
-
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{}[1m]
)
)
),
0
(
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{
temporal_namespace="bitovi.x72yu"
}[1m]
)
)
) or vector(0)
/
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{
temporal_namespace="bitovi.x72yu"
}[1m]
)
)
)
unless
(
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{
temporal_namespace="bitovi.x72yu"
}[1m]
)
) == 0
)
or label_replace(vector(1), "temporal_namespace", "bitovi.x72yu", "", "")

```

### Step 3: HPA
@@ -175,8 +196,8 @@ You can adjust how quickly the cluster scales our workers up and down.
temporal_namespace: xyz.123
target:
type: Value
# Scale up when the target metric exceeds 50 milli values (0.05)
value: 50m
# Scale up when the target metric exceeds 1500 milli values (1.5)
value: 1500m
behavior:
scaleUp:
# The highest value in the last 10 seconds will be used to determine the need to scale up
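
For reference, here is a hedged sketch of what the full `behavior` block from the truncated excerpt above might look like, using the pre-change numbers that appear elsewhere in this PR; treat the values as illustrative rather than recommended settings.

```
behavior:
  scaleUp:
    # Smooth scale-up decisions over a 10-second window
    stabilizationWindowSeconds: 10
    selectPolicy: Max
    policies:
      # Add at most 5 pods every 10 seconds
      - type: Pods
        value: 5
        periodSeconds: 10
  scaleDown:
    # Smooth scale-down decisions over a 60-second window
    stabilizationWindowSeconds: 60
    selectPolicy: Max
    policies:
      # Remove at most 3 pods every 30 seconds
      - type: Pods
        value: 3
        periodSeconds: 30
```
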
6 changes: 3 additions & 3 deletions chart/templates/demo-worker.yaml
@@ -22,7 +22,7 @@ spec:
imagePullPolicy: Always
ports: []
volumeMounts:
- name: mtls-certs
- name: tcm-mtls-certs
mountPath: "/app/certs"
readOnly: true
env:
@@ -37,7 +37,7 @@ spec:
- name: TEMPORAL_QUEUE
value: autoscaler_demo
volumes:
- name: mtls-certs
- name: tcm-mtls-certs
secret:
secretName: mtls-certs
secretName: tcm-mtls-certs
{{- end }}
6 changes: 3 additions & 3 deletions chart/templates/deployment.yaml
@@ -36,7 +36,7 @@ spec:
- mountPath: /var/run/serving-cert
name: serving-cert
readOnly: false
- name: mtls-certs
- name: tcm-mtls-certs
mountPath: "/app/certs"
readOnly: true
- name: config
@@ -59,9 +59,9 @@ spec:
emptyDir: {}
- name: serving-cert
emptyDir: {}
- name: mtls-certs
- name: tcm-mtls-certs
secret:
secretName: mtls-certs
secretName: tcm-mtls-certs
- name: config
configMap:
name: adapter-configuration
18 changes: 9 additions & 9 deletions chart/templates/hpa.yaml
@@ -8,8 +8,8 @@ spec:
apiVersion: apps/v1
kind: Deployment
name: {{ .Values.worker.deployment }}
minReplicas: 1
maxReplicas: 20
minReplicas: 2
maxReplicas: 50
metrics:
- type: External
external:
@@ -20,19 +20,19 @@ spec:
temporal_namespace: "{{ .Values.temporal.namespace }}"
target:
type: Value
value: 50m
value: 1500m
behavior:
scaleUp:
stabilizationWindowSeconds: 10
stabilizationWindowSeconds: 5
selectPolicy: Max
policies:
- type: Pods
value: 5
periodSeconds: 10
value: 10
periodSeconds: 5
scaleDown:
stabilizationWindowSeconds: 60
stabilizationWindowSeconds: 5
selectPolicy: Max
policies:
- type: Pods
value: 3
periodSeconds: 30
value: 10
periodSeconds: 5
2 changes: 1 addition & 1 deletion chart/templates/mtls-certificates.yaml
@@ -5,5 +5,5 @@ data:
kind: Secret
metadata:
creationTimestamp: null
name: mtls-certs
name: tcm-mtls-certs
namespace: {{ .Release.Namespace }}
Binary file added img/diagram.png
45 changes: 16 additions & 29 deletions sample-config.yaml
@@ -8,35 +8,22 @@ temporal_cloud:
metrics:
temporal_cloud_sync_match_rate:
query: >
sum(
clamp_min(
(
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{}[1m]
)
)
-
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{}[1m]
)
)
),
0
(
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{
temporal_namespace="123.xyz",
task_queue="hello_world"
}[1m]
)
)
) or vector(0)
temporal_cloud_service_latency:
query: >
sum(
clamp_min(
sum by(temporal_namespace) (
rate(temporal_cloud_v0_service_latency_count{}[1m])
/
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{
temporal_namespace="123.xyz",
task_queue="hello_world"
}[1m]
)
-
sum by(temporal_namespace) (
rate(temporal_cloud_v0_service_latency_sum{}[1m])
),
0
)
) or vector(0)
) or vector(1)