[BUG] Agent resources get modified for no apparent reason #8812

Open
1 task done
kong62 opened this issue Dec 31, 2024 · 4 comments

kong62 commented Dec 31, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

DeepFlow Component

Agent

What you expected to happen

Modified the resource configuration:

```
# vim deepflow-6.6.018-config.yaml
  resources:
    limits:
      cpu: 2000m
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 256Mi
# vim agent-group-config.yaml
max_cpus: 2
max_millicpus: 2000
max_memory: 2048
```

Right after the upgrade, the new settings took effect:
```
# kubectl describe ds -n deepflow deepflow-agent
  Containers:
   deepflow-agent:
    Image:      harborproxy.hupu.io/proxy/deepflow-ce/deepflow-agent:v6.6
    Port:       38086/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     2
      memory:  2Gi
    Requests:
      cpu:     100m
      memory:  256Mi
```

After a while the limits were changed back again, although I noticed that the memory under requests had changed from 128 to 512:
```
# kubectl describe ds -n deepflow deepflow-agent
    Limits:
      cpu:     1
      memory:  768Mi
    Requests:
      cpu:     100m
      memory:  256Mi
```

### How to reproduce

To reproduce:
1. Via edit
```
# kubectl edit ds -n deepflow deepflow-agent
```

2. Via helm upgrade

### DeepFlow version

_No response_

### DeepFlow agent list

v6.6.5

### Kubernetes CNI

_No response_

### Operation-System/Kernel version

3.10

### Anything else

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/deepflowio/deepflow/blob/main/CODE_OF_CONDUCT.md)

kong62 added the bug label on Dec 31, 2024
1473371932 self-assigned this on Jan 2, 2025

1473371932 commented Jan 2, 2025

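Since v6.6, resource limits for deepflow-agent pods are set through agent-group-config. After configuring it, you can check what is actually delivered to a specific agent with `deepflow-ctl trisolaris.agent-check config --cip [$AGENT_IP] --cmac [$AGENT_MAC] --cid [$DOMAIN_ID] --kwp 1 --gid [$AGENT_GROUP_ID]`.

Also, we have now released the v6.6 LTS version. You can upgrade by changing the image tag of the DeepFlow applications, for example from the current v6.6.5 to v6.6. After upgrading, update deepflow-ctl as well by replacing stable with v6.6 in the download URL.

In the v6.6 LTS version we reworked agent-group-config in detail; setting max_millicpus alone is enough.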


kong62 commented Jan 3, 2025

@1473371932
So could it be that agent-group-config is not taking effect?

# deepflow-ctl agent-group list
NAME      ID             
default   g-660ee5e2c4   

# deepflow-ctl agent-group-config list
NAME      AGENT_GROUP_ID   
default   g-660ee5e2c4   

# deepflow-ctl agent-group-config list g-660ee5e2c4 -o yaml
vtap_group_id: g-660ee5e2c4
max_collect_pps: 1000000
max_npb_bps: 100000
max_cpus: 2
max_millicpus: 2000
max_memory: 2048
capture_packet_size: 65535
log_threshold: 10000
l7_log_packet_size: 1024
l4_log_collect_nps_threshold: 1000000
l7_log_collect_nps_threshold: 1000000
thread_threshold: 1000
process_threshold: 10
log_file_size: 10000

The agent loaded a global configuration here:

# kubectl logs -n deepflow deepflow-agent-crwkg
[2024-12-31 16:59:04.671355 +08:00] INFO [src/config/handler.rs:2564] UserConfig {
    global: Global {
        limits: Limits {
            max_millicpus: 1000,
            max_memory: 805306368,
            max_log_backhaul_rate: 300,
            max_local_log_file_size: 1048576000,
            local_log_retention: 25920000s,
        },
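
The log reports max_memory in bytes and max_millicpus in thousandths of a core, whereas the group config gives max_memory in MiB. A small sketch of the conversion (the function names are made up for illustration) suggests that the logged values correspond to 1 CPU and 768Mi, which matches the limits the DaemonSet was reverted to rather than the 2000 / 2048 set in the group config:

```shell
#!/usr/bin/env bash
# Convert the units printed in the agent log into the units kubectl displays.
# Function names are illustrative only; they are not part of any DeepFlow tool.

# bytes -> MiB (integer division; 1 MiB = 1024 * 1024 bytes)
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# millicpus -> CPU string: whole cores as "N", anything else as "Nm"
millicpus_to_cpu() {
  if (( $1 % 1000 == 0 )); then
    echo $(( $1 / 1000 ))
  else
    echo "${1}m"
  fi
}

# Values taken from the agent log above
echo "agent log:    cpu=$(millicpus_to_cpu 1000) memory=$(bytes_to_mib 805306368)Mi"
# Values taken from the group config (max_memory there is already in MiB)
echo "group config: cpu=$(millicpus_to_cpu 2000) memory=2048Mi"
```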

The server log shows that no agent config can be found:

# kubectl logs -n deepflow deepflow-server-5c57b9dd88-tjf2b | grep 'agent config'
2024-12-31 16:40:06.137 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 16:41:27.475 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 16:41:27.543 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 16:45:06.165 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 16:50:06.164 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 16:55:06.162 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 17:00:06.163 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 17:05:06.164 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 17:10:06.162 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 17:15:06.163 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 17:20:06.163 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 17:25:06.164 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 17:30:06.163 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
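
To see at a glance which groups are affected, the errors can be counted per lcuuid. This is a rough sketch using only standard text tools; the function name is hypothetical and the pattern assumes the log line format shown above:

```shell
#!/usr/bin/env bash
# Count "not found agent config" errors per vtap group lcuuid.
# Reads log lines on stdin and prints "<count> <lcuuid>" for each group.
count_missing_config() {
  grep 'not found agent config' \
    | sed -n 's/.*vtap group lcuuid(\([0-9a-f-]*\)).*/\1/p' \
    | sort | uniq -c | awk '{print $1, $2}'
}

# Two sample lines copied from the log above
sample='2024-12-31 16:40:06.137 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config
2024-12-31 16:41:27.475 [ERRO] [trisolaris.vtap] vtap.go:568 vtap group lcuuid(660ee2d8-c403-11ef-ba73-4adcd3115ab5) not found agent config'

printf '%s\n' "$sample" | count_missing_config
```

Against a live cluster the same function could be fed from `kubectl logs -n deepflow deepflow-server-5c57b9dd88-tjf2b`. In this thread every error carries the same lcuuid, which is the LCUUID of the default agent-group listed further down.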

Here VTAP_LCUUIDS are presumably the IDs of the nodes, SHORT_UUID is the short ID of the agent-group, and LCUUID is its long ID:

# deepflow-ctl agent-group list -o yaml
- DISABLE_VTAP_LCUUIDS: []
  ID: 1
  LCUUID: 660ee2d8-c403-11ef-ba73-4adcd3115ab5
  LICENSE_FUNCTIONS: []
  NAME: default
  PENDING_VTAP_LCUUIDS: []
  SHORT_UUID: g-660ee5e2c4
  TEAM_ID: 1
  UPDATED_AT: "2024-12-27 11:33:54"
  USER_ID: 1
  VTAP_LCUUIDS:
  - fbbb552c-dd45-427f-b58d-1ba08dfd34b8
  - 8684d066-ef22-4e08-b498-e2b43009d627
  - fbb6a2b6-d51f-4995-a568-b6b9f142d745
  - 3b441f9d-54c5-4dea-a49c-9b46909d0626
  - b4f8b00d-885f-420f-ba94-dc33588ea367
  - 595a994b-982c-4a6d-95f1-f47be4c8e727
  - 1b55322d-85d4-4ee6-be3f-e7914cdbb584
  - ba7bb6de-e1fb-4cc2-b77d-5ddfbd337367
  - 9e807bbf-5928-4abd-b65d-620b384b9455
  - 37031049-c863-4bf3-971d-76c7e3f2b579
  - 7a81e434-57fd-44a4-97c4-77c7bf3a2028
  - 562a61b6-3b49-47cb-b291-ff73572ef4ef
  - 32bb26f7-c25c-4a9c-a5d6-29d729097851
  - a9e7847c-3f3d-4fc3-804b-7b05508df23f
  - 1975390d-1ced-4c18-90c3-05433cf371c0
  - a1274ad7-1cc2-4eda-afed-30e2483fe9d0
  - b04ed8a5-d333-47f4-83b3-59f221bef1a9
  - 149a3ea6-f084-42d2-a18c-b21b5d9b8681
  - e0d527f7-2d72-4268-9222-9590f6c75f5b
  - cd6d2ede-5655-4bfb-8d1c-27ffb4d5d1c6

# deepflow-ctl trisolaris.check config
request trisolaris(10.45.10.6:30035), params({CtrlIP: CtrlMac: GroupID: ClusterID: TeamID: RpcIP:10.45.10.6 RpcPort:30035 Type:trident PluginType:wasm PluginName: OrgID:1})
config:
enabled: false
max_memory: 768
platform_enabled: true
max_escape_seconds: 3600
trident_type: TT_PUBLIC_CLOUD

revision: 
self_update_url: 

Analyzerconfig:
<nil>
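
The request above was sent with CtrlIP, CtrlMac and GroupID empty, so it may not reflect what any particular agent receives. As a sketch, the per-agent query quoted earlier in this thread can be assembled as below. The helper only prints the command (a dry run); the IP, MAC and domain ID are placeholders, and the flags are copied from that comment without being verified against a specific deepflow-ctl version:

```shell
#!/usr/bin/env bash
# Print (dry run) the per-agent config query quoted earlier in this thread.
# agent_check_cmd is a hypothetical helper, not part of deepflow-ctl.
agent_check_cmd() {
  local agent_ip=$1 agent_mac=$2 domain_id=$3 group_id=$4
  printf 'deepflow-ctl trisolaris.agent-check config --cip %s --cmac %s --cid %s --kwp 1 --gid %s\n' \
    "$agent_ip" "$agent_mac" "$domain_id" "$group_id"
}

# Placeholder agent IP, MAC and domain ID; the group ID is the one listed above
agent_check_cmd 10.0.0.10 52:54:00:00:00:01 d-example g-660ee5e2c4
```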


1473371932 commented Jan 8, 2025


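My guess is that this is caused by using the two CPU limit parameters together. I tested with the latest LTS image, registry.cn-beijing.aliyuncs.com/deepflow-ce/deepflow-agent:v6.6. After updating the group config: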

global:
  limits:
    max_millicpus: 2000
    max_memory: 4096

After waiting for a while, the problem did not reproduce:

kubectl describe ds -n deepflow deepflow-agent | grep 'Limits:' -A2
    Limits:
      cpu:     2
      memory:  4Gi

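Could you upgrade? There may have been issues in the earlier minor versions that were fixed in the later LTS release. The image tags of the other DeepFlow components can all be updated to v6.6 as well.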
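
The suspicion that the two CPU limit parameters conflict is a guess rather than a confirmed cause, but it is cheap to check. A minimal sketch, assuming the group config is available as YAML text on stdin (the function name is hypothetical):

```shell
#!/usr/bin/env bash
# Exit 0 if a config dump on stdin sets both max_cpus and max_millicpus.
has_both_cpu_limits() {
  local cfg
  cfg=$(cat)
  grep -Eq '^[[:space:]]*max_cpus:' <<<"$cfg" &&
    grep -Eq '^[[:space:]]*max_millicpus:' <<<"$cfg"
}

# Excerpt of the group config posted earlier in this thread
old_cfg='max_cpus: 2
max_millicpus: 2000
max_memory: 2048'

if printf '%s\n' "$old_cfg" | has_both_cpu_limits; then
  echo "both max_cpus and max_millicpus are set"
fi
```

On a real deployment the input could come from `deepflow-ctl agent-group-config list g-660ee5e2c4 -o yaml`, the command used earlier in this thread.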


kong62 commented Jan 8, 2025

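@1473371932 Thanks for the help. I found the problem: the stable deepflow-ctl at the time was 6.6.5, so its configuration was the old one and did not take effect. After switching to 6.6 it worked.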

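During recent testing I also found the versioning a bit confusing:
- GitHub is still at version 6.6.5.
- The Release Notes in the official docs are at v6.6.9.
- The latest entry on the official release timeline is 7.0.0.
- Some of the official docs and examples use older versions. In particular, for the Alibaba Cloud image registry there is no way to tell what the latest version is; the docs are the only source.
- Images are available with the 6.6 and 6.6.8 tags, but I cannot find 6.6.9. For deepflow-ctl I can only find 6.6 and 6.6.5, and 6.6 keeps being updated, so it is effectively latest.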
