Nacos-Server jraft fails to initialize, causing inconsistent instance counts across cluster nodes; restarting the node does not recover, and in the end the only fix was deleting the data directory #1118
Comments
I reproduced the problem again. After starting the machine, the failing node's jraft log contained nothing useful, but when I logged into the ld (leader) node I found this error: 2024-06-25 00:15:39,043 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.16.7:7848, group: naming_persistent_service] I suspect jraft is the problem.
Our error looks the same as #1096.
This error means the node 10.254.16.7:7848 was removed from the naming_persistent_service group and then shut itself down.
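When a peer has been removed from a group like this, group membership can be acted on from outside via the Nacos raft ops endpoint. A hedged sketch follows; the endpoint path and command name are taken from the Nacos ops documentation as I recall it for 2.x, so verify them against your server version before use. The hosts and group name are the ones from this issue.

```shell
# Hedged sketch: build the URL of the Nacos raft ops endpoint.
# Note the HTTP API port (8848) is used here, not the raft port (7848).
raft_ops_url() {
  printf 'http://%s/nacos/v1/core/ops/raft' "$1"
}

# usage (not executed here; command/value syntax assumed from Nacos ops docs):
# curl -X POST "$(raft_ops_url 10.254.16.7:8848)" \
#   -d 'command=transferLeader&value=naming_persistent_service,10.254.18.46:7848'
```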
Why would the 10.254.16.7 node be removed? This is my startup log.
After the 10.254.16.7 node started, the leader node kept reporting the error above. On the Nacos console, the two existing cluster nodes both showed 65 instances for a service. After I started the 10.254.16.7 follower (call it node 3), node 1 showed 65 instances, node 2 showed 56, and node 3 showed 40, and they never converged. Over time node 2's count kept fluctuating, sometimes 61, sometimes 50. With no other option I shut node 3 (10.254.16.7) down, and nodes 1 and 2 recovered within a short time, again showing 65 instances.
You should probably ask nacos about that, because jraft never shuts a node down on its own.
Cluster environment: 3 Aliyun ECS instances, 16C 32G
Nacos-Server version: 2.1.2
Symptoms:
The 3 Nacos-Server nodes had been running normally for about half a month when one of them had to be restarted because of a memory problem. Call it node 1, and the other two nodes 2 and 3. We restarted node 1 by running the shutdown script in the bin directory and then the startup script, and that is when the problem appeared. On the Nacos console, node 1 showed 45 instances for a certain service while nodes 2 and 3 showed 65 (65 was later confirmed to be correct). In other words, node 1's data was wrong, so we checked the logs and found:
Errors in the alipay-jraft log:
2024-06-19 00:16:35,087 WARN Node <naming_persistent_service/10.254.16.7:7848> RequestVote to 10.254.18.46:7848 error: Status[EINTERNAL<1004>: RPC exception:UNKNOWN].
2024-06-19 00:16:35,707 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:16:35,710 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:16:35,707 WARN Node <naming_persistent_service_v2/10.254.16.7:7848> RequestVote to 10.254.18.46:7848 error: Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception].
2024-06-19 00:16:35,710 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:16:38,277 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=11, error=Status[ENOENT<1012>: Peer id not found: 10.254.18.46:7848, group: naming_service_metadata]
2024-06-19 00:18:21,216 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_persistent_service]
2024-06-19 00:18:21,264 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_service_metadata]
2024-06-19 00:18:21,266 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_persistent_service_v2]
2024-06-19 00:18:26,139 WARN Node <naming_instance_metadata/10.254.16.7:7848> RequestVote to 10.254.17.172:7848 error: Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception].
2024-06-19 00:18:26,326 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:26,328 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:26,336 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:28,668 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:31,188 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:31,360 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:31,385 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:31,388 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:33,710 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,225 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,400 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,424 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,449 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:38,786 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=41, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:41,462 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=41, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_service_metadata]
2024-06-19 00:18:41,477 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=41, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_persistent_service_v2]
2024-06-19 00:18:41,530 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=51, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_instance_metadata]
2024-06-19 00:19:36,094 WARN ThreadId: Replicator [state=Destroyed, statInfo=<running=IDLE, firstLogIndex=171, lastLogIncluded=0, lastLogIndex=171, lastTermIncluded=0>, peerId=10.254.18.46:7848, waitId=2, type=Follower] already destroyed, ignore error code: 1001
2024-06-19 00:19:36,143 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:DEADLINE_EXCEEDED: deadline exceeded after 2.499983956s. [remote_addr=10.254.17.172/10.254.17.172:7848]]
2024-06-19 00:19:36,272 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:DEADLINE_EXCEEDED: deadline exceeded after 2.499984812s. [remote_addr=10.254.17.172/10.254.17.172:7848]]
2024-06-19 00:19:36,303 WARN ThreadId: Replicator [state=Destroyed, statInfo=<running=IDLE, firstLogIndex=3446087, lastLogIncluded=0, lastLogIndex=3446087, lastTermIncluded=0>, peerId=10.254.18.46:7848, waitId=270, type=Follower] already destroyed, ignore error code: 1001
2024-06-19 00:19:36,501 WARN ThreadId: Replicator [state=Destroyed, statInfo=<running=IDLE, firstLogIndex=72, lastLogIncluded=0, lastLogIndex=72, lastTermIncluded=0>, peerId=10.254.18.46:7848, waitId=2, type=Follower] already destroyed, ignore error code: 1001
[admin@b01_nacos_service_test_hk logs]$ cat alipay-jraft.log|grep ERROR
2024-06-19 00:16:35,666 ERROR Fail to connect 10.254.18.46:7848, remoting exception: java.util.concurrent.TimeoutException.
2024-06-19 00:18:26,134 ERROR Fail to connect 10.254.17.172:7848, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception.
2024-06-19 00:18:26,165 ERROR Fail to connect 10.254.17.172:7848, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception.
2024-06-19 00:18:26,165 ERROR Fail to init sending channel to 10.254.17.172:7848.
2024-06-19 00:18:26,165 ERROR Fail to start replicator to peer=10.254.17.172:7848, replicatorType=Follower.
2024-06-19 00:18:26,165 ERROR Fail to add a replicator, peer=10.254.17.172:7848.
Errors in the protocol-raft log:
2024-06-19 00:16:35,175 ERROR Fail to refresh route configuration for group : naming_service_metadata, status is : Status[UNKNOWN<-1>: io.grpc.StatusRuntimeException: UNKNOWN]
2024-06-19 00:18:21,467 ERROR Fail to refresh leader for group : naming_instance_metadata, status is : Status[UNKNOWN<-1>: Unknown leader, No nodes in group naming_instance_metadata, Unknown leader]
2024-06-19 00:18:21,469 ERROR Fail to refresh route configuration for group : naming_instance_metadata, status is : Status[ENOENT<1012>: Fail to find node 10.254.17.172:7848 in group naming_instance_metadata]
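With this volume of repeated WARN lines, it helps to aggregate them per peer to see which node is actually unreachable and how bad it got. A small offline sketch (awk field positions are assumed from the sample lines pasted above):

```shell
# Summarize "Fail to issue RPC" WARN lines from an alipay-jraft.log-style
# input: count lines per peer and track the largest consecutiveErrorTimes.
summarize() {
  awk '/Fail to issue RPC to/ {
    peer = $9; sub(/,$/, "", peer)                 # field 9 is "ip:port,"
    n[peer]++
    if (match($0, /consecutiveErrorTimes=[0-9]+/)) {
      # skip the 22-char "consecutiveErrorTimes=" prefix of the match
      v = substr($0, RSTART + 22, RLENGTH - 22) + 0
      if (v > max[peer]) max[peer] = v
    }
  }
  END {
    for (p in n)
      printf "%s count=%d maxConsecutiveErrors=%d\n", p, n[p], max[p]
  }'
}

# usage: cat alipay-jraft.log | summarize
```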
We shut node 1 down for 10 minutes and restarted it again; the problem persisted. Searching the community issues, we found earlier reports very similar to ours, where the fix was to delete the data directory and restart. We did the same, and it did resolve the problem, but how can this be avoided in the first place?
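For reference, the workaround above (stop the node, clear its local raft state, restart) can be sketched as follows. The `NACOS_HOME` default and the `data/protocol/raft` path are assumptions about a standard Nacos 2.x layout, so verify them on your install, and move the directory aside as a backup rather than deleting it outright. Clearing it discards local raft state, so the node re-joins and re-syncs from the healthy majority.

```shell
# Hypothetical recovery sketch for a node whose jraft state is corrupt.
# NACOS_HOME and the raft data path are guesses to verify on your install.
NACOS_HOME="${NACOS_HOME:-/opt/nacos}"
RAFT_DATA="$NACOS_HOME/data/protocol/raft"

if [ -d "$RAFT_DATA" ]; then
  "$NACOS_HOME/bin/shutdown.sh"
  # keep a timestamped backup instead of deleting outright
  mv "$RAFT_DATA" "${RAFT_DATA}.bak.$(date +%s)"
  "$NACOS_HOME/bin/startup.sh" -m cluster
else
  echo "raft data dir not found: $RAFT_DATA (nothing to do)"
fi
```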