When attempting to restore an RKE2 cluster, the restore fails because the rke2-killall.sh script unmounts the Rancher directories.

After the restore is initiated, the restore job expects /var/lib/rancher to be mounted, but rke2-killall.sh explicitly unmounts it, because the unmount command is hardcoded in the script shipped with Kubernetes v1.27.12. The job then tries to run [Applyinator] Command touch [/var/lib/rancher/rke2/server/db/etcd/tombstone], which fails. This leaves the cluster in a broken state, and even a cluster reset does not help in this case.

Error messages (symptoms) such as the following appear in the logs:

level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3 2>/dev/null] finished with err: <nil> and exit code: 127"

Resolution

It is strongly recommended to upgrade to at least Kubernetes v1.27.16, as the issue is addressed starting from that version.

Alternatively, if you are still on v1.27.12, apply the following workaround steps in sequence once the restore first fails:

1. On each Control Plane node, comment out the following single line in the rke2-killall.sh script, which is expected to be under /usr/local/bin:

#do_unmount_and_remove /var/lib/rancher/rke2

2. Run mount -a on each Control Plane node (the mount was removed by the script).

3. Run systemctl restart rancher-system-agent on each node.
This causes the agent to fetch the machine plan and use the script already present on the node, so that the restore can run or proceed successfully.

Cause

The rke2-killall.sh script unmounts the Rancher directories.

Additional Information

https://github.com/harvester/harvester/issues/4695
https://github.com/rancher/rancher/issues/40624

Environment

Rancher v2.8.5 and earlier
RKE2 v1.27 and earlier
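
The workaround steps from the Resolution section could be scripted roughly as follows for a Control Plane node. This is a minimal sketch under stated assumptions, not a supported tool: the function names, the --apply flag, and the .bak backup file are all hypothetical, and the pattern assumes the unmount line may or may not quote its path. On nodes that are not Control Plane nodes, only the systemctl restart step applies.

```shell
#!/bin/sh
# Sketch of the rke2-killall.sh workaround for nodes still on v1.27.12.
# Default location follows the article; override KILLALL if it lives elsewhere.
KILLALL="${KILLALL:-/usr/local/bin/rke2-killall.sh}"

# Step 1: comment out the active line that unmounts /var/lib/rancher/rke2.
# A line that is already commented out does not match, so reruns are safe.
comment_out_unmount() {
    script="$1"
    # Keep one backup of the unmodified script (the .bak name is an assumption).
    [ -e "$script.bak" ] || cp -p "$script" "$script.bak"
    tmp="$script.tmp"
    sed -E "s|^([[:space:]]*)(do_unmount_and_remove[[:space:]]+[\"']?/var/lib/rancher/rke2[\"']?[[:space:]]*)\$|\1#\2|" \
        "$script" > "$tmp" && cat "$tmp" > "$script"   # cat keeps the file mode
    rm -f "$tmp"
}

apply_workaround() {
    comment_out_unmount "$KILLALL"
    # Step 2: remount the filesystems that the script unmounted.
    mount -a
    # Step 3: restart the agent so it fetches the machine plan again.
    systemctl restart rancher-system-agent
}

# Nothing is changed unless the script is called with --apply
# (a hypothetical flag of this sketch).
if [ "${1:-}" = "--apply" ]; then
    apply_workaround
fi
```

Run as root with --apply on each Control Plane node after the restore first fails. Reviewing the resulting rke2-killall.sh by hand before restarting the agent is a sensible precaution, since the exact form of the unmount line is assumed here.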