Troubleshooting
Machine lifecycle
To get an immediate view of the state of MachineDeployments and Nodes, run:
kubectl -n kube-system get machinedeployment,machineset,machine,no -o wide
NAME                                      AGE   DELETED   REPLICAS   AVAILABLEREPLICAS   PROVIDER    OS       VERSION
machinedeployment.cluster.k8s.io/worker   59d             2          2                   openstack   ubuntu   1.30.1

NAME                                          AGE   DELETED   REPLICAS   AVAILABLEREPLICAS   MACHINEDEPLOYMENT   PROVIDER    OS       VERSION
machineset.cluster.k8s.io/worker-677cf94d4d   8d              0                              worker              openstack   ubuntu   1.30.1
machineset.cluster.k8s.io/worker-77c8c559d6   8d              2          2                   worker              openstack   ubuntu   1.30.1

NAME                                             AGE   DELETED   MACHINESET          ADDRESS        NODE                      PROVIDER    OS       VERSION
machine.cluster.k8s.io/worker-77c8c559d6-mknks   8d              worker-77c8c559d6   192.168.1.20   worker-77c8c559d6-mknks   openstack   ubuntu   1.30.1
machine.cluster.k8s.io/worker-77c8c559d6-rfd8r   8d              worker-77c8c559d6   192.168.1.9    worker-77c8c559d6-rfd8r   openstack   ubuntu   1.30.1

NAME                           STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
node/worker-77c8c559d6-mknks   Ready    <none>   8d    v1.30.1   192.168.1.20                 Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.6.33
node/worker-77c8c559d6-rfd8r   Ready    <none>   8d    v1.30.1   192.168.1.9                  Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.6.33
Note:
- When the MachineDeployment has converged, all but one of its managed MachineSets have 0 replicas
- All Machines of a given MachineSet share the same name prefix
- Node names are the same as Machine names
- When the Machines have been fully provisioned, there's a Node object for each Machine
Machine Events
To inspect a specific Machine, check its Events:
kubectl -n kube-system events --for machine/worker-77c8c559d6-wvlcg
LAST SEEN               TYPE     REASON                     OBJECT                            MESSAGE
9m58s                   Normal   Created                    Machine/worker-77c8c559d6-wvlcg   Successfully created instance
7m26s (x5 over 9m53s)   Normal   InstanceFound              Machine/worker-77c8c559d6-wvlcg   Found instance at cloud provider, status: running
6m57s (x2 over 6m59s)   Normal   LabelsAnnotationsUpdated   Machine/worker-77c8c559d6-wvlcg   Successfully updated labels/annotations
Get Node conditions
To see the status of different Node conditions, run:
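One way to do this is with `kubectl describe`, which prints a Conditions table (Ready, MemoryPressure, DiskPressure, PIDPressure, ...); the Node name below is taken from the example output above:

```shell
# Show the Node details, including the Conditions section:
kubectl describe node worker-77c8c559d6-mknks
```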
List Pods running on Node
To list all Pods scheduled on a particular Node, run:
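A field selector on `spec.nodeName` does this across all namespaces (Node name from the example above):

```shell
# List all Pods scheduled on the given Node:
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=worker-77c8c559d6-mknks
```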
Delete Pods running on Node
In case the Node isn't responsive, you may choose to force the immediate deletion of all Pods on a Node:
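A sketch of such a force deletion, again using the example Node name:

```shell
# Immediately delete all Pods on the Node without waiting for graceful termination:
kubectl delete pods --all-namespaces --force --grace-period=0 \
  --field-selector spec.nodeName=worker-77c8c559d6-mknks
```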
Caution
When using the --force flag, Kubernetes does not wait until the Pods and their containers are terminated.
The applications may continue to run!
Under normal circumstances you should rely on Kubelet to gracefully tear down the Pods.
Get remote access using SSH
Prerequisites
- SSH key agent enabled in the cluster
- SSH public key added to the cluster
- SSH client is configured to use the SSH key
- Either of:
  - Floating IP enabled for MachineDeployment
  - Other host available as SSH jump host
- Port 22 accessible (default, see node networking)
Note
Adding an SSH key after a Machine is provisioned is only possible if Kubelet on the Node is running and healthy.
- Get public IP of Node
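With a floating IP enabled, the public address appears as the Node's ExternalIP; a jsonpath query extracts it (Node name from the example above):

```shell
# Print the Node's externally reachable address:
kubectl get node worker-77c8c559d6-mknks \
  -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}'
```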
- Establish SSH session
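A minimal sketch, assuming the default user of the Ubuntu images shown above and an illustrative public IP; with a jump host, add `-J <jump-host>` instead of connecting directly:

```shell
# Connect to the Node via its public IP (address is illustrative):
ssh ubuntu@203.0.113.10
```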
Remote access by escaping from a privileged Pod
Prerequisites
- Kubelet running and healthy
- Cluster network functioning (for forwarding stdin & stdout)
Caution
The Pod has full access to the Node! Make sure to verify the integrity of the tooling and container image that is used!
By creating a Pod with a privileged container that shares the host's PID namespace, you can switch into the kernel namespaces of the host's init process. To do this, you can use a tool like node-shell.
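With the kubectl node-shell plugin (installable e.g. via krew) this is a single command; it spawns the privileged Pod and drops you into a shell on the host:

```shell
# Open a root shell on the Node via a privileged, host-PID Pod:
kubectl node-shell worker-77c8c559d6-mknks
```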
Inspect Kubelet logs
To tail the logs of Kubelet, run:
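On the Node itself (e.g. via SSH or node-shell), Kubelet typically runs as a systemd unit, so its logs can be followed with journalctl:

```shell
# Follow the Kubelet logs on the Node:
journalctl -u kubelet -f
```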
Inspect Node bootstrapping with OpenStack
It's possible to get some output of the initialization process of a Node. These logs may contain valuable information on how far the initialization has progressed, or surface potential errors (e.g. DNS).
To show the logs of a Node using the OpenStack CLI:
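The serial console output of the backing server can be fetched with the OpenStack CLI; the server name matches the Machine/Node name from the examples above:

```shell
# Show the console (boot/cloud-init) output of the server backing the Node:
openstack console log show worker-77c8c559d6-mknks
```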
Node Provisioning
The three most important phases during Node provisioning are:

1. Server creation

   To verify this:
   - Check Machine Events
   - Query the server with the OpenStack CLI
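For example, to check with the OpenStack CLI whether the server exists and has reached the ACTIVE state (server name matches the Machine name):

```shell
# Show name, status, and addresses of the server backing the Machine:
openstack server show worker-77c8c559d6-mknks -c name -c status -c addresses
```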
2. Node initialization

   Issues during this phase can be investigated by inspecting Kubelet logs or the OpenStack console output.
3. Initialize Node daemons
Note
By this point the Node has already been registered with the Kubernetes cluster.
Check:
- Node conditions
- Pods running on Node
- Get their logs
To get the logs of e.g. the Cilium Pod running on the particular Node, run:
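A sketch, assuming the common `k8s-app=cilium` label; the field selector narrows the result to the Pod on the example Node:

```shell
# Find the Cilium Pod running on the Node, then tail its logs:
POD=$(kubectl -n kube-system get pods -l k8s-app=cilium \
  --field-selector spec.nodeName=worker-77c8c559d6-mknks -o name)
kubectl -n kube-system logs -f "$POD"
```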
Node deletion
The following steps help to determine why a Node isn't being deleted:

1. Check if the corresponding Machine has a deletion timestamp (DELETED column):
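Using the same listing as in the Machine lifecycle section, a Machine that is being deleted shows a timestamp in its DELETED column:

```shell
# Inspect the DELETED column of all Machines:
kubectl -n kube-system get machine -o wide
```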
2. Check if the Node cannot be drained
MetaKube will drain the Node, meaning it evicts the Pods running on the Node (with exceptions such as DaemonSet Pods). The eviction API attempts to delete Pods gracefully. A Pod may be forbidden from being evicted, e.g. if a matching PodDisruptionBudget doesn't allow any more disruptions.
You can safely try draining the Node yourself:
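A typical drain invocation (Node name from the example above); if the drain is blocked, the command reports which Pods cannot be evicted and why:

```shell
# Attempt to drain the Node, skipping DaemonSet-managed Pods:
kubectl drain worker-77c8c559d6-mknks --ignore-daemonsets --delete-emptydir-data
```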
It may tell you that some Pods are not safe to evict and why.
To get a list of Pods running on a Node, see above:
Another reason why a Node cannot be drained is that its Pods don't leave the Terminating state. This may be because of an unresponsive Kubelet.
3. Server can't be deleted
If the Node is fully drained, but it still remains in the cluster, there may be issues with deleting the cloud server. In that case, check the Machine Events for errors.
Autoscaling
For issues related to autoscaling see here.