Setting up a private GPU Kubernetes cluster with K0s
2021/07/05
In this guide, we will use k0s to set up a private multi-node GPU Kubernetes cluster with a private Docker registry for ML applications.
Previous guide
It's been over a year since I wrote about setting up a private Kubernetes (k8s) cluster with a private Docker registry and GPU support:
Setting up a private Kubernetes cluster
That guide used kubeadm to set up Kubernetes, which is quite an involved process. I wanted a faster way to do it, and luckily there's a new tool for that - k0s. It fully sets up Kubernetes with just a few commands.
Setup GPU nodes
We're still setting up our own GPU nodes (machines). See the following guide for instructions.
Ubuntu GPU Server Setup
Setup Kubernetes with k0s
This references the official k0s guide and the Mirantis guide, with a few adjustments of my own. Run the following on a node designated as the controller, which also doubles as a worker.
1. Download k0s:
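A sketch of this step, using the official k0s install script (pin a version if you prefer):

    # download and install the k0s binary
    curl -sSLf https://get.k0s.sh | sudo sh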
2. On the controller node, install a k0s controller that also acts as a worker:
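This is roughly the command from the k0s docs:

    # install a controller service that also schedules workloads itself
    sudo k0s install controller --enable-worker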
3. Start the installed k0scontroller service, and enable it (so it auto-runs on node restart). This will take a few minutes to run, so check the status in the meantime.
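Assuming the systemd service created by the install step is named k0scontroller, this looks like:

    sudo systemctl start k0scontroller    # start the service
    sudo systemctl enable k0scontroller   # auto-start on node restart
    sudo systemctl status k0scontroller   # check progress in the meantime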
4. When it's ready, check the Kubernetes cluster. You should see the controller node with status "Ready":
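For example, using the kubectl that ships inside k0s:

    sudo k0s kubectl get nodes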
5. Next, export the kube config so kubectl (installation here) can run without needing k0s.
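A sketch, assuming the default kubeconfig location:

    mkdir -p ~/.kube
    # dump the admin kubeconfig generated by k0s
    sudo k0s kubeconfig admin > ~/.kube/config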
To configure access to the cluster from another machine (e.g. your laptop, or a different node), simply copy the kube config above and replace the server: https://localhost:6443 part with the local IP of the controller node.
Optional: use Lens to monitor your K8s cluster.
6. Now check that kubectl can run without k0s:
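For example:

    kubectl get nodes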
Setup worker nodes
Next, we set up additional worker nodes. We need to generate a token for them to join the Kubernetes cluster.
1. On the controller node, generate a worker token, which is just a base64-encoded kube config:
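Per the k0s docs:

    # prints a long base64-encoded kube config to use as the join token
    sudo k0s token create --role=worker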
2. On a new worker node, install k0s, and run the join-command in a screen:
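A sketch of the worker join; PASTE_TOKEN_HERE is a placeholder for the token generated above:

    # install k0s on the worker node
    curl -sSLf https://get.k0s.sh | sudo sh
    # run the join command inside a screen session so it keeps running
    screen
    sudo k0s worker "PASTE_TOKEN_HERE"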
3. Again, wait for a few minutes for the worker node to be set up. If you wish, set up the kube config on the worker node too as mentioned in the previous section. Check the cluster again:
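Once the worker has joined, you should see it listed:

    kubectl get nodes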
Cluster setup for private Docker registry and GPU support
1. Install a private Docker registry from Helm. This uses 10.96.10.96 as the clusterIP for pushing and pulling images:
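A sketch, assuming the twuni/docker-registry chart (the successor of the deprecated stable/docker-registry) and its service.clusterIP value; the release name registry is arbitrary:

    helm repo add twuni https://helm.twuni.dev
    helm repo update
    # pin the registry Service to the reserved clusterIP on port 5000
    helm install registry twuni/docker-registry \
      --set service.clusterIP=10.96.10.96 \
      --set service.port=5000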
When building, pushing and pulling Docker images on any node in the cluster, use the clusterIP in the image name, i.e. images should be tagged as 10.96.10.96:5000/ORG_NAME/IMAGE_NAME.
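For example, with a hypothetical image my-app:

    docker tag my-app:latest 10.96.10.96:5000/ORG_NAME/my-app:latest
    docker push 10.96.10.96:5000/ORG_NAME/my-app:latest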
2. Install the NVIDIA device plugin on your cluster:
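For example, applying the plugin's DaemonSet manifest (v0.9.0 was current around the time of writing; adjust the version as needed):

    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml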
Now, we're nearly there. Before we can pull images and allocate GPU pods, we need to configure containerd to support the above.
Configuring containerd
K0s uses containerd instead of Docker for its container runtime. In the previous guide, we had to configure /etc/docker/daemon.json to:
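Roughly, something like the following: an insecure-registry entry for the clusterIP, plus nvidia as the default runtime (treat this as a sketch, not the exact file):

    {
      "insecure-registries": ["10.96.10.96:5000"],
      "default-runtime": "nvidia",
      "runtimes": {
        "nvidia": {
          "path": "nvidia-container-runtime",
          "runtimeArgs": []
        }
      }
    }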
Repeat the following for each node in the cluster.
Now, we need to do the equivalent for containerd, in its config file /etc/k0s/containerd.toml. First, get it ready:
1. Create a containerd config file, as per the k0s guide:
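A sketch, assuming the containerd binary bundled by k0s under /var/lib/k0s/bin:

    # generate a default containerd config
    sudo sh -c '/var/lib/k0s/bin/containerd config default > /etc/k0s/containerd.toml'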
2. Update the header of /etc/k0s/containerd.toml to match the k0s paths:
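Per the k0s runtime docs, the top of the file should look something like this:

    version = 2
    root = "/var/lib/k0s/containerd"
    state = "/run/k0s/containerd"

    [grpc]
      # the socket k0s expects containerd to listen on
      address = "/run/k0s/containerd.sock"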
Enable private Docker registry on k0s
Private Docker registry IP
Note that Kubernetes pod image pulls cannot use an FQDN for an in-cluster registry (see link); use a reserved clusterIP instead. Reserved clusterIP for the registry: 10.96.10.96
In /etc/k0s/containerd.toml, we need to add a mirror for our clusterIP, and allow HTTP when pulling images from the clusterIP.
Here's an example snippet added alongside the existing docker mirror:
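A sketch of the relevant registry section (the http:// endpoint is what allows pulling over plain HTTP):

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
        endpoint = ["https://registry-1.docker.io"]
      # mirror for our in-cluster registry; plain HTTP on the reserved clusterIP
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."10.96.10.96:5000"]
        endpoint = ["http://10.96.10.96:5000"]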
Different versions of containerd use different config key formats in the TOML file, so adapt accordingly to the TOML you generated. For example, in an older version you have to replace "io.containerd.grpc.v1.cri" with cri.
Enable GPU support on k0s
We have to configure the default runtime to nvidia-container-runtime to allow GPU containers to be allocated in the Kubernetes cluster.
Repeat the following for each node in the cluster.
Kubernetes NVIDIA GPU device plugin
1. If needed, update your NVIDIA driver first:
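For example, on Ubuntu:

    sudo apt update
    # nvidia-driver-470 is illustrative; pick the driver series matching your GPU
    sudo apt install -y nvidia-driver-470
    sudo reboot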
2. As before, follow the official NVIDIA GPU device plugin guide up to the step that configures the runtime.
3. As explained in this comment, k8s still needs nvidia-container-runtime; install it:
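A sketch based on the nvidia-container-runtime packaging docs for Ubuntu/Debian:

    # add the nvidia-container-runtime apt repository
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
    sudo apt-get update
    sudo apt-get install -y nvidia-container-runtime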
4. In /etc/k0s/containerd.toml, find the plugins.linux runtime value runc and replace it with nvidia-container-runtime, again per the k0s guide:
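In the newer config format, the change looks something like this:

    [plugins."io.containerd.runtime.v1.linux"]
      # was: runtime = "runc"
      runtime = "nvidia-container-runtime"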
Different versions of containerd use different config key formats in the TOML file, so adapt accordingly to the TOML you generated. For example, an older variation is:
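A sketch of that older key format:

    [plugins.linux]
      runtime = "nvidia-container-runtime"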
5. Restart k0s, and describe Kubernetes nodes.
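For example, on the controller:

    sudo systemctl restart k0scontroller   # restart k0s on this node
    kubectl describe nodes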
In the kubectl output, you should now see nvidia.com/gpu: 1 in the Allocatable section.
Restart
Finally, restart k0s for these changes to take effect:
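Assuming the systemd services created by k0s install:

    # on the controller node
    sudo systemctl restart k0scontroller
    # on each worker node
    sudo systemctl restart k0sworker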
You now have a multi-node GPU Kubernetes cluster with a private Docker registry ready for use.