Running WandB on Kubernetes
2020/06/20
Weights & Biases (WandB) is currently my favorite ML tool because it does so many good things I wanted in my workflow. Among these are:
experiment tracking
useful builtin visualizations/dashboard
great one-liner integration (with PyTorch Lightning)
hyperparameter optimization (WandB sweep)
Since I'm running everything on Kubernetes (k8s) now, WandB sweep jobs fit perfectly into the k8s setup. Here's how sweep works:
declare the sweep config to define the search space
initialize a sweep, which will output an agent command
run agents using the command produced to take hyperparameter suggestions from the sweep controller
Step 2 and 3 are usually done manually as shown in the doc. When running on k8s, these should be executed in sequence.
This guide shows how to run a sweep on k8s. We assume basic familiarity with Docker and running things on k8s. First, let's ensure we have the prerequisites for running a sweep job:
a sweep config YAML file
sweep.yaml
a python train script
train.py
for sweep agent to runa Dockerfile to build the image for running the sweep command and sweep agent(s)
a k8s cluster
Running Sweep as a Kubernetes Job
A sweep will run as a Job on k8s. Recall that we want to first run the sweep command, then use its output to run agent(s).
We will use the initContainer to run the sweep command, then save the output to a volume mount. When the job container spawns, we will mount it to the same volume mount to retrieve the agent command, then execute that command. An example k8s Job config sweep-job.yaml
is shown below:
Note that k8s containers cannot yet request fractional GPU. This means that we may want to run multiple agents in a single container. The example above runs 2 agents as shown in the container command ...for i in {1..2}
and the requested number of 2 cpus. Feel free to change those.
To deploy this, simply run:
Users familiar with Helm may also templatize the job config, for instance the env
variable dev
or live
, the number of agents, and the container resources. If you know Helm, this is straightforward to do.
Last updated