Build a custom Hadron image with NVIDIA drivers
Hadron Linux is a musl-based, from-scratch distribution. NVIDIA's pre-built driver containers target specific glibc distributions (Ubuntu, RHEL, etc.) and do not support Hadron out of the box. The correct approach is to compile NVIDIA open kernel modules against the exact Hadron kernel version and bake them — along with the NVIDIA userspace tools and a glibc shim — into a custom OCI image.
The kairos-io/hadron repository provides
examples/add-packages/Dockerfile.nvidia, which does exactly this.
Choosing the right base image
The hadron-extension final stage uses BASE_IMAGE as its base. The right value depends on
what you plan to do with the image:
| Use case | BASE_IMAGE |
|---|---|
| Build a bootable ISO / sysextension with Auroraboot | ghcr.io/kairos-io/hadron |
| Upgrade a running Kairos node (via NodeOpUpgrade) | The full Kairos image from quay.io/kairos |
When upgrading a running node with the Kairos operator, the upgrade pod runs kairos-agent
from inside your image. The raw ghcr.io/kairos-io/hadron base does not include
kairos-agent; using it as the base will fail at upgrade time with
kairos-agent: command not found.
To find the correct full Kairos image tag for your running node:
kairos-agent upgrade list-releases | grep hadron
Example output:
quay.io/kairos/hadron:v0.0.4-standard-amd64-generic-v4.0.3-k3s-v1.35.2-k3s1
Split that reference at its last colon: the part before it is your BASE_IMAGE, the part after it is your BASE_IMAGE_TAG.
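The split can be done with plain shell parameter expansion. This is a minimal sketch, assuming the reference always carries a tag and no digest; IMAGE_REF is an illustrative variable name, not something the Dockerfile reads.

```shell
# IMAGE_REF is a hypothetical variable holding one line of list-releases output.
IMAGE_REF="quay.io/kairos/hadron:v0.0.4-standard-amd64-generic-v4.0.3-k3s-v1.35.2-k3s1"

# Cut at the last colon so a registry port (host:5000/...) stays in BASE_IMAGE.
BASE_IMAGE="${IMAGE_REF%:*}"
BASE_IMAGE_TAG="${IMAGE_REF##*:}"

echo "BASE_IMAGE=${BASE_IMAGE}"
echo "BASE_IMAGE_TAG=${BASE_IMAGE_TAG}"
```

The two variables can then be passed to the build as --build-arg values.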
Build arguments
| Argument | Purpose |
|---|---|
| HADRON_VERSION | Hadron release tag, used for the hadron-toolchain image. Must match the kernel on your target node. |
| BASE_IMAGE | Registry + image name for the final base (see table above). |
| BASE_IMAGE_TAG | Full tag of the base image (may differ from HADRON_VERSION for full Kairos images). |
| NVIDIA_VERSION | NVIDIA open-gpu-kernel-modules version (e.g. 580.126.20). |
| JOBS | Parallelism for compilation ($(nproc) on native builds). |
| KERNEL_ARCH | Kernel ARCH value; must be x86_64 (not amd64) for x86_64 targets. |
HADRON_VERSION and BASE_IMAGE_TAG serve different purposes. HADRON_VERSION pins the
toolchain (and thus the kernel version the modules are compiled against). BASE_IMAGE_TAG is
the tag of the OS image layered into the final output. Keep HADRON_VERSION matching the
Hadron flavor release (e.g. v0.0.4); BASE_IMAGE_TAG may be the longer Kairos release tag.
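A quick pre-build sanity check can catch a toolchain/base mismatch before a long compile. The sketch below assumes that full Kairos tags begin with the Hadron release, as the example tag in this guide does; check_versions is a hypothetical helper, not part of the Hadron repository.

```shell
# check_versions is a hypothetical helper, not part of the Hadron repository.
# Assumption: full Kairos tags start with the Hadron release, as in
# v0.0.4-standard-amd64-generic-v4.0.3-k3s-v1.35.2-k3s1.
check_versions() {
    hadron_version="$1"
    base_image_tag="$2"
    case "$base_image_tag" in
        "$hadron_version" | "$hadron_version"-*) return 0 ;;
        *)
            echo "BASE_IMAGE_TAG '$base_image_tag' does not start with '$hadron_version'" >&2
            return 1
            ;;
    esac
}

check_versions "v0.0.4" "v0.0.4-standard-amd64-generic-v4.0.3-k3s-v1.35.2-k3s1" \
    && echo "versions consistent"
```

If your registry uses a different tag scheme, the prefix assumption does not hold and the check should be adapted or skipped.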
What the image contains
The build produces a Kairos OS image with these additions on top of the base:
| Component | Source | Purpose |
|---|---|---|
| NVIDIA open kernel modules (.ko.zst) | Compiled from source against Hadron kernel | modprobe nvidia |
| nvidia-smi, nvidia-cuda-mps-* | NVIDIA redistributable driver | GPU management tools |
| nvidia-modprobe | NVIDIA redistributable | Creates /dev/nvidia* device nodes at boot |
| glibc runtime (libc.so.6, ld-linux-*.so) | Ubuntu 24.04 | Lets glibc-linked NVIDIA binaries run on musl |
| NVIDIA compute libraries (libcuda.so, libnvidia-ml.so, …) | NVIDIA redistributable | CUDA / ML frameworks |
| NVIDIA firmware | NVIDIA redistributable | GPU firmware blobs |
| /etc/modprobe.d/blacklist-nouveau.conf | Generated | Prevents nouveau from loading |
| /etc/udev/rules.d/71-nvidia.rules | Generated | Triggers nvidia-modprobe on module load |
nvidia-modprobe and the udev rule are required
The NVIDIA kernel modules load at boot, but on Hadron (a from-scratch distro) the standard
distro udev rules for creating NVIDIA device nodes are not present. Without nvidia-modprobe
being called after the modules load, /dev/nvidia0, /dev/nvidia-uvm and friends are never
created. The result is that nvidia-smi starts but reports "couldn't communicate with the
NVIDIA driver" even though lsmod | grep nvidia shows the modules as loaded.
The image installs two udev rules in /etc/udev/rules.d/71-nvidia.rules:
SUBSYSTEM=="module", ACTION=="add", KERNEL=="nvidia", RUN+="/usr/bin/nvidia-modprobe -c 0"
SUBSYSTEM=="module", ACTION=="add", KERNEL=="nvidia_uvm", RUN+="/usr/bin/nvidia-modprobe -u"
Do not combine the flags into a single nvidia-modprobe -c 0 -u call: -u means "act
on the UVM module instead of the GPU module", so it suppresses /dev/nvidia0 creation
entirely and the GPU stays inaccessible to userspace.
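To tell this failure apart from a module that never loaded, it helps to check for the device nodes directly. The following is a rough sketch: missing_nvidia_nodes is a hypothetical helper, and it only looks for the two nodes named in this guide (a multi-GPU host would also have /dev/nvidia1 and so on).

```shell
# missing_nvidia_nodes is a hypothetical helper. It prints every expected
# device node that is absent under the given directory (default: /dev).
missing_nvidia_nodes() {
    dev_dir="${1:-/dev}"
    for node in nvidia0 nvidia-uvm; do
        [ -e "${dev_dir}/${node}" ] || echo "${dev_dir}/${node}"
    done
}

# Modules loaded but nodes missing points at the udev rules or nvidia-modprobe.
if lsmod 2>/dev/null | grep -q '^nvidia'; then
    missing="$(missing_nvidia_nodes)"
    [ -z "$missing" ] || echo "modules loaded, but missing: $missing"
fi
```

The directory argument exists only so the helper can be exercised against a scratch directory; on a node you would call it without arguments.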
On Hadron the filesystem layout is:
/sbin -> usr/bin
/usr/sbin -> bin
Both /sbin/* and /usr/sbin/* resolve to /usr/bin/*. Never create a real
/usr/sbin/ directory in a builder stage that you COPY --link into the final image — for
example with mkdir -p ${OUTPUT}/usr/sbin && cp something ${OUTPUT}/usr/sbin/.... BuildKit's
--link mode overlays the directory entry and replaces the base image's /usr/sbin → bin
symlink with your (almost-empty) real directory, hiding every other binary normally reachable
through /usr/sbin (modprobe, init, systemctl, iptables, …).
The breakage is silent and far-reaching:
- The kernel reads /proc/sys/kernel/modprobe (= /usr/sbin/modprobe) to handle request_module(), so module auto-loading silently fails. Cilium then crashes with failed to add veth pair: operation not supported, the node.cilium.io/agent-not-ready:NoSchedule taint sticks, the GPU operator's daemonsets never schedule, and nvidia.com/gpu never gets advertised.
- SSH closes connections after key acceptance because PAM reaches pam_access.so / pam_time.so while the helper binaries they need are hidden.
The included Dockerfile places ldconfig at ${OUTPUT}/usr/bin/ldconfig for exactly this
reason: /usr/bin is a real directory in the base image and merges cleanly, and
/sbin/ldconfig, /usr/sbin/ldconfig and /usr/bin/ldconfig all still resolve to it.
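A guard in the builder stage can catch this mistake before the image is pushed. This is a sketch under assumptions: shadowing_dirs is a hypothetical helper, and the path list covers only the two symlinks described above.

```shell
# shadowing_dirs is a hypothetical helper: it lists paths under an output
# tree that are real directories but are symlinks in the Hadron base image.
shadowing_dirs() {
    output="$1"
    for path in sbin usr/sbin; do
        if [ -d "${output}/${path}" ] && [ ! -L "${output}/${path}" ]; then
            echo "${path}"
        fi
    done
}

# Demonstration on a scratch tree that repeats the mistake described above.
demo="$(mktemp -d)"
mkdir -p "${demo}/usr/bin" "${demo}/usr/sbin"
shadowing_dirs "${demo}"   # prints: usr/sbin
rm -rf "${demo}"
```

In a Dockerfile RUN step, a check such as `[ -z "$(shadowing_dirs "${OUTPUT}")" ]` would fail the build whenever the output tree contains a real directory at one of those paths.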
Why ldconfig ships in the image at all
The NVIDIA container runtime hook bind-mounts the host's /sbin/ldconfig into every GPU
container so the container can rebuild its ld.so.cache. Hadron is musl-based and has no
glibc ldconfig, so without this file every GPU pod fails to start with
stat /sbin/ldconfig: no such file or directory from the OCI createContainer hook.
The Dockerfile copies /sbin/ldconfig.real (the real glibc binary) from the Ubuntu 24.04
builder, not /sbin/ldconfig — the latter is a 387-byte wrapper script that delegates
to /sbin/ldconfig.real, which does not exist on Hadron.
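One way to confirm that the copied file is the real binary rather than the wrapper is to look at its first four bytes. The sketch below assumes head and od are available where you run it (for example in the Ubuntu builder stage); is_elf is a hypothetical helper.

```shell
# is_elf is a hypothetical helper: succeeds when the file starts with the
# ELF magic bytes 7f 45 4c 46, fails for shell wrappers and missing files.
is_elf() {
    [ -f "$1" ] || return 1
    magic="$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')"
    [ "$magic" = "7f454c46" ]
}

# Report on whatever ldconfig the current system exposes, if any.
if is_elf /sbin/ldconfig; then
    echo "/sbin/ldconfig is an ELF binary"
else
    echo "/sbin/ldconfig is missing or not an ELF binary"
fi
```

On stock Ubuntu 24.04 the second message is the expected result, since /sbin/ldconfig is the wrapper script there.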
Building the image
Native Linux x86_64 (local build)
HADRON_VERSION="v0.0.4"
NVIDIA_VERSION="580.126.20"
BASE_IMAGE="quay.io/kairos/hadron"
BASE_IMAGE_TAG="v0.0.4-standard-amd64-generic-v4.0.3-k3s-v1.35.2-k3s1"
IMAGE="my-registry.example.com/myteam/hadron-nvidia:${HADRON_VERSION}"
docker buildx build \
  -f examples/add-packages/Dockerfile.nvidia \
  --build-arg HADRON_VERSION="${HADRON_VERSION}" \
  --build-arg BASE_IMAGE="${BASE_IMAGE}" \
  --build-arg BASE_IMAGE_TAG="${BASE_IMAGE_TAG}" \
  --build-arg NVIDIA_VERSION="${NVIDIA_VERSION}" \
  --build-arg KERNEL_ARCH="x86_64" \
  --build-arg JOBS="$(nproc)" \
  --target hadron-extension \
  -t "${IMAGE}" \
  --push \
  .
The build downloads the kernel source (~100 MB) and compiles NVIDIA open kernel modules from source — expect 15–30 minutes on a 4-core machine.
GitHub Actions (for Apple Silicon or CI)
Building NVIDIA kernel modules via QEMU emulation on non-native hosts is impractical. GitHub
Actions provides free x86_64 runners with docker buildx pre-installed.
Create .github/workflows/build-hadron-nvidia.yml in your fork of kairos-io/hadron:
name: Build Hadron+NVIDIA image
on:
  workflow_dispatch:
    inputs:
      hadron_version:
        description: Hadron release tag
        default: v0.0.4
        required: true
      nvidia_version:
        description: NVIDIA open-gpu-kernel-modules version
        default: '580.126.20'
        required: true
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          file: examples/add-packages/Dockerfile.nvidia
          target: hadron-extension
          build-args: |
            HADRON_VERSION=${{ inputs.hadron_version }}
            NVIDIA_VERSION=${{ inputs.nvidia_version }}
            JOBS=4
            KERNEL_ARCH=x86_64
            BASE_IMAGE=quay.io/kairos/hadron
            BASE_IMAGE_TAG=v0.0.4-standard-amd64-generic-v4.0.3-k3s-v1.35.2-k3s1
          push: true
          tags: ghcr.io/${{ github.repository_owner }}/hadron-nvidia:${{ inputs.hadron_version }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
Trigger it from the Actions tab → Run workflow, then make the resulting GHCR package public so your cluster nodes can pull it without credentials.
Upgrading a running Kairos node with NodeOpUpgrade
Use the Kairos operator to upgrade a node in-place. The operator creates a privileged pod on
the target node that runs kairos-agent from inside the container — this is important
because it creates the squashfs for the new active partition directly from the running container
filesystem, preserving all symlinks (including /boot/vmlinuz) correctly. Running
kairos-agent upgrade --source oci: directly on the host unpacks OCI layers separately and
can break those symlinks, causing a GRUB boot failure (invalid magic number).
apiVersion: operator.kairos.io/v1alpha1
kind: NodeOpUpgrade
metadata:
  name: upgrade-worker-hadron-nvidia
  namespace: default
spec:
  image: ghcr.io/<your-org>/hadron-nvidia:v0.0.4
  upgradeActive: true
  upgradeRecovery: false
  force: true
  nodeSelector:
    matchLabels:
      kubernetes.io/hostname: <your-gpu-node-name>
force: true is required
The operator compares the version string in /etc/kairos-release of the running node against
the image. Because the custom NVIDIA image is based on the same Kairos release as the running
node, the versions match and the operator exits with "Up to date" without writing anything.
force: true bypasses the version check and always performs the upgrade.
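If you upgrade several nodes, filling in the two placeholders by hand gets error-prone. The sketch below generates the manifest shown above from an image and a node name; render_nodeopupgrade, the upgrade-<node> naming scheme, and the sample values are illustrative assumptions, not operator conventions.

```shell
# render_nodeopupgrade is a hypothetical helper that fills in the image and
# node-name placeholders of the NodeOpUpgrade manifest from this guide.
render_nodeopupgrade() {
    image="$1"
    node="$2"
    cat <<EOF
apiVersion: operator.kairos.io/v1alpha1
kind: NodeOpUpgrade
metadata:
  name: upgrade-${node}
  namespace: default
spec:
  image: ${image}
  upgradeActive: true
  upgradeRecovery: false
  force: true
  nodeSelector:
    matchLabels:
      kubernetes.io/hostname: ${node}
EOF
}

# Sample values; replace them with your own image and node name.
render_nodeopupgrade "ghcr.io/example-org/hadron-nvidia:v0.0.4" "gpu-worker-1"
```

The output can be redirected to nodeopupgrade.yaml or piped straight into kubectl apply -f -.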
Apply it and watch progress:
kubectl apply -f nodeopupgrade.yaml
kubectl get pods -A | grep upgrade
The node will reboot once. After it comes back, verify NVIDIA is functional:
ssh kairos@<node-ip> "nvidia-smi"
Expected output shows the GPU name, driver version 580.126.20, and temperature.
Installing the NVIDIA GPU Operator (driver-less mode)
With NVIDIA kernel modules already in the OS image, install the GPU Operator with
driver.enabled=false so it manages only the device plugin and feature discovery — not the
driver:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version v25.10.1 \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set gfd.enabled=true \
  --set nfd.enabled=true
Keep toolkit.enabled=true: the toolkit container configures the host's containerd /
CRI-O runtime to invoke the NVIDIA OCI hook for GPU containers. With drivers in the OS
image, the toolkit no longer needs to install the driver (driver.enabled=false), but
it still needs to wire up the runtime.
Use GPU Operator v25.10.1, not v26.x. Version 26 ships nvidia-ctk 1.19 which conflicts
with the host toolkit version bundled in the Hadron image, causing CDI hook failures.
Verify the GPU resource is registered:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
A GPU-enabled node should show 1 (or more) in the nvidia.com/gpu column.
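On larger clusters it can be easier to filter that output down to the GPU nodes only. This is a small sketch, assuming the tab-separated "name, count" format produced by the jsonpath query above; gpu_nodes is a hypothetical helper and the sample input stands in for real kubectl output.

```shell
# gpu_nodes is a hypothetical helper: reads "name<TAB>count" lines, as
# produced by the jsonpath query above, and prints nodes with at least one GPU.
gpu_nodes() {
    awk -F '\t' '$2 + 0 > 0 { print $1 }'
}

# Sample input standing in for real kubectl output.
printf 'cp-1\t\ngpu-worker-1\t1\n' | gpu_nodes   # prints: gpu-worker-1
```

Against a live cluster, pipe the kubectl get nodes command into gpu_nodes instead of the printf sample.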
Validating end-to-end with a CUDA workload
The operator's bundled nvidia-cuda-validator pod runs automatically as part of the
nvidia-operator-validator init sequence and is a good first signal: if you see
nvidia-cuda-validator-xxxxx in the gpu-operator namespace with STATUS: Completed, basic
CUDA compute already works.
For an explicit, repeatable test, deploy NVIDIA's vectorAdd sample requesting
nvidia.com/gpu: 1:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubi8
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl wait --for=condition=Ready pod/cuda-vectoradd --timeout=120s || true
kubectl logs cuda-vectoradd
Expected output:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
The pod ends up in Completed, since it is a one-shot CUDA program. If it is stuck in
Pending because the scheduler finds no node for the GPU request, check that the worker
carries the nvidia.com/gpu=true label and that the gpu-operator daemonset pods on it are
all Running.
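For scripted checks (CI, post-upgrade smoke tests) the log can be tested for the pass marker instead of being read by eye. This is a minimal sketch, assuming the sample keeps printing the "Test PASSED" line shown in the expected output; vectoradd_passed is a hypothetical helper.

```shell
# vectoradd_passed is a hypothetical helper: succeeds when the log text on
# stdin contains a line starting with the sample's "Test PASSED" marker.
vectoradd_passed() {
    grep -q '^Test PASSED'
}

# Sample log lines standing in for real kubectl output.
printf 'Copy output data from the CUDA device to the host memory\nTest PASSED\nDone\n' \
    | vectoradd_passed && echo "CUDA workload OK"
```

Against a live cluster, the input would come from kubectl logs cuda-vectoradd.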
Clean up:
kubectl delete pod cuda-vectoradd