Version Upgrade Guide

End-to-end procedure for upgrading all homelab components: Talos, platform tools, cloud services, and workloads.

Version upgrades follow a strict layer order — each layer depends on the one below it being stable first.


Current Version Inventory

Branch: v0.1.6. The versions below are the targets set in code; some take effect on the live cluster only after a Pulumi or kubectl apply.

Infrastructure & Platform

Component                v0.1.6 Target     Live          File
Talos                    v1.13.3           v1.13.3 ✓     core/platform/talos.go:17
Cilium                   1.19.4            1.19.4 ✓      core/platform/cilium.go:30
ArgoCD chart             9.5.15            9.5.15 ✓      core/platform/argocd.go:24
ArgoCD manifests branch  v0.1.6-manifests  patched live  core/platform/argocd.go:175,190
cert-manager             v1.20.2           v1.19.3 ✓     workloads/cdk8s.yaml
k8s API target           1.35.0            1.35.0 ✓      workloads/cdk8s.yaml

Cloud (Bifrost Docker Compose)

Component                v0.1.6 Target  File
Traefik                  v3.7.1         core/cloud/bifrost/docker-compose.yml:35
NetBird server           0.71.4         docker-compose.yml:59
NetBird dashboard        v0.71.4        docker-compose.yml:73
NetBird reverse-proxy    v0.71.4        docker-compose.yml:82
NetBird agent (Bifrost)  0.71.4         docker-compose.yml:94
Authentik                2026.5.2       docker-compose.yml:137,168
PostgreSQL (Authentik)   16.14-alpine   docker-compose.yml:115
Gatus                    v5.36.0        docker-compose.yml (uptime monitoring at uptime.madhan.app)

Secrets & Storage

Component      v0.1.6 Target  File
OpenBao helm   0.28.3         workloads/secrets/openbao.go:42
OpenBao image  2.5.4          workloads/secrets/openbao.go:98
CSI Driver     1.6.0          workloads/cdk8s.yaml
Longhorn       1.11.2         workloads/cdk8s.yaml
CNPG operator  0.28.2         workloads/databases/cnpg.go:31

Workloads

Component                     v0.1.6 Target           File
VictoriaMetrics k8s-stack     0.80.0                  workloads/cdk8s.yaml
VictoriaLogs                  0.12.5                  workloads/cdk8s.yaml
Grafana                       12.4.1                  workloads/cdk8s.yaml
Metrics Server                3.13.0                  workloads/cdk8s.yaml
OTel Collector                0.156.2                 workloads/observability/otel_collector.go:155,213
Harbor                        1.19.0                  workloads/cdk8s.yaml
NVIDIA GPU Operator (import)  v26.3.1                 workloads/cdk8s.yaml
NVIDIA Device Plugin          0.19.1                  workloads/hardware/nvidia_gpu_operator.go:47
DCGM Exporter                 4.8.2                   workloads/hardware/nvidia_gpu_operator.go:99
n8n (8gears OCI)              2.0.1                   workloads/automation/n8n.go:184
Ollama chart                  1.57.0                  workloads/cdk8s.yaml
Ollama image                  0.24.0                  workloads/ai/ollama.go:33
ComfyUI image                 cu128-megapak-20260223  workloads/ai/comfyui.go:73
Headlamp                      0.42.0                  workloads/cdk8s.yaml
Kyverno                       3.8.1                   workloads/cdk8s.yaml
Trivy Operator                0.32.1                  workloads/security/trivy.go:22 + cdk8s.yaml
Falco                         8.0.5                   workloads/security/falco.go:84
Reloader                      2.2.12                  workloads/support/reloader.go:20
NetBird peer (k8s)            0.71.4                  workloads/networking/netbird_peer.go:139,158

Upgrade Order

Components must be upgraded in layer order. Never skip layers.

Talos
  └─► Cilium  (CNI must be compatible with Talos k8s version)
        └─► Gateway API CRDs  (bundled with Cilium chart)
              └─► ArgoCD  (GitOps engine)
                    └─► cert-manager
                          └─► OpenBao + CSI Driver  (secrets layer; all apps depend on this)
                                └─► Longhorn + CNPG  (storage; stateful apps depend on this)
                                      └─► Workloads  (batched by risk)

Bifrost docker-compose  (independent — upgrade anytime)
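
The layer order can also be checked mechanically before a batch of upgrades is committed. The following is a minimal sketch, not part of the repository: the function names layer_of and plan_in_order and the lowercase component names are illustrative assumptions, and anything not listed is treated as a workload.

```shell
# Map a component name to its layer number (names are illustrative).
layer_of() {
  case "$1" in
    talos)              echo 1 ;;
    cilium)             echo 2 ;;
    gateway-api)        echo 3 ;;
    argocd)             echo 4 ;;
    cert-manager)       echo 5 ;;
    openbao|csi-driver) echo 6 ;;
    longhorn|cnpg)      echo 7 ;;
    *)                  echo 8 ;;  # assumption: everything else is a workload
  esac
}

# Succeed only if the components are listed in non-decreasing layer order.
# "bifrost" is skipped because it is independent of the cluster layers.
plan_in_order() (
  prev=0
  for comp in "$@"; do
    [ "$comp" = "bifrost" ] && continue
    cur=$(layer_of "$comp")
    if [ "$cur" -lt "$prev" ]; then
      echo "out of order: $comp (layer $cur) after layer $prev" >&2
      exit 1
    fi
    prev=$cur
  done
)
```

For example, plan_in_order talos cilium argocd grafana succeeds, while plan_in_order cilium talos fails and names the offending component on stderr.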

Phase 0 — Research Latest Versions

Run before touching anything. Record results and substitute into the phases below.

# Infrastructure
curl -s https://api.github.com/repos/siderolabs/talos/releases/latest | jq -r .tag_name
helm repo add cilium https://helm.cilium.io && helm search repo cilium/cilium --versions | head -3
helm repo add argo https://argoproj.github.io/argo-helm && helm search repo argo/argo-cd --versions | head -3

# Add all workload chart repos
helm repo add longhorn    https://charts.longhorn.io
helm repo add vm          https://victoriametrics.github.io/helm-charts
helm repo add grafana     https://grafana-community.github.io/helm-charts
helm repo add harbor      https://helm.goharbor.io
helm repo add openbao     https://openbao.github.io/openbao-helm
helm repo add csi-driver  https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts
helm repo add headlamp    https://kubernetes-sigs.github.io/headlamp
helm repo add kyverno     https://kyverno.github.io/kyverno
helm repo add trivy       https://aquasecurity.github.io/helm-charts
helm repo add otel        https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add ollama      https://otwld.github.io/ollama-helm
helm repo add reloader    https://stakater.github.io/stakater-charts
helm repo add cnpg        https://cloudnative-pg.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server
helm repo update

for chart in longhorn/longhorn \
  vm/victoria-metrics-k8s-stack vm/victoria-logs-single \
  grafana/grafana harbor/harbor openbao/openbao \
  csi-driver/secrets-store-csi-driver \
  headlamp/headlamp kyverno/kyverno trivy/trivy-operator \
  otel/opentelemetry-collector ollama/ollama reloader/reloader \
  cnpg/cloudnative-pg metrics-server/metrics-server; do
  echo "=== $chart ===" && helm search repo "$chart" --versions | head -2
done

# Container images
curl -s https://api.github.com/repos/netbirdio/netbird/releases/latest | jq -r .tag_name
curl -s https://api.github.com/repos/goauthentik/authentik/releases/latest | jq -r .tag_name
curl -s https://api.github.com/repos/traefik/traefik/releases/latest | jq -r .tag_name
# Quote the URL so the shell does not treat "?" as a glob character
curl -s 'https://hub.docker.com/v2/repositories/ollama/ollama/tags?page_size=3' | jq -r '.results[].name'

Phase 1 — Talos

Risk: High — rolling node restart.

// core/platform/talos.go:17
talosVersion = "vX.Y.Z"

Rules:

  • Upgrade one minor version at a time (1.12 → 1.13, not 1.12 → 1.15)
  • Control planes upgrade first, workers after all CPs are healthy
  • Check the k8s version embedded in the new Talos release. If it bumps (e.g., 1.30 → 1.31), update the k8s@X.Y.Z import in workloads/cdk8s.yaml to match and re-run cdk8s import

just core talos up
talosctl --talosconfig ~/.talos/config health --nodes 192.168.1.210
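
The one-minor-at-a-time rule is easy to get wrong when several releases have accumulated. Below is a minimal sketch of a guard, assuming plain MAJOR.MINOR.PATCH tags with an optional leading v. The function name minor_step_ok is hypothetical and not part of the repository.

```shell
# Succeed only when the target has the same major version as the current one
# and its minor version is equal to, or exactly one above, the current minor.
# Accepts "v1.13.3" as well as "1.13.3".
minor_step_ok() (
  cur=${1#v}
  tgt=${2#v}
  cur_major=${cur%%.*}
  tgt_major=${tgt%%.*}
  cur_rest=${cur#*.}
  tgt_rest=${tgt#*.}
  cur_minor=${cur_rest%%.*}
  tgt_minor=${tgt_rest%%.*}
  [ "$cur_major" = "$tgt_major" ] || exit 1
  step=$((tgt_minor - cur_minor))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
)
```

For example, minor_step_ok v1.12.4 v1.13.3 succeeds, while minor_step_ok 1.12.0 1.15.0 fails. The same check fits the Longhorn rule in Phase 6.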

Phase 2 — Cilium

Risk: High — CNI restart interrupts pod networking briefly.

// core/platform/cilium.go:30
Version: pulumi.String("1.X.Y"),

Rules:

  • Verify compatibility with new Talos k8s version: https://docs.cilium.io/en/stable/network/kubernetes/compatibility/
  • wt0 must not be added to Cilium devices — keep only eth0 in cilium.go. See NetBird routing notes.
  • After the upgrade, check the CiliumLoadBalancerIPPool and BGPCiliumPeeringPolicy CR specs in cilium.go:208,225 for field renames

just core platform up
kubectl -n kube-system rollout status daemonset/cilium
kubectl get gateway -n kube-system homelab-gateway

Phase 3 — ArgoCD

Risk: Medium — GitOps engine downtime during pod restart.

// core/platform/argocd.go:24
Version: pulumi.String("9.X.Y"),

TLSRoute service name — critical: The TLSRoute backend at argocd.go:148 hardcodes a Helm-generated service name argo-cd-964152f1-argocd-server. The hash suffix may change on chart upgrade. After just core platform up, verify and update if needed:

kubectl get svc -n argocd | grep argocd-server
# Update argocd.go:148 if the name changed, then re-run just core platform up
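
The comparison can be scripted so that a renamed service is caught right after the chart upgrade. This is a minimal sketch, assuming service names arrive one per line on stdin. check_argocd_server_svc is a hypothetical helper, not something the repository provides.

```shell
# Read service names (one per line) on stdin.
# Succeed if the expected name is present; otherwise print any names ending in
# "argocd-server" to stderr as rename candidates and fail.
check_argocd_server_svc() (
  expected=$1
  names=$(cat)
  if printf '%s\n' "$names" | grep -qxF -- "$expected"; then
    exit 0
  fi
  printf '%s\n' "$names" | grep -- 'argocd-server$' >&2
  exit 1
)
```

One way to feed it is kubectl get svc -n argocd -o name | sed 's|^service/||' | check_argocd_server_svc argo-cd-964152f1-argocd-server, where the argument is whatever name argocd.go:148 currently hardcodes.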

ApplicationSet UI: Already enabled. ArgoCD chart 9.x (ArgoCD 3.x) shows ApplicationSets under the top-level ApplicationSets tab in the UI. No config changes needed.

just core platform up
kubectl rollout status deployment argocd-server -n argocd

Phase 4 — cert-manager

Risk: Low — but high blast radius if broken (TLS for all services).

# workloads/cdk8s.yaml
- helm:https://charts.jetstack.io/cert-manager@1.X.Y

just synth && git push
kubectl get certificaterequests -A  # must all show Ready=True

Phase 5 — OpenBao + CSI Driver

Risk: Medium — all running apps read secrets through this layer.

// workloads/secrets/openbao.go:42
Version: jsii.String("0.X.Y"),
// workloads/secrets/openbao.go:98
"image": "openbao/openbao:2.X.Y",

# workloads/cdk8s.yaml
- helm:https://openbao.github.io/openbao-helm/openbao@0.X.Y
- helm:https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts/secrets-store-csi-driver@1.X.Y

Helm chart version and image tag must be compatible — check OpenBao release notes.

just synth && git push
kubectl logs -n openbao -l app.kubernetes.io/name=openbao -c unseal
kubectl exec -n openbao openbao-0 -- bao status  # sealed: false
kubectl exec -n grafana deploy/grafana -- cat /mnt/secrets/ADMIN_PASSWORD  # smoke test

Phase 6 — Longhorn + CNPG

Risk: Medium — storage disruption possible.

Longhorn

# workloads/cdk8s.yaml
- helm:https://charts.longhorn.io/longhorn@1.X.Y

One minor version at a time only (e.g., 1.10 → 1.11, not 1.10 → 1.12 directly).

just synth && git push
kubectl get nodes.longhorn.io -n longhorn-system
kubectl get volume.longhorn.io -n longhorn-system  # all must be healthy

CNPG

// workloads/databases/cnpg.go:31
Version: jsii.String("0.X.Y"),

CNPG operator upgrades are backwards-compatible with existing Cluster CRs.

kubectl get cluster -A  # all clusters healthy

Phase 7 — Bifrost Docker Compose

Risk: Low — independent of k8s cluster.

# core/cloud/bifrost/docker-compose.yml
traefik:vX.Y                               # line 35
netbirdio/netbird-server:0.X.Y             # line 59
netbirdio/dashboard:0.X.Y                  # line 73  (was: latest)
netbirdio/reverse-proxy:0.X.Y             # line 82  (was: latest)
netbirdio/netbird:0.X.Y                    # line 94
ghcr.io/goauthentik/server:20XX.X.X        # lines 137, 168
postgres:16.X-alpine                       # line 115  (minor bumps only)

NetBird rule: All four NetBird components (server, dashboard, reverse-proxy, agent on Bifrost) must be on the same version. Also update workloads/networking/netbird_peer.go:139,158 to match.
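
The inventory lists NetBird tags both with and without a leading v, so comparing the raw strings would report a false mismatch. Below is a minimal sketch of a normalizing check. same_version is a hypothetical helper, and the versions passed to it are assumed to be read off docker-compose.yml and netbird_peer.go by hand.

```shell
# Succeed only if every argument names the same version once an optional
# leading "v" is stripped; report the first mismatch on stderr.
same_version() (
  first=${1#v}
  for v in "$@"; do
    if [ "${v#v}" != "$first" ]; then
      echo "version mismatch: $v vs $first" >&2
      exit 1
    fi
  done
)
```

For example, same_version 0.71.4 v0.71.4 v0.71.4 0.71.4 0.71.4 succeeds for the server, dashboard, reverse-proxy, Bifrost agent, and k8s peer, and fails as soon as one of them is bumped alone.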

Authentik migration race (fixed in bootstrap.sh): On Authentik upgrades, bootstrap.sh now runs ak migrate explicitly (via a one-off server container) before starting authentik-server and authentik-worker. This prevents a crash-loop where the server queries new ORM columns that haven't been added yet. If just core hetzner up fails at the Authentik health check step, SSH in and run:

docker exec authentik-server ak migrate
docker restart authentik-server authentik-worker

NetBird ip rules lost on container restart: After restarting netbird-agent, verify ip rules are restored — ip rule show must show rules for table 7120. If missing, restart the container again: docker restart netbird-agent. This restores the 192.168.1.0/24 policy route needed for Traefik → k8s traffic.

just core hetzner up
# Verify NetBird mesh:
curl -sk https://netbird.madhan.app/api/peers -H "Authorization: Bearer $NB_TOKEN"
# Verify ip rules on bifrost host:
ssh root@178.156.199.250 "ip rule show | grep 7120"
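
The ip rule check and the restart fallback can be combined into one step. This is a minimal sketch meant to run on the Bifrost host. The helper names and the 10-second wait are assumptions, and the pattern assumes ip rule show prints the table as "lookup 7120".

```shell
# Read `ip rule show` output on stdin; succeed if any rule points at the
# given table. The trailing "( |$)" keeps 7120 from matching 71200.
has_rule_for_table() {
  grep -Eq "lookup $1( |\$)"
}

# Restart netbird-agent once if the policy rules for table 7120 are missing,
# then report whether the rules came back.
ensure_netbird_rules() {
  if ip rule show | has_rule_for_table 7120; then
    return 0
  fi
  docker restart netbird-agent
  sleep 10  # assumption: enough time for the agent to reinstall its rules
  ip rule show | has_rule_for_table 7120
}
```

A non-zero exit from ensure_netbird_rules means the rules are still missing after one restart and the host needs a manual look.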

Phase 8 — Workloads: Low Risk

No inter-dependencies. Update all in one commit, run just synth, push.

Component       Change
Kyverno         cdk8s.yaml
Falco           security/falco.go:84
Metrics Server  cdk8s.yaml
Reloader        support/reloader.go:20
Fleet           cdk8s.yaml
OTel Collector  observability/otel_collector.go:168,227 (both agent + gateway releases)
Headlamp        cdk8s.yaml
Trivy           cdk8s.yaml and security/trivy.go:22 (both must match; the CRD fetch URL is built from this version)

just synth && git push
kubectl get applications -n argocd  # all Synced + Healthy

Phase 9 — Workloads: Medium Risk

VictoriaMetrics + VictoriaLogs

# workloads/cdk8s.yaml
- helm:https://victoriametrics.github.io/helm-charts/victoria-metrics-k8s-stack@0.X.Y
- helm:https://victoriametrics.github.io/helm-charts/victoria-logs-single@0.X.Y

Also update workloads/observability/victoria_metrics.go:69. Run helm diff upgrade first — the values schema changes frequently between minor versions.

Grafana

# workloads/cdk8s.yaml
- helm:https://grafana-community.github.io/helm-charts/grafana@X.Y.Z

Grafana 10 → 11 warning: Major version break. Angular plugin support removed. Datasource plugin API changed. Read the migration guide before upgrading past 11.0. Verify all dashboards load after upgrade.

Harbor

# workloads/cdk8s.yaml
- helm:https://helm.goharbor.io/harbor@1.X.Y

Minor bumps only. After upgrade verify harbor:80 routing still works (Harbor nginx proxy quirk — route must target harbor:80, not harbor-core:80).

NVIDIA GPU Operator

# workloads/cdk8s.yaml
- helm:https://helm.ngc.nvidia.com/nvidia/gpu-operator@X.Y.Z

Also update the device plugin and DCGM exporter versions in workloads/hardware/nvidia_gpu_operator.go:47,99. Verify that the RTX 5070 Ti (sm_120, Blackwell) is still supported in the new operator release; Blackwell support was added in the 570.x driver series.

kubectl exec -n ollama deploy/ollama -- nvidia-smi

Phase 10 — Workloads: Complex

Rancher

# workloads/cdk8s.yaml
- helm:https://releases.rancher.com/server-charts/stable/rancher@2.X.Y

Update --kube-version flag in workloads/management/rancher.go:117 to match the actual cluster k8s version after the Talos upgrade:

HelmFlags: &[]*string{jsii.String("--kube-version"), jsii.String("1.3X.0")},

n8n

// workloads/automation/n8n.go:184
Version: jsii.String("2.X.Y"),

Check latest: helm show chart oci://8gears.container-registry.com/library/n8n. n8n runs its DB schema migrations automatically on pod start against the CNPG-managed database. Verify that n8n is healthy after the upgrade:

kubectl rollout status deployment n8n -n n8n

Ollama

// workloads/ai/ollama.go:33
"tag": "0.X.Y",

Also update chart in cdk8s.yaml. Downloaded models stay in the PVC — no re-pull needed.

kubectl exec -n ollama deploy/ollama -- ollama list

ComfyUI

// workloads/ai/comfyui.go:73
Image: jsii.String("yanwk/comfyui-boot:cu128-megapak-YYYYMMDD"),

Check yanwk/comfyui-boot tags — use cu128-megapak-* variants only (CUDA 12.8 required for RTX 5070 Ti / sm_120). Do not use latest-cu128 — that tag does not exist.
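
A tag can be validated before it is written into comfyui.go. The following is a minimal sketch, assuming tags follow the cu128-megapak-YYYYMMDD shape shown above with an eight-digit date. valid_comfyui_tag is a hypothetical helper.

```shell
# Succeed only for tags of the form cu128-megapak-<8 digits>.
valid_comfyui_tag() {
  case "$1" in
    cu128-megapak-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}
```

With this check, valid_comfyui_tag cu128-megapak-20260223 succeeds and valid_comfyui_tag latest-cu128 fails. It only checks the shape of the tag, not whether the image exists in the registry.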

NetBird Peer (k8s)

// workloads/networking/netbird_peer.go:139,158  (both init + main containers)
Image: jsii.String("netbirdio/netbird:0.X.Y"),

Must match Bifrost docker-compose NetBird version exactly. Upgrade Bifrost and k8s peer in the same commit.

Kubeflow

Kubeflow is deployed via Kustomize from kubeflow/manifests, not a Helm chart. Upgrade by updating targetRevision in the ApplicationSet at core/platform/argocd.go:175. Check the kubeflow/manifests releases for the version compatible with the current k8s version.


cdk8s Import Regeneration

After changing any chart version in cdk8s.yaml, regenerate typed Go bindings:

cd workloads
cdk8s import        # regenerates workloads/imports/*
go mod tidy
just synth          # verify no compile errors before pushing

If Talos upgrades the embedded k8s version, also update the k8s@X.Y.Z import at the top of cdk8s.yaml to match.
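
To see whether the import has drifted from the cluster, the pinned version can be read from cdk8s.yaml and compared with the API server. This is a minimal sketch, assuming the import is written as a list item of the form "- k8s@X.Y.Z". Both helper names are hypothetical.

```shell
# Print the version pinned by the first "- k8s@X.Y.Z" entry in a cdk8s.yaml file.
k8s_import_version() {
  sed -n 's/^[[:space:]]*-[[:space:]]*k8s@\([0-9.]*\).*$/\1/p' "$1" | head -n 1
}

# Reduce "v1.35.2" or "1.35.0" to "1.35" so that patch releases compare equal.
major_minor() {
  echo "${1#v}" | cut -d. -f1,2
}
```

One possible comparison is [ "$(major_minor "$(k8s_import_version workloads/cdk8s.yaml)")" = "$(major_minor "$(kubectl version -o json | jq -r .serverVersion.gitVersion)")" ], which fails when the cluster has moved to a newer minor than the import.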


Post-Upgrade Verification

# Cluster health
talosctl --talosconfig ~/.talos/config health --nodes 192.168.1.210

# All ArgoCD apps synced
kubectl get applications -n argocd

# Secrets layer
kubectl exec -n openbao openbao-0 -- bao status
kubectl exec -n grafana deploy/grafana -- cat /mnt/secrets/ADMIN_PASSWORD
kubectl exec -n harbor deploy/harbor-secret-sync -- ls /mnt/secrets/

# Storage
kubectl get nodes.longhorn.io -n longhorn-system
kubectl get volume.longhorn.io -n longhorn-system
kubectl get cluster -A

# GPU
kubectl exec -n ollama deploy/ollama -- nvidia-smi

# Bifrost routing
curl -I https://auth.madhan.app
curl -I https://grafana.madhan.app

# NetBird tunnel (from Bifrost VPS)
docker exec netbird-agent netbird status

Breaking Change Reference

Upgrade                Breaking change
Talos minor bump       Embedded k8s version may change; check the k8s@ version in cdk8s.yaml
Cilium any             CiliumLoadBalancerIPPool / BGPCiliumPeeringPolicy field renames; check cilium.go:208,225
ArgoCD chart major     TLSRoute backend service name hash changes; check argocd.go:148
Grafana 10 → 11        Angular plugins removed; datasource API changed
Longhorn minor         Upgrade one minor at a time only
Rancher any            --kube-version flag must match cluster k8s version
NetBird any            All components (server/dashboard/proxy/agent/peer) must be on identical version
Authentik any          Migration race: bootstrap.sh runs ak migrate first; if the health check times out, SSH in and force-migrate (see Phase 7)
NetBird agent restart  After netbird-agent restart, verify ip rule show | grep 7120; missing rules = restart agent again
Trivy any              trivyVersion const in trivy.go:22 and cdk8s.yaml import must match
ComfyUI any            Only cu128-megapak-* tags work on RTX 5070 Ti; do not use latest-cu128