# ComfyUI
Node-based image generation UI for Stable Diffusion and Flux models.
## What is ComfyUI?
ComfyUI is a powerful, modular, node-based GUI for Stable Diffusion and Flux image generation models. It allows building complex image generation pipelines by connecting nodes (samplers, models, VAEs, ControlNets, LoRAs, etc.) in a visual graph editor.
## Why ComfyUI?
ComfyUI is the most flexible and performance-oriented frontend for diffusion models. Unlike Automatic1111 (which uses a form-based interface), ComfyUI exposes every parameter of the generation pipeline as a connectable node, enabling complex workflows that would be impossible in simpler UIs.
## How It's Used Here
ComfyUI runs on k8s-worker4 (GPU node) as a standard Kubernetes Deployment. Model files are stored on a 50 Gi RWX Longhorn PVC shared across pod restarts.
Source: `workloads/ai/comfyui.go`
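To confirm the deployment described above is healthy, you can check where the pod landed and whether the GPU is visible inside the container. This is a hedged sketch: it assumes the Deployment is named `comfyui` in the `comfyui` namespace, matching the table below.

```shell
# NODE column should show k8s-worker4 (the GPU node)
kubectl get pods -n comfyui -o wide

# If the nvidia runtime class and device plugin are working,
# nvidia-smi inside the container should list the RTX 5070 Ti
kubectl exec -n comfyui deploy/comfyui -- nvidia-smi
```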
## Configuration
| Setting | Value | Why |
|---|---|---|
| Namespace | comfyui | Isolated namespace |
| Image | yanwk/comfyui-boot:cu128-megapak-20260223 | CUDA 12.8, matching Talos driver 570.x |
| HTTPRoute | comfyui.madhan.app → comfyui:8188 | Gateway API |
| Deploy strategy | Recreate | GPU workloads cannot have two pods claiming nvidia.com/gpu simultaneously |
| runtimeClassName | nvidia | Routes through nvidia-container-runtime |
| NVIDIA_VISIBLE_DEVICES | all | Make all GPU devices visible |
| nvidia.com/gpu limit | 1 | One time-sliced virtual GPU |
| Node selector | nvidia.com/gpu.present: "true" | Schedule on GPU node |
| Toleration | dedicated=ai:NoSchedule | Allow scheduling on tainted GPU node |
| CPU limit | 4000m | CPU-hungry during inference |
| RAM request | 1Gi | Host RAM for process |
| RAM limit | 8Gi | Allow headroom for large model loading |
| Data PVC | 50Gi RWX Longhorn | Models, outputs, custom nodes |
| Data mount path | /home/user/opt/ComfyUI | Image's working directory |
| CLI_ARGS | --listen 0.0.0.0 --port 8188 | Bind all interfaces so the Service can reach the pod |
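The settings in the table correspond roughly to the following Deployment fields. This is a hand-written illustrative sketch, not the manifest generated by `workloads/ai/comfyui.go`; container and volume names are assumptions.

```yaml
# Illustrative sketch of the GPU-related pod spec (values from the table above)
spec:
  strategy:
    type: Recreate
  template:
    spec:
      runtimeClassName: nvidia
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: dedicated
          value: ai
          effect: NoSchedule
      containers:
        - name: comfyui                # assumed container name
          image: yanwk/comfyui-boot:cu128-megapak-20260223
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: CLI_ARGS
              value: "--listen 0.0.0.0 --port 8188"
          resources:
            requests:
              memory: 1Gi
            limits:
              cpu: "4"
              memory: 8Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: data               # assumed volume name
              mountPath: /home/user/opt/ComfyUI
```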
### Image Note
The image tag `cu128-megapak-20260223` is a dated CUDA 12.8 "megapak" build that includes PyTorch, xformers, and many common custom nodes pre-installed. The `latest-cu128` tag does not exist on Docker Hub; always pin a specific dated tag such as `cu128-megapak-20260223`.
### Recreate Strategy
ComfyUI uses `strategy: Recreate` instead of the default `RollingUpdate`:

```go
Strategy: &k8s.DeploymentStrategy{Type: jsii.String("Recreate")},
```
This is required because GPU workloads cannot have two pods claiming nvidia.com/gpu simultaneously. With RollingUpdate, the new pod starts before the old pod terminates, causing the new pod to get stuck waiting for the GPU resource.
## Storage
A 50 Gi RWX Longhorn PVC is mounted at `/home/user/opt/ComfyUI`. This stores:
- Model checkpoints (`.safetensors`, `.ckpt`)
- LoRA files
- ControlNet models
- Custom nodes (installed via ComfyUI Manager)
- Generated output images
Using RWX allows future multi-replica scenarios and avoids attach conflicts.
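To see what is consuming the volume, you can measure the main directories in place. A sketch, assuming the image uses the standard ComfyUI layout (`models/`, `output/`, `custom_nodes/`) under the mount path:

```shell
# Per-directory usage on the 50 Gi data PVC
kubectl exec -n comfyui deploy/comfyui -- du -sh \
  /home/user/opt/ComfyUI/models \
  /home/user/opt/ComfyUI/output \
  /home/user/opt/ComfyUI/custom_nodes
```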
## How It Connects
```text
Browser → comfyui.madhan.app
  → homelab-gateway → comfyui:8188
    → ComfyUI pod on k8s-worker4
      → nvidia-container-runtime (GPU injection)
      → RTX 5070 Ti (VRAM for model inference)
      → 50Gi Longhorn RWX PVC (model files, outputs)
```
## Screenshots

## Troubleshooting
### Pod Stuck in ContainerCreating
**Symptoms:** The new pod won't start; events show the GPU resource as unavailable.

**Fix:** The old pod may still be terminating. The Recreate strategy ensures the old pod terminates first, but if it gets stuck:

```shell
kubectl delete pod -n comfyui <old-pod> --grace-period=0 --force
```
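After the stuck pod is gone, it can help to confirm the GPU allocation on the node has been released. A hedged check, assuming the GPU node name `k8s-worker4` from above:

```shell
# "Allocated resources" should show nvidia.com/gpu back at 0 once the pod is gone
kubectl describe node k8s-worker4 | grep -A 10 "Allocated resources" | grep nvidia.com/gpu
```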
### Out of VRAM
**Symptoms:** ComfyUI reports CUDA out of memory during generation.

**Fix:** Use a smaller or quantized model, or free VRAM by unloading models held by Ollama via its API:

```shell
# Unload an Ollama model from VRAM (keep_alive: 0 evicts it immediately)
curl http://ollama.madhan.app/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'
```
### Custom Nodes Not Persisting

Custom nodes installed via ComfyUI Manager persist across restarts because they are written to the 50 Gi PVC at `/home/user/opt/ComfyUI/custom_nodes/`. If the PVC is deleted, all custom nodes are lost.
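To verify that installed custom nodes actually landed on the PVC (and will therefore survive a pod restart), list the directory from inside the running pod. Assumes the Deployment name `comfyui`:

```shell
# Each installed custom node should appear as a subdirectory here
kubectl exec -n comfyui deploy/comfyui -- ls /home/user/opt/ComfyUI/custom_nodes/
```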