ComfyUI

Node-based image generation UI for Stable Diffusion and Flux models.

What is ComfyUI?

ComfyUI is a powerful, modular, node-based GUI for Stable Diffusion and Flux image generation models. It allows building complex image generation pipelines by connecting nodes (samplers, models, VAEs, ControlNets, LoRAs, etc.) in a visual graph editor.

Why ComfyUI?

ComfyUI is the most flexible and performance-oriented frontend for diffusion models. Unlike Automatic1111 (which uses a form-based interface), ComfyUI exposes every parameter of the generation pipeline as a connectable node, enabling complex workflows that would be impossible in simpler UIs.

How It's Used Here

ComfyUI runs on k8s-worker4 (GPU node) as a standard Kubernetes Deployment. Model files are stored on a 50 Gi RWX Longhorn PVC shared across pod restarts.

Source: workloads/ai/comfyui.go

Configuration

| Setting | Value | Why |
|---|---|---|
| Namespace | comfyui | Isolated namespace |
| Image | yanwk/comfyui-boot:cu128-megapak-20260223 | CUDA 12.8, matching Talos driver 570.x |
| HTTPRoute | comfyui.madhan.app → comfyui:8188 | Gateway API |
| Deploy strategy | Recreate | GPU workloads cannot have two pods claiming nvidia.com/gpu simultaneously |
| runtimeClassName | nvidia | Routes through nvidia-container-runtime |
| NVIDIA_VISIBLE_DEVICES | all | Make all GPU devices visible |
| nvidia.com/gpu limit | 1 | One time-sliced virtual GPU |
| Node selector | nvidia.com/gpu.present: "true" | Schedule on GPU node |
| Toleration | dedicated=ai:NoSchedule | Allow scheduling on tainted GPU node |
| CPU limit | 4000m | CPU-hungry during inference |
| RAM request | 1Gi | Host RAM for the process |
| RAM limit | 8Gi | Headroom for large model loading |
| Data PVC | 50Gi RWX Longhorn | Models, outputs, custom nodes |
| Data mount path | /home/user/opt/ComfyUI | Image's working directory |
| CLI_ARGS | --listen 0.0.0.0 --port 8188 | Bind all interfaces so the Service can reach ComfyUI |
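The GPU-scheduling rows above map onto the cdk8s pod spec roughly as follows. This is a sketch in the style of workloads/ai/comfyui.go, not the actual source; the exact struct nesting and helper names in the generated k8s module may differ:

```go
// Sketch: GPU scheduling settings from the table as cdk8s (Go) spec fields.
// Field and helper names assume cdk8s's generated k8s module; verify against
// the actual workloads/ai/comfyui.go before relying on this shape.
PodSpec: &k8s.PodSpec{
	RuntimeClassName: jsii.String("nvidia"), // route through nvidia-container-runtime
	NodeSelector: &map[string]*string{
		"nvidia.com/gpu.present": jsii.String("true"), // pin to the GPU node
	},
	Tolerations: &[]*k8s.Toleration{{
		Key:      jsii.String("dedicated"),
		Operator: jsii.String("Equal"),
		Value:    jsii.String("ai"),
		Effect:   jsii.String("NoSchedule"), // allow scheduling on the tainted GPU node
	}},
	Containers: &[]*k8s.Container{{
		Name:  jsii.String("comfyui"),
		Image: jsii.String("yanwk/comfyui-boot:cu128-megapak-20260223"),
		Resources: &k8s.ResourceRequirements{
			Limits: &map[string]k8s.Quantity{
				"nvidia.com/gpu": k8s.Quantity_FromNumber(jsii.Number(1)), // one time-sliced GPU
			},
		},
	}},
},
```

The node selector and toleration work as a pair: the selector steers the pod onto the GPU node, and the toleration permits it to land there despite the dedicated=ai taint that keeps ordinary workloads off.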

Image Note

The image tag cu128-megapak-20260223 is a dated CUDA 12.8 "megapak" build that includes PyTorch, xformers, and many common custom nodes pre-installed. The latest-cu128 tag does not exist on Docker Hub — always use a specific dated tag like cu128-megapak-20260223.

Recreate Strategy

ComfyUI uses strategy: Recreate instead of the default RollingUpdate:

Strategy: &k8s.DeploymentStrategy{Type: jsii.String("Recreate")},

This is required because GPU workloads cannot have two pods claiming nvidia.com/gpu simultaneously. With RollingUpdate, the new pod starts before the old pod terminates, causing the new pod to get stuck waiting for the GPU resource.

Storage

A 50 Gi RWX Longhorn PVC is mounted at /home/user/opt/ComfyUI. This stores:

  • Model checkpoints (.safetensors, .ckpt)
  • LoRA files
  • ControlNet models
  • Custom nodes (installed via ComfyUI Manager)
  • Generated output images

Using RWX allows future multi-replica scenarios and avoids attach conflicts.
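In cdk8s terms the claim looks roughly like this. A sketch only: the StorageClass name "longhorn" and the generated-type field names are assumptions, not taken from the actual source:

```go
// Sketch: 50Gi RWX Longhorn claim for models, outputs, and custom nodes.
// StorageClass name and struct shapes are assumed; check workloads/ai/comfyui.go.
Spec: &k8s.PersistentVolumeClaimSpec{
	AccessModes:      &[]*string{jsii.String("ReadWriteMany")}, // RWX: no attach conflicts
	StorageClassName: jsii.String("longhorn"),
	Resources: &k8s.ResourceRequirements{
		Requests: &map[string]k8s.Quantity{
			"storage": k8s.Quantity_FromString(jsii.String("50Gi")),
		},
	},
},
```

With ReadWriteMany, a replacement pod can mount the volume even if the old pod's detach is slow, which complements the Recreate strategy during restarts.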

How It Connects

Browser → comfyui.madhan.app
  → homelab-gateway → comfyui:8188
  → ComfyUI pod on k8s-worker4
  → nvidia-container-runtime (GPU injection)
  → RTX 5070 Ti (VRAM for model inference)
  → 50Gi Longhorn RWX PVC (model files, outputs)

Screenshots

ComfyUI node graph showing a Stable Diffusion workflow with samplers, VAE, and ControlNet nodes

Troubleshooting

Pod Stuck in ContainerCreating

Symptoms: New pod won't start, events show GPU resource not available.

Fix: Old pod may still be terminating. The Recreate strategy ensures the old pod terminates first, but if it gets stuck:

kubectl delete pod -n comfyui <old-pod> --grace-period=0 --force

Out of VRAM

Symptoms: ComfyUI shows CUDA out of memory during generation.

Fix: Reduce VRAM pressure (use a smaller or quantized model), or free VRAM by unloading Ollama models that share the same GPU via the Ollama API:

# Unload Ollama models from VRAM
curl http://ollama.madhan.app/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'

Custom Nodes Not Persisting

Custom nodes installed via ComfyUI Manager persist because they write to the 50 Gi PVC at /home/user/opt/ComfyUI/custom_nodes/. If the PVC is deleted, all custom nodes are lost.