Ollama

LLM inference server on the RTX 5070 Ti GPU.

What is Ollama?

Ollama is an open-source LLM inference server that makes it easy to run large language models locally. It provides an OpenAI-compatible REST API and manages model downloads, GPU memory loading, and inference serving in a single binary.
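
A quick liveness probe against this API (the /api/version endpoint is part of Ollama's public API):

# Check the server is up and which version is running
curl http://ollama.madhan.app/api/version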

Why Ollama?

Ollama is the simplest way to run local LLMs on a GPU. It handles model quantization and GPU memory management, and serves an OpenAI-compatible API that works with any client built on the OpenAI SDKs (Python, TypeScript, etc.). The Helm chart provides native Kubernetes integration.
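
In practice, pointing an OpenAI-SDK client at Ollama is just a base-URL override; a minimal sketch (recent openai SDKs read OPENAI_BASE_URL, older versions need the base URL passed in code; the key is required by the SDK but ignored by Ollama):

# Redirect any OpenAI-SDK client to the local Ollama endpoint
export OPENAI_BASE_URL=http://ollama.madhan.app/v1
export OPENAI_API_KEY=ollama   # placeholder; Ollama does not check it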

How It's Used Here

Ollama runs on k8s-worker4 (the GPU node, 192.168.1.224) using the RTX 5070 Ti for inference. It stores downloaded model files on a 100Gi Longhorn PVC.

Source: workloads/ai/ollama.go
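
Two read-only checks to confirm placement and storage (assuming the chart's app.kubernetes.io/name=ollama label used in Troubleshooting below):

# Pod should be scheduled on k8s-worker4
kubectl get pods -n ollama -o wide

# Model PVC should be Bound at 100Gi
kubectl get pvc -n ollama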

Configuration

Setting                | Value                            | Why
Namespace              | ollama                           | Isolated namespace
Image                  | ollama/ollama:0.17.0             | Pinned version
HTTPRoute              | ollama.madhan.app → ollama:11434 | Gateway API
runtimeClassName       | nvidia                           | Routes through nvidia-container-runtime
NVIDIA_VISIBLE_DEVICES | all                              | Make all GPU devices visible
nvidia.com/gpu limit   | 1                                | One time-sliced virtual GPU
Node selector          | nvidia.com/gpu.present: "true"   | Schedule on GPU node
Toleration             | dedicated=ai:NoSchedule          | Allow scheduling on tainted GPU node
CPU limit              | 4000m                            | Ollama + ComfyUI both CPU-hungry at inference
RAM request            | 2Gi                              | Host RAM for model metadata + process
RAM limit              | 4Gi                              | Keep below worker4's 16 GiB total (shared with ComfyUI)
Model PVC              | 100Gi Longhorn                   | Stores downloaded model weights

Note: memory: 4Gi is the host RAM cgroup limit, NOT GPU VRAM. GPU VRAM (16 GB) is controlled by the nvidia.com/gpu: 1 resource limit and is shared with ComfyUI via time-slicing.
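
To verify both limits on the live object, one hedged check (assumes the Deployment is named ollama; the Helm release name may differ):

# Expect memory: 4Gi and nvidia.com/gpu: 1 in the limits
kubectl get deploy -n ollama ollama \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'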

API Usage

# List available models
curl http://ollama.madhan.app/api/tags

# Run inference (streaming)
curl http://ollama.madhan.app/api/generate \
  -d '{"model": "llama3.2", "prompt": "Hello!"}'

# Pull a new model (stores to 100Gi PVC)
curl http://ollama.madhan.app/api/pull \
  -d '{"name": "mistral"}'

# OpenAI-compatible chat completions
curl http://ollama.madhan.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'

Coexistence with ComfyUI

Ollama and ComfyUI share the RTX 5070 Ti via GPU time-slicing (2 virtual GPUs from 1 physical). VRAM is not isolated: approximately 4 GB (a 7B model in Ollama) plus 6 GB (SDXL in ComfyUI) fit within the 16 GB pool.

A large model (e.g. a quantized 70B at ~40 GB of VRAM) exceeds the card's 16 GB outright, with or without ComfyUI, and even mid-size models can exhaust VRAM while ComfyUI holds a workflow. Use smaller quantized models or stop ComfyUI first.
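
Before loading anything large, it helps to see what is already resident (/api/ps lists models Ollama currently holds in VRAM; the nvidia-smi query flags are standard):

# Models Ollama currently has loaded
curl http://ollama.madhan.app/api/ps

# Combined VRAM usage across Ollama and ComfyUI
kubectl exec -n ollama "$(kubectl get pod -n ollama -l app.kubernetes.io/name=ollama -o name | head -n1)" \
  -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv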

How It Connects

Browser / API client → ollama.madhan.app
  → homelab-gateway → ollama:11434
  → Ollama pod on k8s-worker4
  → nvidia-container-runtime (GPU injection)
  → RTX 5070 Ti (VRAM for model weights)
  → 100Gi Longhorn PVC (model file storage)
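
Each hop can be checked independently; for the Gateway API leg (route and gateway names are whatever the chart created, so treat this as a sketch):

# HTTPRoute should exist and be accepted by homelab-gateway
kubectl get httproute -n ollama
kubectl describe httproute -n ollama | grep -A 3 Conditions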

Troubleshooting

Model Pull Fails / OOM

Symptoms: Model pull command hangs or returns OOM error.

# Check disk space on PVC (kubectl exec takes a pod name, not a label selector)
kubectl exec -n ollama "$(kubectl get pod -n ollama -l app.kubernetes.io/name=ollama -o name | head -n1)" \
  -- df -h /root/.ollama

# Check GPU memory
kubectl exec -n ollama "$(kubectl get pod -n ollama -l app.kubernetes.io/name=ollama -o name | head -n1)" \
  -- nvidia-smi
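
If disk and GPU memory both look healthy, the server log usually names the real failure (registry timeout, bad manifest, checksum mismatch):

# Tail the Ollama server log
kubectl logs -n ollama -l app.kubernetes.io/name=ollama --tail=100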

GPU Not Available in Pod

# Verify pod spec
kubectl get pod -n ollama -o yaml | grep -E "runtimeClass|nvidia|dedicated"

# Check node advertises GPU
kubectl describe node k8s-worker4 | grep "nvidia.com/gpu"
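
If the node advertises zero GPUs, check the cluster-scoped RuntimeClass and the device plugin; the plugin's namespace depends on how it was installed, so the grep below is a loose sketch:

# runtimeClassName: nvidia only schedules if this object exists
kubectl get runtimeclass nvidia

# Device plugin pods should be Running (namespace varies by install)
kubectl get pods -A | grep -i nvidia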

Inference Too Slow

At idle, the RTX 5070 Ti runs in 0dB fan mode and is essentially cold, but most first-request latency is model loading rather than warm-up: Ollama evicts model weights from VRAM after the keep_alive window (5 minutes by default), so the first inference after idle reloads them from the PVC. Subsequent inferences are faster.
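
A hedged way to confirm it is load latency rather than thermals: time a cold request against a warm one, then pin the model in VRAM with keep_alive (a documented /api/generate field; a request with no prompt just loads the model):

# Cold request (loads weights), then warm request for comparison
time curl -s http://ollama.madhan.app/api/generate \
  -d '{"model": "llama3.2", "prompt": "hi", "stream": false}' > /dev/null
time curl -s http://ollama.madhan.app/api/generate \
  -d '{"model": "llama3.2", "prompt": "hi", "stream": false}' > /dev/null

# Keep the model resident for an hour instead of the 5-minute default
curl http://ollama.madhan.app/api/generate \
  -d '{"model": "llama3.2", "keep_alive": "1h"}'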