# Ollama
LLM inference server running on the RTX 5070 Ti GPU.
## Overview

| Property | Value |
|---|---|
| CDK8s file | platform/cdk8s/cots/ai/ollama.go |
| Namespace | ollama |
| Helm chart | ollama v1.41.0 (otwld.github.io/ollama-helm) |
| HTTPRoute | ollama.madhan.app → ollama:11434 |
| UI | No (REST API only) |
| Node | k8s-worker4 (GPU) |
## Purpose
Runs open-source LLMs locally on the NVIDIA RTX 5070 Ti. Exposes Ollama's REST API at http://ollama.madhan.app, including the OpenAI-compatible endpoints served under /v1.
Supported models include llama3.2, mistral, deepseek-r1, and any other model from the Ollama library.
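For clients that already speak the OpenAI wire format, the same host serves chat completions under /v1. A minimal Go sketch, non-streaming for brevity; the model and prompt are illustrative:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Request body in the OpenAI chat-completions format; llama3.2 must already be pulled.
	body, _ := json.Marshal(map[string]interface{}{
		"model": "llama3.2",
		"messages": []map[string]string{
			{"role": "user", "content": "Hello!"},
		},
	})

	resp, err := http.Post("http://ollama.madhan.app/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode only the first choice's message content.
	var out struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Choices[0].Message.Content)
}
```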
## GPU Configuration
```yaml
runtimeClassName: nvidia
resources:
  requests:
    memory: "2Gi"
  limits:
    nvidia.com/gpu: "1"
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: all
```
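In ollama.go these values are handed to the Helm chart from Go. A minimal sketch, assuming the chart is rendered with cdk8s's Helm construct and that the value keys match the snippet above; the wiring in the real file may differ:

```go
package main

import (
	"github.com/aws/jsii-runtime-go"
	"github.com/cdk8s-team/cdk8s-core-go/cdk8s/v2"
)

func main() {
	app := cdk8s.NewApp(nil)
	chart := cdk8s.NewChart(app, jsii.String("ollama"), &cdk8s.ChartProps{
		Namespace: jsii.String("ollama"),
	})

	// Render the otwld ollama-helm chart with the GPU values shown above.
	// Note: the cdk8s Helm construct shells out to the helm binary at synth time.
	cdk8s.NewHelm(chart, jsii.String("ollama-helm"), &cdk8s.HelmProps{
		Chart:   jsii.String("ollama"),
		Repo:    jsii.String("https://otwld.github.io/ollama-helm"),
		Version: jsii.String("1.41.0"),
		Values: &map[string]interface{}{
			"runtimeClassName": "nvidia",
			"resources": map[string]interface{}{
				"requests": map[string]interface{}{"memory": "2Gi"},
				"limits":   map[string]interface{}{"nvidia.com/gpu": "1"},
			},
			"env": []interface{}{
				map[string]interface{}{"name": "NVIDIA_VISIBLE_DEVICES", "value": "all"},
			},
		},
	})

	app.Synth()
}
```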
## API Usage
```bash
# List available models
curl http://ollama.madhan.app/api/tags

# Run inference (streaming)
curl http://ollama.madhan.app/api/generate \
  -d '{"model": "llama3.2", "prompt": "Hello!"}'

# Pull a new model
curl http://ollama.madhan.app/api/pull \
  -d '{"name": "mistral"}'
```
## Node Scheduling
The pod is scheduled onto k8s-worker4 via a nodeSelector or nodeName constraint targeting the GPU node; the nodeSelector variant is sketched below.
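If the nodeSelector route is used, the Helm values fragment would look roughly like the following. This is a hypothetical illustration using the standard kubernetes.io/hostname label; whether ollama.go uses this or nodeName is not confirmed here:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Hypothetical scheduling fragment: pin the Ollama pod to the GPU node by hostname.
	// This would be merged into the same Values map shown in the GPU Configuration sketch.
	values := map[string]interface{}{
		"nodeSelector": map[string]interface{}{
			"kubernetes.io/hostname": "k8s-worker4",
		},
	}
	out, _ := json.MarshalIndent(values, "", "  ")
	fmt.Println(string(out)) // prints the fragment as JSON for inspection
}
```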
## Coexistence with ComfyUI
Ollama (2 Gi RAM request) and ComfyUI (1 Gi RAM request) run simultaneously on the same node, sharing the GPU via time-slicing. The node has roughly 5.4 Gi of allocatable RAM, so the combined 3 Gi of requests fits within capacity.