A Kubernetes-based LLM inference deployment with GPU support, model management, and monitoring using vLLM, Istio, MinIO, Prometheus, and Grafana.
This platform enables deployment and serving of large language models on Kubernetes with:
- GPU-accelerated inference using vLLM
- Centralized model storage using MinIO
- Web-based interaction via Open WebUI
- Observability using Prometheus and Grafana
- Secure traffic routing via Istio Gateway
The Istio Gateway is the entry point for all external traffic.
Hosts:
- `open-webui.*` → Open WebUI
- `grafana.*` → Grafana dashboards
- `minio-console.*` → MinIO console
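A minimal sketch of how this routing could be expressed with an Istio `Gateway` plus one `VirtualService` per host. All names, namespaces, and ports are illustrative assumptions, and `example.com` stands in for the cluster's real domain behind the `open-webui.*` patterns:

```yaml
# Sketch: Istio Gateway accepting the three hostnames (names and domain are assumptions).
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: platform-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway            # default Istio ingress gateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP               # TLS termination would be added here in a hardened setup
      hosts:
        - "open-webui.example.com"
        - "grafana.example.com"
        - "minio-console.example.com"
---
# Sketch: route the Open WebUI host to its Service (service name/port assumed).
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: open-webui
  namespace: istio-system
spec:
  hosts:
    - "open-webui.example.com"
  gateways:
    - platform-gateway
  http:
    - route:
        - destination:
            host: open-webui.open-webui.svc.cluster.local
            port:
              number: 8080
```

Grafana and the MinIO console would get analogous `VirtualService` objects pointing at their own Services.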
Open WebUI provides the main web interface for interacting with the deployed LLM model.
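Open WebUI reaches the model through vLLM's OpenAI-compatible API. A hedged sketch of that wiring, assuming the vLLM Service is named `vllm` in a `vllm` namespace and listens on port 8000 (all names are assumptions):

```yaml
# Sketch: Open WebUI deployment pointing at the vLLM OpenAI-compatible endpoint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            - name: OPENAI_API_BASE_URL
              value: "http://vllm.vllm.svc.cluster.local:8000/v1"
            - name: OPENAI_API_KEY
              value: "not-needed"          # vLLM without auth ignores the key value
```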
The vLLM deployment runs in its own namespace with the following pieces (a manifest sketch follows the list):

- **vLLM Pod (GPU)**: runs the inference server on GPU-enabled nodes.
- **Init Container**: downloads model files from MinIO during pod startup.
- **Persistent Volume (PVC)**: stores downloaded model files to avoid re-downloading.
- **Network Policies**: restrict traffic to only the required services (WebUI, monitoring).
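A condensed sketch of those pieces, assuming a `vllm` namespace, a MinIO Service at `minio.minio.svc.cluster.local:9000`, a `models` bucket, a `minio-credentials` Secret, and the public `vllm/vllm-openai` image; every name, path, and size here is an illustrative assumption:

```yaml
# Sketch: PVC caching downloaded model files across pod restarts.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: vllm
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi                       # assumed size; depends on the model
---
# Sketch: vLLM Deployment with an init container that pulls the model from MinIO.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
      initContainers:
        - name: fetch-model
          image: minio/mc:latest
          command: ["/bin/sh", "-c"]
          args:
            - >
              mc alias set minio http://minio.minio.svc.cluster.local:9000 "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" &&
              mc mirror --overwrite minio/models/llama-3-8b /models/llama-3-8b
          envFrom:
            - secretRef:
                name: minio-credentials    # assumed Secret with MINIO_ACCESS_KEY / MINIO_SECRET_KEY
          volumeMounts:
            - name: model-cache
              mountPath: /models
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "/models/llama-3-8b", "--port", "8000"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1            # GPU requested via resource limits
          volumeMounts:
            - name: model-cache
              mountPath: /models
```

The network-policy piece could look like the sketch below, allowing ingress only from the Open WebUI and monitoring namespaces (namespace names and labels are assumptions):

```yaml
# Sketch: restrict ingress to the vLLM pods to WebUI and monitoring traffic only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-allow-webui-and-monitoring
  namespace: vllm
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: open-webui
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8000
```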
MinIO provides centralized object storage for LLM model artifacts.
- Stores model weights
- Used by init containers to fetch models
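Model weights are typically seeded into the bucket out of band; one hedged way to do it from inside the cluster is a one-off Job using the `mc` client (bucket, paths, image, and credential names are assumptions):

```yaml
# Sketch: one-off Job copying staged model files into the MinIO "models" bucket.
apiVersion: batch/v1
kind: Job
metadata:
  name: upload-model
  namespace: minio
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: staging
          persistentVolumeClaim:
            claimName: model-staging       # assumed PVC already holding the model files
      containers:
        - name: upload
          image: minio/mc:latest
          command: ["/bin/sh", "-c"]
          args:
            - >
              mc alias set minio http://minio.minio.svc.cluster.local:9000 "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" &&
              mc mb --ignore-existing minio/models &&
              mc mirror /staging/llama-3-8b minio/models/llama-3-8b
          envFrom:
            - secretRef:
                name: minio-credentials
          volumeMounts:
            - name: staging
              mountPath: /staging
```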
Monitoring is provided by Prometheus and Grafana:
- Prometheus collects metrics from nodes and workloads
- Grafana visualizes GPU, pod, and system performance
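vLLM exposes Prometheus metrics at `/metrics` on its HTTP port. If the monitoring stack uses the Prometheus Operator (an assumption), a `ServiceMonitor` along these lines would scrape them; names and labels are illustrative:

```yaml
# Sketch: scrape vLLM's /metrics endpoint (assumes Prometheus Operator CRDs are installed).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: monitoring
  labels:
    release: kube-prometheus-stack         # assumed label matched by the Prometheus selector
spec:
  namespaceSelector:
    matchNames: ["vllm"]
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: http                           # assumes the vLLM Service names its 8000 port "http"
      path: /metrics
      interval: 30s
```

GPU-level panels in Grafana would typically be fed by a separate exporter such as NVIDIA's DCGM exporter rather than by vLLM itself.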
GPU scheduling works as follows:
- GPU nodes are used exclusively for inference workloads
- vLLM pods request GPUs via Kubernetes resource limits
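One common way to keep GPU nodes dedicated to inference is to taint them and let only the vLLM pods tolerate that taint; the taint key, node label, and values below are assumptions about this cluster:

```yaml
# Sketch: pod-template fragment for the vLLM Deployment. Pairs with a node taint such as:
#   kubectl taint nodes <gpu-node> dedicated=inference:NoSchedule
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"         # assumed GPU node label (e.g. from the GPU operator)
  tolerations:
    - key: dedicated
      operator: Equal
      value: inference
      effect: NoSchedule
  containers:
    - name: vllm
      resources:
        limits:
          nvidia.com/gpu: 1                # GPU requested via Kubernetes resource limits
```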





