HJK Talos Cluster Setup

Welcome to the documentation for the setup and operation of a single-node Kubernetes cluster based on:

  • Talos Linux
  • Cilium (CNI)
  • Rook Ceph (single-node storage for test and development)
  • OS2AI / AarhusAI
  • Access from Ubuntu WSL2 on developer workstations

The purpose of this documentation is to provide a consistent, reproducible, and transparent approach to installing, operating, and troubleshooting the cluster, both for current and future team members.

This documentation describes the how and why of the platform.
The actual infrastructure-as-code (IaC) implementation lives in a separate repository.


Documentation structure

Below is an overview of the main documentation sections grouped by responsibility.


Cluster Foundation

  1. 01 – Environment
  2. 02 – Bootstrap
  3. 03 – Cilium
  4. 04 – Rook Ceph
  5. 05 – WSL Access
  6. 06 – NVIDIA GPU

Platform Management (GitOps & Control Plane)

  1. 07 – Argo CD (GitOps)
  2. 08 – Observability
  3. 09 – Sealed Secrets

Applications (AI Platform Components)

  1. 10 – vLLM
  2. 11 – LiteLLM
  3. 12 – Open WebUI

Operations

  1. 13 – Upgrades
  2. 16 – Talos & Kubernetes Upgrades
  3. 90 – Troubleshooting

1. Environment preparation

Hardware requirements, networking, static IP addressing, WSL2 setup, and required tooling.

Chapter: 01 – Environment


2. Talos installation and cluster bootstrap

Booting from ISO, generating cluster configuration, applying patches, and bootstrapping the control plane.

Chapter: 02 – Bootstrap
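
As a rough orientation, the bootstrap flow summarized above can be sketched as an ordered list of `talosctl` invocations. This is a minimal sketch, not the procedure from the Bootstrap chapter: the cluster name, node IP, and patch file name are hypothetical placeholders, and it assumes a single control-plane node with a usable `talosconfig` already in place (for example through the `TALOSCONFIG` environment variable) for the last two steps. The function only builds the commands and does not run them.

```python
from typing import List


def talos_bootstrap_commands(
    cluster_name: str, node_ip: str, patch_file: str
) -> List[List[str]]:
    """Return the talosctl invocations for a single-node bootstrap, in order.

    All arguments are illustrative placeholders, not values from this cluster.
    """
    # Kubernetes API endpoint of the single control-plane node.
    endpoint = f"https://{node_ip}:6443"
    return [
        # 1. Generate machine configuration, merging in a local patch file.
        ["talosctl", "gen", "config", cluster_name, endpoint,
         "--config-patch", f"@{patch_file}"],
        # 2. Apply the control-plane config to the node booted from the ISO
        #    (maintenance mode, hence --insecure).
        ["talosctl", "apply-config", "--insecure",
         "--nodes", node_ip, "--file", "controlplane.yaml"],
        # 3. Bootstrap etcd on the control-plane node (run exactly once).
        ["talosctl", "bootstrap", "--nodes", node_ip, "--endpoints", node_ip],
        # 4. Fetch a kubeconfig for kubectl access.
        ["talosctl", "kubeconfig", "--nodes", node_ip, "--endpoints", node_ip],
    ]


if __name__ == "__main__":
    # Hypothetical values; 192.0.2.10 is a documentation-only address.
    for cmd in talos_bootstrap_commands("example-cluster", "192.0.2.10", "patch.yaml"):
        print(" ".join(cmd))
```

Keeping the commands as data makes the order explicit and easy to review before anything is executed against a node.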


3. Cilium installation

Installing Cilium as the Kubernetes CNI using Helm, validating networking, and troubleshooting datapath issues.

Chapter: 03 – Cilium


4. Rook Ceph installation

Deploying Rook Ceph in a single-node configuration, configuring StorageClasses, and understanding limitations.

Chapter: 04 – Rook Ceph


5. Cluster access from WSL2

Accessing Talos and Kubernetes from Windows via Ubuntu WSL2, including kubeconfig handling and networking considerations.

Chapter: 05 – WSL Access


6. NVIDIA GPU enablement

Enabling NVIDIA GPUs on Talos, including drivers, container runtime, RuntimeClass, and device plugin.

Chapter: 06 – NVIDIA GPU


7. Argo CD (GitOps)

Bootstrapping GitOps with Argo CD and the app-of-apps pattern (Argo CD Resources), including authentication to private Git repositories.

Chapter: 07 – Argo CD (GitOps)

8. Observability (Prometheus, Grafana, Loki, Tempo)

Deploying and operating the cluster observability stack, including Grafana authentication and datasource wiring for logs and traces.

Chapter: 08 – Observability


9. Sealed Secrets

Managing application secrets securely using Sealed Secrets in a GitOps workflow.

Chapter: 09 – Sealed Secrets


10. vLLM

Deploying GPU-accelerated inference and embedding workloads using vLLM.

Chapter: 10 – vLLM


11. LiteLLM

Deploying LiteLLM as an OpenAI-compatible proxy in front of model backends, including database persistence and guardrails.

Chapter: 11 – LiteLLM


12. Open WebUI

Deploying Open WebUI as the user-facing chat interface, integrated with LiteLLM, vLLM, RAG, and persistence services.

Chapter: 12 – Open WebUI


13. Upgrading applications

How to upgrade platform components and applications using the vendor submodule and local overrides.

Chapter: 13 – Upgrades


14. Talos & Kubernetes Upgrades

Procedure for upgrading Talos OS, the talosctl client, and the Kubernetes version, including NVIDIA system extensions.

Chapter: 16 – Talos & Kubernetes Upgrades

15. Troubleshooting and FAQ

Common failure scenarios related to Talos, Kubernetes, Cilium, Rook Ceph, and WSL2, along with recovery procedures.

Chapter: 90 – Troubleshooting


Scope and intent

This documentation supports:

  • Operation and maintenance of a local single-node Kubernetes cluster for AI workloads
  • Internal OS2 projects including OS2AI
  • Reproducible infrastructure based on scripted workflows and declarative configuration
  • A clear separation between:
    • Test / development environments (single-node)
    • Future production-grade platforms (multi-node, HA)

Target audience

This documentation is intended for:

  • IT operations staff
  • System administrators
  • Developers responsible for OS2AI and related platforms
  • Future municipal technical operations teams

Prerequisites

Readers are expected to have basic knowledge of:

  • Linux and WSL2
  • Kubernetes fundamentals
  • YAML configuration files
  • Basic networking concepts
  • Helm-based application deployment

Talos-specific concepts and workflows are explained where relevant throughout the documentation.

