Why Build a Kubernetes Homelab?

Why?

Two reasons drove me to build this cluster.

Reason 1: Learning by Doing

Cloud-managed Kubernetes (EKS, GKE) abstracts away the parts I wanted to understand: CNI networking, storage orchestration, GPU scheduling, immutable OS operation, and GitOps at the infrastructure layer. You can read about eBPF kube-proxy replacement or DRA-based GPU sharing all day — or you can break it, fix it, and actually learn it.

The goal was never “run a production cluster at home.” It was to build one that could be production, so the skills transfer directly.

Reason 2: Self-hosted Infrastructure

As a solo builder, I want self-hosted infrastructure for:

  • AI/ML workloads — local inference with GPUs, fine-tuning, experiments
  • Self-hosted services — things I’d otherwise pay SaaS for
  • Product prototyping — test deployments before going to cloud

The hardware was already sitting around. The cluster turns idle machines into a platform.

The Hardware

The cluster spans 4 zones of heterogeneous hardware:

Zone A: Management

  • raspi-omni (Raspberry Pi 5, 8GB) — Runs Sidero Omni, Authentik SSO, Traefik. The management plane lives outside the cluster.

Zone B: Core HA

  • mini-1, mini-2, mini-3 (ASUS NUC, Intel Ultra 5 225H, 64GB RAM, 1TB NVMe) — Three identical nodes forming the HA control plane. Each has an Intel Arc iGPU for future media/AI workloads.

Zone C: AI Compute

  • gpu-1 (Custom desktop, i9, 128GB RAM, RTX 5070, 2x 4TB SSD) — The heavy lifter. GPU-local storage via Longhorn. Tainted for GPU-only workloads.

Zone D: Edge

  • pc-1 (Legacy desktop, 64GB SSD + 3x HDD) — General purpose worker.
  • raspi-1, raspi-2 (Raspberry Pi 4, 32GB SD) — Low-power edge nodes.
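The GPU-only taint on gpu-1 pairs with a matching toleration on workloads that should land there. A minimal sketch — the taint key, value, and image below are illustrative assumptions, not the cluster's actual configuration:

```yaml
# Hypothetical taint on gpu-1 (key/value are illustrative):
#   kubectl taint nodes gpu-1 nvidia.com/gpu=present:NoSchedule
#
# A Pod destined for gpu-1 then carries a matching toleration
# and pins itself to the node by hostname:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    kubernetes.io/hostname: gpu-1
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1    # served by the NVIDIA GPU Operator
```

With the taint in place, ordinary Pods are repelled from gpu-1 while anything carrying the toleration can schedule there.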

Architecture

Figure: Omni cluster dashboard showing CPU, pods, memory, and node status for the frank cluster.

The cluster uses a two-layer management model:

  • Layer 1 (Machine Config): Sidero Omni manages Talos Linux machine configurations — OS extensions, kernel modules, disk mounts, network settings. Applied via omnictl.
  • Layer 2 (Workloads): ArgoCD manages everything running on Kubernetes — CNI, storage, GPU drivers, applications. GitOps via the same repo you’re reading.

This separation means Omni never touches workloads, and ArgoCD never touches machine config. Clean boundaries, no conflicts.
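As a sketch of that split, a Talos machine-config fragment (Layer 1) and an ArgoCD Application (Layer 2) might look like the following — the sysctl value, repo URL, and paths are placeholders, not the cluster's real settings:

```yaml
# Layer 1 — Talos machine config, applied via omnictl.
# Illustrative fragment: a kernel sysctl tweak at the OS layer.
machine:
  sysctls:
    vm.nr_hugepages: "1024"
---
# Layer 2 — ArgoCD Application, committed to the GitOps repo.
# repoURL and path are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: longhorn
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/frank.git
    targetRevision: main
    path: apps/longhorn
  destination:
    server: https://kubernetes.default.svc
    namespace: longhorn-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Nothing in the first document is visible to ArgoCD, and nothing in the second is visible to Omni — which is exactly why the two layers cannot conflict.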

What’s Next

The rest of this series walks through each capability layer:

1. Hardware — 7 Nodes, 3 Zones: 3x Intel NUC (Core zone), 1x GPU tower (RTX 5070), 1x legacy desktop, 2x Raspberry Pi 4. (x86_64, arm64, heterogeneous)
2. OS & Bootstrap: Talos Linux (immutable), Sidero Omni (lifecycle), declarative machine config, rolling upgrades. (no SSH, API-driven, reproducible)
3. Networking — Cilium CNI: eBPF kube-proxy replacement, L2 LoadBalancer (ARP), Hubble observability, network policy. (eBPF, 192.168.55.200-254)
4. Storage — Longhorn: distributed 3-replica block storage, GPU-local StorageClass, 2x 4TB SSD on gpu-1, iSCSI via Talos extensions. (strict-local, best-effort, all 7 nodes)
5. GPU Compute: NVIDIA GPU Operator (RTX 5070), Intel DRA driver (Arc iGPU), Dynamic Resource Allocation, CDI device injection. (K8s 1.35, DRA, ResourceClaim, DeviceClass)
6. GitOps — ArgoCD: App-of-Apps pattern, multi-source Applications, self-healing + drift detection, zero-downtime adoption. (single repo, annotation tracking)
7. Fun Stuff: OpenRGB via USB HID, DaemonSet + ConfigMap, custom container build (GitHub Actions), IT5701 firmware lock (in progress). (completely unnecessary, fans still rainbow)
8. Observability: VictoriaMetrics (metrics + alerts), VictoriaLogs (log aggregation), Grafana dashboards, Fluent Bit log shipping, Blackbox Exporter (endpoint probes), Pushgateway (heartbeat ingestion), Telegram alerting, Health Bridge (GitHub lifecycle). (VMSingle, Alertmanager, Feature Health, health-bridge, 192.168.55.203)
9. Backup: Longhorn → Cloudflare R2, daily + weekly recurring jobs, SOPS-encrypted credentials, NAS target (pending Longhorn 1.13). (S3-compatible, 7-day RPO)
10. Secrets Management: Infisical (self-hosted vault), External Secrets Operator, ClusterSecretStore, ExternalSecret → K8s Secret. (audit trail, Universal Auth, 192.168.55.204)
11. Local Inference: Ollama (gpu-1, RTX 5070), LiteLLM (unified gateway), OpenRouter (free cloud models), OpenAI-compatible API. (ollama, litellm, 192.168.55.206)
12. Agentic Control Plane: Sympozium (K8s-native agents), n8n (per-user workflow automation), VK Remote (self-hosted kanban API), ElectricSQL real-time sync. (agent=Pod, n8n, vibekanban, 192.168.55.207, 192.168.55.216)
13. Unified Auth: Authentik IdP (OIDC + proxy), SSO for ArgoCD, Grafana, and Infisical, forward auth for Longhorn, Hubble, and Sympozium, OIDC-backed kubectl via apiserver. (OIDC, forward-auth, 192.168.55.211)
14. Multi-tenancy: vCluster (K8s-in-K8s), disposable experiment clusters, resource quotas + network policies, GitOps-provisioned via ArgoCD. (vcluster, multi-tenant, SQLite)
15. AI Agent Orchestrator: Paperclip (org-chart agents), virtual companies + budgets, delegation chains + governance, LiteLLM gateway integration. (paperclip, company model, 192.168.55.212)
16. Media Generation: ComfyUI (diffusion models), LTX-2.3 video, SDXL image, Stable Audio, GPU Switcher dashboard (Go), time-sharing via replica scaling. (comfyui, gpu-switcher, 192.168.55.213)
17. Public Edge — Hop: Hetzner CX23 (single-node Talos), Headscale mesh + Tailscale, Caddy reverse proxy + TLS, split-DNS (MagicDNS). (edge, WireGuard, blog.derio.net)
18. Persistent Agent: Kali Linux workstation, always-on Claude Code agent, SSH remote access, 50Gi persistent /root. (kali, claude --remote, 192.168.55.215)
19. Progressive Delivery: Argo Rollouts controller, LiteLLM canary (Cilium traffic split), Sympozium blue-green, VictoriaMetrics analysis gates. (canary, blue-green, workloadRef)
21. Secure Agent Pod: hardened non-root Kali container, Cilium egress allowlist, VibeKanban agent orchestration, VK Relay (WebSocket tunnel to browser). (security, vibekanban, relay, 192.168.55.215)
24. In-Cluster Ingress: Traefik v3 on raspi edge nodes, wildcard TLS (*.cluster.derio.net), Authentik forward-auth (12 services), Homepage dashboard. (traefik, acme, 192.168.55.220)
25. CI/CD Platform: Gitea (GitHub mirror forge), Tekton Pipelines + Triggers, Zot OCI registry (cosign signed), webhook-driven CI on pc-1. (gitea, tekton, zot, 192.168.55.209)
26. Agent Images and the VK-Local Sidecar: agent-images repo (shared base + children), matrix CI with cross-repo repository_dispatch, VK-local sidecar (shared /home/claude PVC), lockstep bumper PR in frank. (docker, github-actions, sidecar)
Virtual Machines (upcoming): KubeVirt (VMs as pods), CDI disk image import, KubeVirt Manager UI, Longhorn-backed DataVolumes. (KVM, 192.168.55.205)

Let’s start building.

References

  • Talos Linux — Immutable, secure, minimal Kubernetes OS
  • Sidero Omni — SaaS-simple Kubernetes cluster management for Talos Linux
  • Kubernetes — Production-grade container orchestration
  • ArgoCD — Declarative GitOps continuous delivery for Kubernetes
  • Cilium — eBPF-based networking, observability, and security for Kubernetes
  • Longhorn — Cloud-native distributed block storage for Kubernetes
  • NVIDIA GPU Operator — Automated GPU management in Kubernetes
  • Intel Resource Drivers for Kubernetes — DRA-based resource drivers for Intel GPUs
  • eBPF — Technology for programmable networking, observability, and security in the Linux kernel