
Operating on Secure Agent Pod
This is the operational companion to Secure Agent Pod — Hardening an AI Coding Workstation. That post explains the architecture and security model. This one is the day-to-day runbook.
What “Healthy” Looks Like
A healthy secure-agent-pod has:
- One pod running (`1/1 Ready`) on gpu-1
- Three processes inside: sshd, supercronic, vibe-kanban
- SSH accessible at `192.168.55.215:22`
- VibeKanban UI accessible at `192.168.55.218:8081`
- All running as UID 1000 (`claude`), no root
Observing State
Pod Health
```shell
# Pod status
kubectl -n secure-agent-pod get pods -o wide

# Detailed events and conditions
kubectl -n secure-agent-pod describe pod -l app=secure-agent-pod

# Container identity
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- id
# Expected: uid=1000(claude) gid=1000(claude) groups=1000(claude)
```
Process Health
All three processes should be running inside the container:
```shell
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- ps aux
```
Expected output:
```
USER     PID  COMMAND
claude   1    /bin/bash /entrypoint.sh
claude   12   sshd: /usr/sbin/sshd -f /opt/sshd_config -D [listener]
claude   13   supercronic /home/claude/.crontab
claude   14   node /usr/bin/vibe-kanban
claude   33   /home/claude/.vibe-kanban/bin/.../vibe-kanban
```
If any process is missing, the pod will restart via `wait -n`; check the restart count.
A live listing with agent sessions active looks more like:
```
$ kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- ps -eo pid,user,etime,cmd | head -12
  PID USER     ELAPSED CMD
    1 claude  06:51:00 /usr/bin/tini -- /entrypoint.sh
    7 claude  06:51:00 /bin/bash /entrypoint.sh
   61 claude  06:51:00 sshd: /usr/sbin/sshd -f /opt/sshd_config -D [listener]
   62 claude  06:51:00 supercronic /home/claude/.crontab
   87 claude  06:50:55 claude remote-control --name willikins
  114 claude  06:50:52 /home/claude/.local/share/claude/versions/2.1.114 --print ...
  139 claude  06:50:52 npm exec @upstash/context7-mcp
  140 claude  06:50:52 npm exec @playwright/mcp@latest
  500 claude  06:50:52 /home/claude/.vscode-server/.../server/out/server-main.js ...
# ... +30 more child processes (mcp servers, vscode workers, editor terminals)
```
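The restart-on-exit behavior comes from the entrypoint using bash's `wait -n`, which returns as soon as the first background job finishes. A runnable sketch of the pattern, with `sleep` commands standing in for the real services (an illustration of the technique, not the actual entrypoint):

```shell
#!/bin/bash
# Supervision sketch: start every service in the background, then block
# until the FIRST one exits. When that happens the script ends, the
# container stops, and Kubernetes restarts the pod.
sleep 10 &            # stand-in for sshd
sleep 10 &            # stand-in for supercronic
(sleep 1; exit 3) &   # stand-in for a service that crashes after 1s

first_status=0
wait -n || first_status=$?   # returns as soon as any background job exits
echo "first child exited with status $first_status"
# prints: first child exited with status 3
```

In the real pod, any of sshd, supercronic, or vibe-kanban dying plays the role of the crashing job; a climbing restart count on the pod is the visible symptom.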
Services and Networking
```shell
# Verify LoadBalancer IPs
kubectl -n secure-agent-pod get svc

# SSH connectivity
ssh -o ConnectTimeout=5 claude@192.168.55.215 echo "SSH works"

# VibeKanban health
curl -s -o /dev/null -w "%{http_code}" http://192.168.55.218:8081
# Expected: 200
```
ArgoCD Sync Status
```shell
argocd app get secure-agent-pod --port-forward --port-forward-namespace argocd
```
SSH Access
Connecting
```shell
# Standard SSH
ssh claude@192.168.55.215

# With a specific key
ssh -i ~/.ssh/id_rsa claude@192.168.55.215
```
The Service maps external port 22 → internal port 2222 (non-root sshd). SSH clients don’t need to specify a port.
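A convenience `~/.ssh/config` entry saves retyping the IP and key path (the `agent-pod` alias is illustrative, not something the runbook defines):

```
Host agent-pod
    HostName 192.168.55.215
    User claude
    IdentityFile ~/.ssh/id_rsa
```

After that, `ssh agent-pod` connects directly.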
Updating Authorized Keys
The SSH authorized keys come from a Kubernetes Secret:
```shell
# View current keys
kubectl get secret agent-ssh-keys -n secure-agent-pod -o jsonpath='{.data.authorized_keys}' | base64 -d

# Replace with a new key
kubectl create secret generic agent-ssh-keys \
  --namespace=secure-agent-pod \
  --from-file=authorized_keys=~/.ssh/id_rsa.pub \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pod to pick up the new key
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod
```
The entrypoint copies `authorized_keys` from the Secret mount to `~/.ssh/authorized_keys` on each boot.
SSH Host Key Changed Warning
If you see “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED”, the PVC was recreated (host keys regenerated):
```shell
ssh-keygen -R 192.168.55.215
ssh claude@192.168.55.215
```
VibeKanban
Accessing the UI
VibeKanban runs on port 8081, exposed via LoadBalancer at 192.168.55.218:
```
http://192.168.55.218:8081
```
Access via the Tailscale/Headscale mesh or direct LAN. First login uses VibeKanban’s built-in local auth.
Configuration
VibeKanban stores its SQLite database and binary cache on the PVC:
```
/home/claude/.vibe-kanban/   # Binary cache + SQLite DB
/home/claude/repos/          # Git workspaces managed by VibeKanban
```
Key environment variables (set in the Deployment manifest):
| Variable | Value | Purpose |
|---|---|---|
| `PORT` | `8081` | Fixed server port (default is random) |
| `HOST` | `0.0.0.0` | Listen on all interfaces (default is `127.0.0.1`) |
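In the Deployment manifest these look roughly like the fragment below (a sketch; only the container name and the two variables come from this runbook, the surrounding structure is standard Kubernetes):

```yaml
# apps/secure-agent-pod/manifests/deployment.yaml (fragment, sketch)
containers:
  - name: kali
    env:
      - name: PORT
        value: "8081"      # fixed VibeKanban port instead of a random one
      - name: HOST
        value: "0.0.0.0"   # listen beyond 127.0.0.1 so the Service can reach it
```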
Checking VibeKanban Logs
```shell
# Full pod logs (includes all three processes)
kubectl logs -n secure-agent-pod deploy/secure-agent-pod -c kali

# Follow logs
kubectl logs -n secure-agent-pod deploy/secure-agent-pod -c kali -f

# Filter for VibeKanban only
kubectl logs -n secure-agent-pod deploy/secure-agent-pod -c kali | grep -E "vibe-kanban|server|INFO|WARN|ERROR"
```
Secret Management
Tier 1: Infisical (ESO)
Currently no active Tier 1 secrets (Claude Code uses Max subscription login). When needed:
- Add the secret to Infisical
- Create/update the ExternalSecret manifest at `apps/secure-agent-pod/manifests/externalsecret.yaml`
- Commit and push; ArgoCD syncs, ESO creates the K8s Secret
- Restart the pod to pick up new env vars
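A minimal ExternalSecret sketch for that manifest (the store name and remote key path are placeholders, assuming a ClusterSecretStore backed by Infisical; only the Secret name `agent-secrets-tier1` comes from this runbook):

```yaml
# apps/secure-agent-pod/manifests/externalsecret.yaml (sketch)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: agent-secrets-tier1
  namespace: secure-agent-pod
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: infisical               # placeholder store name
  target:
    name: agent-secrets-tier1     # the K8s Secret ESO creates
  data:
    - secretKey: EXAMPLE_TOKEN    # placeholder key
      remoteRef:
        key: /secure-agent-pod/EXAMPLE_TOKEN
```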
Tier 2: Manual Secrets
# View current tier-2 secrets
kubectl get secret agent-secrets-tier2 -n secure-agent-pod -o jsonpath='{.data}' | python3 -c "import json,sys,base64; d=json.load(sys.stdin); [print(f'{k}: {base64.b64decode(v).decode()[:20]}...') for k,v in d.items()]"
# Update a secret value
kubectl patch secret agent-secrets-tier2 -n secure-agent-pod \
--type merge -p '{"stringData":{"GITHUB_TOKEN":"new-token-here"}}'
# Restart to pick up changes
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-podConfig Files (talosconfig, kubeconfig, omniconfig)
Mounted at `/home/claude/.kube/configs/`:
```shell
# Verify configs are mounted
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- ls -la /home/claude/.kube/configs/

# Rotate configs
sops --decrypt secrets/secure-agent-pod/agent-configs.yaml | kubectl apply -f -
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod
```
Pod Lifecycle
Restarting
```shell
# Standard restart (with Recreate strategy, the old pod terminates before the new one starts)
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod

# Force restart (immediate)
kubectl delete pod -l app=secure-agent-pod -n secure-agent-pod
```
The strategy is Recreate (an RWO PVC can only attach to one pod), so there is always brief downtime during a restart.
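The corresponding Deployment fragment looks roughly like this (a sketch of just the relevant fields):

```yaml
# apps/secure-agent-pod/manifests/deployment.yaml (fragment, sketch)
spec:
  replicas: 1
  strategy:
    type: Recreate   # RWO PVC: old pod must release the volume before the new one mounts it
```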
Checking PVC Data
```shell
# PVC status
kubectl get pvc -n secure-agent-pod

# What's on the PVC
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- du -sh /home/claude/*
```
Image Updates
The deployment uses `ghcr.io/derio-net/secure-agent-kali:latest`. To pick up a new image:
```shell
# Force pull latest
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod
```
For pinned SHA tags, update the `image:` field in `apps/secure-agent-pod/manifests/deployment.yaml` and push.
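A pinned reference in the Deployment would look something like the following (a sketch; the digest is a placeholder, not a real value):

```yaml
# apps/secure-agent-pod/manifests/deployment.yaml (fragment, sketch)
containers:
  - name: kali
    image: ghcr.io/derio-net/secure-agent-kali@sha256:<digest>
    imagePullPolicy: IfNotPresent   # a pinned reference never changes, so Always is unnecessary
```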
Cron Jobs
Cron is managed by supercronic reading /home/claude/.crontab. The crontab template is seeded from the image on first boot; after that it’s user-modifiable on the PVC.
Scripts live at /opt/scripts/ — baked into the image, immutable. They update when the secure-agent-kali image is rebuilt and the pod is restarted.
```shell
# View current crontab
cat ~/.crontab

# View available scripts
ls /opt/scripts/
# audit-digest.sh   guardrails-hook.py  push-heartbeat.sh
# exercise-cron.sh  notify-telegram.sh  session-manager.sh

# Edit crontab (supercronic picks up changes automatically)
vi ~/.crontab
```
Current Schedule
| Job | Schedule | Script |
|---|---|---|
| Session manager | Every 5 min | /opt/scripts/session-manager.sh |
| Self-update (git pull) | Daily 04:00 UTC | inline |
| Claude Code update | Weekly Sun 04:30 UTC | inline |
| Exercise reminders | 5x daily, Fri-Mon | /opt/scripts/exercise-cron.sh |
| Audit digest | Daily 21:00 UTC | /opt/scripts/audit-digest.sh |
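In supercronic's standard crontab syntax, the fixed-time entries above come out roughly as follows. This is illustrative only: the exercise-reminder times are not listed in this runbook and the inline commands are shown as placeholders, so treat `~/.crontab` on the PVC as the source of truth.

```
# Session manager: every 5 minutes
*/5 * * * * /opt/scripts/session-manager.sh

# Self-update: daily at 04:00 UTC (inline command, placeholder here)
0 4 * * * <inline git pull command>

# Claude Code update: weekly, Sunday 04:30 UTC (inline command, placeholder here)
30 4 * * 0 <inline update command>

# Audit digest: daily at 21:00 UTC
0 21 * * * /opt/scripts/audit-digest.sh
```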
Updating Scripts
Scripts at /opt/scripts/ are read-only (from the image). To update them:
- Commit changes to the `secure-agent-kali` repo
- GHA rebuilds and pushes the image to GHCR
- Restart the pod: `kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod`
The crontab on the PVC is independent — editing ~/.crontab takes effect immediately via supercronic’s file watcher.
Health Monitoring
Each cron script pushes a heartbeat metric to Prometheus Pushgateway after successful execution:
```shell
# Check current heartbeat state
curl -s http://pushgateway.monitoring.svc.cluster.local:9091/api/v1/metrics | \
python3 -c "
import json, sys
from datetime import datetime, timezone
for g in json.load(sys.stdin)['data']:
    job = g['labels'].get('job','?')
    ts = float(g['push_time_seconds']['metrics'][0]['value'])
    dt = datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%H:%M:%S UTC')
    print(f'{job:30s} {dt}')
"
```
Grafana alert rules fire when heartbeats go stale:
| Alert | Threshold | Severity |
|---|---|---|
| `exercise-reminder-stale` | 3 hours | critical |
| `audit-digest-stale` | 26 hours | warning |
| `session-manager-stale` | 10 minutes | critical |
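The staleness rule is simply "now minus last push exceeds the threshold". A hypothetical sketch of that arithmetic for session-manager (values hardcoded for illustration; the real evaluation happens in Grafana against `push_time_seconds`):

```shell
# Hypothetical illustration: is the session-manager heartbeat stale?
threshold=600                 # seconds; 10 minutes, from the table above
now=$(date +%s)
last_push=$((now - 900))      # pretend the last push was 15 minutes ago
age=$((now - last_push))

if [ "$age" -gt "$threshold" ]; then
  echo "stale: last push ${age}s ago (threshold ${threshold}s)"
else
  echo "fresh: last push ${age}s ago"
fi
# prints: stale: last push 900s ago (threshold 600s)
```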
Alerts are bridged to GitHub Issues via the health-bridge webhook — the Quartermaster (Willikins staff) tracks these on the “Derio Ops” project board.
Manually Pushing a Heartbeat
```shell
/opt/scripts/push-heartbeat.sh <job_name> [label=value ...]

# Example:
/opt/scripts/push-heartbeat.sh exercise_reminder context=desk
```
Cilium Egress Policy
Current status: temporarily disabled due to a Cilium 1.17 FQDN LRU bug.
The policy manifest is at apps/secure-agent-pod/cilium-egress.yaml.disabled. To re-enable:
```shell
# Move back to manifests directory
mv apps/secure-agent-pod/cilium-egress.yaml.disabled apps/secure-agent-pod/manifests/cilium-egress.yaml
git add -A && git commit -m "feat(agents): re-enable Cilium egress policy" && git push

# Verify policy status
kubectl get ciliumnetworkpolicy -n secure-agent-pod
# VALID column should be True
```
If the policy shows VALID: False with “LRU not yet initialized”:
```shell
# Restart Cilium agent on gpu-1
kubectl delete pod -n kube-system -l k8s-app=cilium --field-selector spec.nodeName=gpu-1

# Wait for agent restart, then delete and reapply
kubectl delete ciliumnetworkpolicy agent-egress -n secure-agent-pod
kubectl apply -f apps/secure-agent-pod/manifests/cilium-egress.yaml

# Delete the pod to clear stale BPF state
kubectl delete pod -l app=secure-agent-pod -n secure-agent-pod
```
Testing Egress (When Policy is Active)
```shell
# Should SUCCEED
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- \
  curl -s --connect-timeout 5 -o /dev/null -w "%{http_code}" https://api.anthropic.com/
# Expected: 404

# Should FAIL (blocked)
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- \
  curl -s --connect-timeout 5 https://httpbin.org/ip
# Expected: timeout
```
Troubleshooting
CrashLoopBackOff
Check which process died:
```shell
kubectl logs -n secure-agent-pod deploy/secure-agent-pod -c kali --previous
```
Common causes:
- “Download failed”: VibeKanban can’t reach `npm-cdn.vibekanban.com` (Cilium blocking or DNS issue)
- “No such file or directory: /entrypoint.sh”: the image didn’t include the entrypoint (rebuild needed)
- sshd fails: check host key permissions (`chmod 600` on private keys, `chmod 700` on `.ssh-host-keys/`)
Pod Stuck in CreateContainerConfigError
A referenced Secret doesn’t exist:
```shell
kubectl describe pod -l app=secure-agent-pod -n secure-agent-pod | grep -A5 "Warning"
```
The `agent-secrets-tier1` and `agent-secrets-tier2` secretRefs are `optional: true`, so they won’t block. But `agent-ssh-keys` is required; create it if missing:
```shell
kubectl create secret generic agent-ssh-keys \
  --namespace=secure-agent-pod \
  --from-file=authorized_keys=~/.ssh/id_rsa.pub
```
Can’t SSH In
- Check the pod is running: `kubectl get pods -n secure-agent-pod`
- Check the sshd process: `kubectl exec ... -- ps aux | grep sshd`
- Check the service IP: `kubectl get svc -n secure-agent-pod` and verify `192.168.55.215` is assigned
- Check authorized_keys: `kubectl exec ... -- cat /home/claude/.ssh/authorized_keys`
- Check sshd logs: `kubectl logs ... | grep sshd`
VibeKanban Not Accessible
- Check the process: `kubectl exec ... -- pgrep -f vibe-kanban`
- Check the port: `kubectl exec ... -- curl -s http://127.0.0.1:8081` should return HTML
- Check the service: `kubectl get svc secure-agent-vibekanban -n secure-agent-pod`
- Check env vars: `kubectl exec ... -- env | grep -E "PORT|HOST"` should show `PORT=8081`, `HOST=0.0.0.0`
Env Vars Not Updated After Secret Change
Environment variables are set at container start. After changing a Secret:
```shell
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod
```