
Operating on Secure Agent Pod
This is the operational companion to Secure Agent Pod — Hardening an AI Coding Workstation. That post explains the architecture and security model. This one is the day-to-day runbook.
What “Healthy” Looks Like
A healthy secure-agent-pod has:
- One pod running (`1/1 Ready`) on gpu-1
- Three processes inside: sshd, supercronic, vibe-kanban
- SSH accessible at `192.168.55.215:22`
- VibeKanban UI accessible at `192.168.55.218:8081`
- All running as UID 1000 (`claude`), no root
Observing State
Pod Health
```shell
# Pod status
kubectl -n secure-agent-pod get pods -o wide

# Detailed events and conditions
kubectl -n secure-agent-pod describe pod -l app=secure-agent-pod

# Container identity
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- id
# Expected: uid=1000(claude) gid=1000(claude) groups=1000(claude)
```
Process Health
All three processes should be running inside the container:
```shell
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- ps aux
```
Expected output:
```
USER     PID  COMMAND
claude   1    /bin/bash /entrypoint.sh
claude   12   sshd: /usr/sbin/sshd -f /opt/sshd_config -D [listener]
claude   13   supercronic /home/claude/.crontab
claude   14   node /usr/bin/vibe-kanban
claude   33   /home/claude/.vibe-kanban/bin/.../vibe-kanban
```
If any process is missing, the pod will restart via `wait -n`; check the restart count.
A live listing with agent sessions active looks more like:
```
$ kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- ps -eo pid,user,etime,cmd | head -12
  PID USER     ELAPSED CMD
    1 claude  06:51:00 /usr/bin/tini -- /entrypoint.sh
    7 claude  06:51:00 /bin/bash /entrypoint.sh
   61 claude  06:51:00 sshd: /usr/sbin/sshd -f /opt/sshd_config -D [listener]
   62 claude  06:51:00 supercronic /home/claude/.crontab
   87 claude  06:50:55 claude remote-control --name willikins
  114 claude  06:50:52 /home/claude/.local/share/claude/versions/2.1.114 --print ...
  139 claude  06:50:52 npm exec @upstash/context7-mcp
  140 claude  06:50:52 npm exec @playwright/mcp@latest
  500 claude  06:50:52 /home/claude/.vscode-server/.../server/out/server-main.js ...
# ... +30 more child processes (mcp servers, vscode workers, editor terminals)
```
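The restart-on-exit behavior comes from the entrypoint using bash's `wait -n`, which returns as soon as the first background job finishes. A runnable sketch of the pattern, with `sleep` commands standing in for the real services (an illustration of the technique, not the actual entrypoint):

```shell
#!/bin/bash
# Supervision sketch: start every service in the background, then block
# until the FIRST one exits. When that happens the script ends, the
# container stops, and Kubernetes restarts the pod.
sleep 10 &            # stand-in for sshd
sleep 10 &            # stand-in for supercronic
(sleep 1; exit 3) &   # stand-in for a service that crashes after 1s

first_status=0
wait -n || first_status=$?   # returns as soon as any background job exits
echo "first child exited with status $first_status"
# prints: first child exited with status 3
```

In the real pod, any of sshd, supercronic, or vibe-kanban dying plays the role of the crashing job; a climbing restart count on the pod is the visible symptom.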
Services and Networking
```shell
# Verify LoadBalancer IPs
kubectl -n secure-agent-pod get svc

# SSH connectivity
ssh -o ConnectTimeout=5 claude@192.168.55.215 echo "SSH works"

# VibeKanban health
curl -s -o /dev/null -w "%{http_code}" http://192.168.55.218:8081
# Expected: 200
```
ArgoCD Sync Status
```shell
argocd app get secure-agent-pod --port-forward --port-forward-namespace argocd
```
SSH Access
Connecting
```shell
# Standard SSH
ssh claude@192.168.55.215

# With a specific key
ssh -i ~/.ssh/id_rsa claude@192.168.55.215
```
The Service maps external port 22 → internal port 2222 (non-root sshd). SSH clients don’t need to specify a port.
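A convenience `~/.ssh/config` entry saves retyping the IP and key path (the `agent-pod` alias is illustrative, not something the runbook defines):

```
Host agent-pod
    HostName 192.168.55.215
    User claude
    IdentityFile ~/.ssh/id_rsa
```

After that, `ssh agent-pod` connects directly.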
Updating Authorized Keys
The SSH authorized keys come from a Kubernetes Secret:
```shell
# View current keys
kubectl get secret agent-ssh-keys -n secure-agent-pod -o jsonpath='{.data.authorized_keys}' | base64 -d

# Replace with a new key
kubectl create secret generic agent-ssh-keys \
  --namespace=secure-agent-pod \
  --from-file=authorized_keys=~/.ssh/id_rsa.pub \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pod to pick up the new key
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod
```
The entrypoint copies `authorized_keys` from the Secret mount to `~/.ssh/authorized_keys` on each boot.
SSH Host Key Changed Warning
If you see “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED”, the PVC was recreated (host keys regenerated):
```shell
ssh-keygen -R 192.168.55.215
ssh claude@192.168.55.215
```
VibeKanban
Accessing the UI
VibeKanban runs on port 8081, exposed via LoadBalancer at 192.168.55.218:
```
http://192.168.55.218:8081
```
Access via the Tailscale/Headscale mesh or direct LAN. First login uses VibeKanban’s built-in local auth.
Configuration
VibeKanban stores its SQLite database and binary cache on the PVC:
```
/home/claude/.vibe-kanban/   # Binary cache + SQLite DB
/home/claude/repos/          # Git workspaces managed by VibeKanban
```
Key environment variables (set in the Deployment manifest):
| Variable | Value | Purpose |
|---|---|---|
| `PORT` | `8081` | Fixed server port (default is random) |
| `HOST` | `0.0.0.0` | Listen on all interfaces (default is `127.0.0.1`) |
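In the Deployment manifest these look roughly like the fragment below (a sketch; only the container name and the two variables come from this runbook, the surrounding structure is standard Kubernetes):

```yaml
# apps/secure-agent-pod/manifests/deployment.yaml (fragment, sketch)
containers:
  - name: kali
    env:
      - name: PORT
        value: "8081"      # fixed VibeKanban port instead of a random one
      - name: HOST
        value: "0.0.0.0"   # listen beyond 127.0.0.1 so the Service can reach it
```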
Checking VibeKanban Logs
```shell
# Full pod logs (includes all three processes)
kubectl logs -n secure-agent-pod deploy/secure-agent-pod -c kali

# Follow logs
kubectl logs -n secure-agent-pod deploy/secure-agent-pod -c kali -f

# Filter for VibeKanban only
kubectl logs -n secure-agent-pod deploy/secure-agent-pod -c kali | grep -E "vibe-kanban|server|INFO|WARN|ERROR"
```
Secret Management
Tier 1: Infisical (ESO)
Currently no active Tier 1 secrets (Claude Code uses Max subscription login). When needed:
- Add the secret to Infisical
- Create/update the ExternalSecret manifest at `apps/secure-agent-pod/manifests/externalsecret.yaml`
- Commit and push; ArgoCD syncs, ESO creates the K8s Secret
- Restart the pod to pick up new env vars
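A minimal ExternalSecret sketch for that manifest (the store name and remote key path are placeholders, assuming a ClusterSecretStore backed by Infisical; only the Secret name `agent-secrets-tier1` comes from this runbook):

```yaml
# apps/secure-agent-pod/manifests/externalsecret.yaml (sketch)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: agent-secrets-tier1
  namespace: secure-agent-pod
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: infisical               # placeholder store name
  target:
    name: agent-secrets-tier1     # the K8s Secret ESO creates
  data:
    - secretKey: EXAMPLE_TOKEN    # placeholder key
      remoteRef:
        key: /secure-agent-pod/EXAMPLE_TOKEN
```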
Tier 2: Manual Secrets
# View current tier-2 secrets
kubectl get secret agent-secrets-tier2 -n secure-agent-pod -o jsonpath='{.data}' | python3 -c "import json,sys,base64; d=json.load(sys.stdin); [print(f'{k}: {base64.b64decode(v).decode()[:20]}...') for k,v in d.items()]"
# Update a secret value
kubectl patch secret agent-secrets-tier2 -n secure-agent-pod \
--type merge -p '{"stringData":{"GITHUB_TOKEN":"new-token-here"}}'
# Restart to pick up changes
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-podConfig Files (talosconfig, kubeconfig, omniconfig)
Mounted at `/home/claude/.kube/configs/`:
```shell
# Verify configs are mounted
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- ls -la /home/claude/.kube/configs/

# Rotate configs
sops --decrypt secrets/secure-agent-pod/agent-configs.yaml | kubectl apply -f -
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod
```
Pod Lifecycle
Restarting
```shell
# Standard restart (with Recreate strategy, the old pod terminates before the new one starts)
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod

# Force restart (immediate)
kubectl delete pod -l app=secure-agent-pod -n secure-agent-pod
```
The strategy is Recreate (an RWO PVC can only attach to one pod), so there is always brief downtime during a restart.
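The corresponding Deployment fragment looks roughly like this (a sketch of just the relevant fields):

```yaml
# apps/secure-agent-pod/manifests/deployment.yaml (fragment, sketch)
spec:
  replicas: 1
  strategy:
    type: Recreate   # RWO PVC: old pod must release the volume before the new one mounts it
```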
Checking PVC Data
```shell
# PVC status
kubectl get pvc -n secure-agent-pod

# What's on the PVC
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- du -sh /home/claude/*
```
Image Updates
The deployment uses `ghcr.io/derio-net/secure-agent-kali:latest`. To pick up a new image:
```shell
# Force pull latest
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod
```
For pinned SHA tags, update the `image:` field in `apps/secure-agent-pod/manifests/deployment.yaml` and push.
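A pinned reference in the Deployment would look something like the following (a sketch; the digest is a placeholder, not a real value):

```yaml
# apps/secure-agent-pod/manifests/deployment.yaml (fragment, sketch)
containers:
  - name: kali
    image: ghcr.io/derio-net/secure-agent-kali@sha256:<digest>
    imagePullPolicy: IfNotPresent   # a pinned reference never changes, so Always is unnecessary
```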
Cron Jobs
Cron is managed by supercronic reading /home/claude/.crontab. The crontab template is seeded from the image on first boot; after that it’s user-modifiable on the PVC.
Scripts live at /opt/scripts/ — baked into the image, immutable. They update when the secure-agent-kali image is rebuilt and the pod is restarted.
```shell
# View current crontab
cat ~/.crontab

# View available scripts
ls /opt/scripts/
# audit-digest.sh   guardrails-hook.py  push-heartbeat.sh
# exercise-cron.sh  notify-telegram.sh  session-manager.sh

# Edit crontab (supercronic picks up changes automatically)
vi ~/.crontab
```
Current Schedule
| Job | Schedule | Script |
|---|---|---|
| Session manager | Every 5 min | /opt/scripts/session-manager.sh |
| Self-update (git pull) | Daily 04:00 UTC | inline |
| Claude Code update | Weekly Sun 04:30 UTC | inline |
| Exercise reminders | 5x daily, Fri-Mon | /opt/scripts/exercise-cron.sh |
| Audit digest | Daily 21:00 UTC | /opt/scripts/audit-digest.sh |
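In supercronic's standard crontab syntax, the fixed-time entries above come out roughly as follows. This is illustrative only: the exercise-reminder times are not listed in this runbook and the inline commands are shown as placeholders, so treat `~/.crontab` on the PVC as the source of truth.

```
# Session manager: every 5 minutes
*/5 * * * * /opt/scripts/session-manager.sh

# Self-update: daily at 04:00 UTC (inline command, placeholder here)
0 4 * * * <inline git pull command>

# Claude Code update: weekly, Sunday 04:30 UTC (inline command, placeholder here)
30 4 * * 0 <inline update command>

# Audit digest: daily at 21:00 UTC
0 21 * * * /opt/scripts/audit-digest.sh
```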
Updating Scripts
Scripts at /opt/scripts/ are read-only (from the image). To update them:
- Commit changes to the `secure-agent-kali` repo
- GHA rebuilds and pushes the image to GHCR
- Restart the pod: `kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod`
The crontab on the PVC is independent — editing ~/.crontab takes effect immediately via supercronic’s file watcher.
Health Monitoring
Each cron script pushes a heartbeat metric to Prometheus Pushgateway after successful execution:
```shell
# Check current heartbeat state
curl -s http://pushgateway.monitoring.svc.cluster.local:9091/api/v1/metrics | \
python3 -c "
import json, sys
from datetime import datetime, timezone
for g in json.load(sys.stdin)['data']:
    job = g['labels'].get('job','?')
    ts = float(g['push_time_seconds']['metrics'][0]['value'])
    dt = datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%H:%M:%S UTC')
    print(f'{job:30s} {dt}')
"
```
Grafana alert rules fire when heartbeats go stale:
| Alert | Threshold | Severity |
|---|---|---|
| `exercise-reminder-stale` | 3 hours | critical |
| `audit-digest-stale` | 26 hours | warning |
| `session-manager-stale` | 10 minutes | critical |
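The staleness rule is simply "now minus last push exceeds the threshold". A hypothetical sketch of that arithmetic for session-manager (values hardcoded for illustration; the real evaluation happens in Grafana against `push_time_seconds`):

```shell
# Hypothetical illustration: is the session-manager heartbeat stale?
threshold=600                 # seconds; 10 minutes, from the table above
now=$(date +%s)
last_push=$((now - 900))      # pretend the last push was 15 minutes ago
age=$((now - last_push))

if [ "$age" -gt "$threshold" ]; then
  echo "stale: last push ${age}s ago (threshold ${threshold}s)"
else
  echo "fresh: last push ${age}s ago"
fi
# prints: stale: last push 900s ago (threshold 600s)
```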
Alerts are bridged to GitHub Issues via the health-bridge webhook — the Quartermaster (Willikins staff) tracks these on the “Derio Ops” project board.
Manually Pushing a Heartbeat
```shell
/opt/scripts/push-heartbeat.sh <job_name> [label=value ...]

# Example:
/opt/scripts/push-heartbeat.sh exercise_reminder context=desk
```
Cilium Egress Policy
Current status: temporarily disabled due to a Cilium 1.17 FQDN LRU bug.
The policy manifest is at apps/secure-agent-pod/cilium-egress.yaml.disabled. To re-enable:
```shell
# Move back to manifests directory
mv apps/secure-agent-pod/cilium-egress.yaml.disabled apps/secure-agent-pod/manifests/cilium-egress.yaml
git add -A && git commit -m "feat(agents): re-enable Cilium egress policy" && git push

# Verify policy status
kubectl get ciliumnetworkpolicy -n secure-agent-pod
# VALID column should be True
```
If the policy shows VALID: False with “LRU not yet initialized”:
```shell
# Restart Cilium agent on gpu-1
kubectl delete pod -n kube-system -l k8s-app=cilium --field-selector spec.nodeName=gpu-1

# Wait for agent restart, then delete and reapply
kubectl delete ciliumnetworkpolicy agent-egress -n secure-agent-pod
kubectl apply -f apps/secure-agent-pod/manifests/cilium-egress.yaml

# Delete the pod to clear stale BPF state
kubectl delete pod -l app=secure-agent-pod -n secure-agent-pod
```
Testing Egress (When Policy is Active)
```shell
# Should SUCCEED
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- \
  curl -s --connect-timeout 5 -o /dev/null -w "%{http_code}" https://api.anthropic.com/
# Expected: 404

# Should FAIL (blocked)
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -c kali -- \
  curl -s --connect-timeout 5 https://httpbin.org/ip
# Expected: timeout
```
Troubleshooting
CrashLoopBackOff
Check which process died:
```shell
kubectl logs -n secure-agent-pod deploy/secure-agent-pod -c kali --previous
```
Common causes:
- “Download failed”: VibeKanban can’t reach `npm-cdn.vibekanban.com` (Cilium blocking or DNS issue)
- “No such file or directory: /entrypoint.sh”: the image didn’t include the entrypoint (rebuild needed)
- sshd fails: check host key permissions (`chmod 600` on private keys, `chmod 700` on `.ssh-host-keys/`)
Pod Stuck in CreateContainerConfigError
A referenced Secret doesn’t exist:
```shell
kubectl describe pod -l app=secure-agent-pod -n secure-agent-pod | grep -A5 "Warning"
```
The `agent-secrets-tier1` and `agent-secrets-tier2` secretRefs are `optional: true`, so they won’t block. But `agent-ssh-keys` is required; create it if missing:
```shell
kubectl create secret generic agent-ssh-keys \
  --namespace=secure-agent-pod \
  --from-file=authorized_keys=~/.ssh/id_rsa.pub
```
Can’t SSH In
- Check the pod is running: `kubectl get pods -n secure-agent-pod`
- Check the sshd process: `kubectl exec ... -- ps aux | grep sshd`
- Check the service IP: `kubectl get svc -n secure-agent-pod` and verify `192.168.55.215` is assigned
- Check authorized_keys: `kubectl exec ... -- cat /home/claude/.ssh/authorized_keys`
- Check sshd logs: `kubectl logs ... | grep sshd`
VibeKanban Not Accessible
- Check the process: `kubectl exec ... -- pgrep -f vibe-kanban`
- Check the port: `kubectl exec ... -- curl -s http://127.0.0.1:8081` should return HTML
- Check the service: `kubectl get svc secure-agent-vibekanban -n secure-agent-pod`
- Check env vars: `kubectl exec ... -- env | grep -E "PORT|HOST"` should show `PORT=8081`, `HOST=0.0.0.0`
Env Vars Not Updated After Secret Change
Environment variables are set at container start. After changing a Secret:
```shell
kubectl rollout restart deployment/secure-agent-pod -n secure-agent-pod
```