Operating on Health Bridge

Companion to Health Bridge — Closing the Loop from Grafana Alerts to GitHub Issues.

Quick Reference

| Component | Namespace | Port | Purpose |
| --- | --- | --- | --- |
| health-bridge | monitoring | 8080 | Grafana webhook → GitHub lifecycle updates |

| Endpoint | Path |
| --- | --- |
| Webhook | POST /webhook (Bearer auth) |
| Health check | GET /healthz |
| Readiness check | GET /readyz |
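From inside the cluster, the two probe endpoints can be exercised directly. A minimal sketch, assuming the in-cluster service DNS name shown above; the response bodies are not documented here, so only the HTTP status is checked:

```shell
# Probe liveness and readiness from inside the cluster;
# -f makes curl exit non-zero on HTTP errors (e.g. 503 while not ready).
curl -sf http://health-bridge.monitoring.svc.cluster.local:8080/healthz && echo "healthz OK"
curl -sf http://health-bridge.monitoring.svc.cluster.local:8080/readyz && echo "readyz OK"
```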

Checking Service Status

# Pod status
kubectl get pods -n monitoring -l app=health-bridge

# Recent logs
kubectl logs -n monitoring -l app=health-bridge --tail=20

# Check readiness (should show project metadata loaded)
kubectl logs -n monitoring -l app=health-bridge | grep "Loaded project metadata"
# Expected: Loaded project metadata: id=..., field=..., 10 lifecycle states

Testing the Webhook

# Get webhook secret
WEBHOOK_SECRET=$(kubectl get secret -n monitoring health-bridge-secrets \
  -o jsonpath='{.data.WEBHOOK_SECRET}' | base64 -d)

# Send a test alert (warning → degraded)
# Note: the cluster-local URL below only resolves in-cluster; use a port-forward otherwise
curl -s -X POST http://health-bridge.monitoring.svc.cluster.local:8080/webhook \
  -H "Authorization: Bearer $WEBHOOK_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "firing",
    "alerts": [{
      "status": "firing",
      "labels": {"alertname": "test-bridge", "severity": "warning", "github_issue": "willikins#11"},
      "annotations": {"summary": "Manual test alert"},
      "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
    }]
  }'
# Expected: {"processed": 1, "total": 1}

# Send a resolved alert to restore healthy state
curl -s -X POST http://health-bridge.monitoring.svc.cluster.local:8080/webhook \
  -H "Authorization: Bearer $WEBHOOK_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "resolved",
    "alerts": [{
      "status": "resolved",
      "labels": {"alertname": "test-bridge", "severity": "warning", "github_issue": "willikins#11"},
      "annotations": {"summary": "Manual test resolved"},
      "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'",
      "endsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
    }]
  }'
# Expected: {"processed": 1, "total": 1}

Checking ExternalSecret Sync

# Verify secrets are synced from Infisical
kubectl get externalsecret -n monitoring health-bridge-secrets
# Expected: STATUS=SecretSynced

# Check secret keys exist (don't print values)
kubectl get secret -n monitoring health-bridge-secrets -o jsonpath='{.data}' | jq 'keys'
# Expected: ["GITHUB_TOKEN", "WEBHOOK_SECRET"]

Managing Alert Rule Labels

Alert rules need a github_issue label for the bridge to process them. Current mappings:

| Alert Rule UID | github_issue |
| --- | --- |
| exercise-reminder-stale | willikins#11 |
| session-manager-stale | willikins#13 |
| audit-digest-stale | willikins#12 |
| agent-pod-not-running | frank#8 |
| endpoint-down | (none — future work) |

# Get Grafana admin credentials
GRAFANA_AUTH="admin:$(kubectl get secret -n monitoring victoria-metrics-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d)"

# List all rules with their github_issue labels
curl -s -u "$GRAFANA_AUTH" \
  "https://grafana.frank.derio.net/api/v1/provisioning/alert-rules" | \
  jq '.[] | {title: .title, uid: .uid, github_issue: .labels.github_issue}'

# Add or update a github_issue label on a rule
RULE_UID="exercise-reminder-stale"
ISSUE="willikins#11"
RULE=$(curl -s -u "$GRAFANA_AUTH" \
  "https://grafana.frank.derio.net/api/v1/provisioning/alert-rules/$RULE_UID")
UPDATED=$(echo "$RULE" | jq --arg issue "$ISSUE" '.labels.github_issue = $issue')
curl -s -X PUT "https://grafana.frank.derio.net/api/v1/provisioning/alert-rules/$RULE_UID" \
  -u "$GRAFANA_AUTH" \
  -H "Content-Type: application/json" \
  -d "$UPDATED"

Managing the Grafana Contact Point

GRAFANA_AUTH="admin:$(kubectl get secret -n monitoring victoria-metrics-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d)"

# List contact points
curl -s -u "$GRAFANA_AUTH" \
  "https://grafana.frank.derio.net/api/v1/provisioning/contact-points" | \
  jq '.[] | {uid: .uid, name: .name, type: .type}'

# Check notification policy routing
curl -s -u "$GRAFANA_AUTH" \
  "https://grafana.frank.derio.net/api/v1/provisioning/policies" | jq .

Verifying GitHub Integration

# Check a specific issue's lifecycle state on the project board
gh issue view 11 --repo derio-net/willikins --json projectItems \
  --jq '.projectItems[]'

# Check recent comments added by the bridge
gh issue view 11 --repo derio-net/willikins --json comments \
  --jq '.comments[] | select(.body | contains("health-bridge")) | {createdAt, body}'

Troubleshooting

Bridge not processing alerts

  1. Check pod logs for errors:

    kubectl logs -n monitoring -l app=health-bridge --tail=50
  2. Verify the webhook contact point exists in Grafana:

    curl -s -u "$GRAFANA_AUTH" \
      "https://grafana.frank.derio.net/api/v1/provisioning/contact-points" | \
      jq '.[] | select(.name == "Health Bridge Webhook")'
  3. Verify the notification policy routes Feature Health alerts:

    curl -s -u "$GRAFANA_AUTH" \
      "https://grafana.frank.derio.net/api/v1/provisioning/policies" | \
      jq '.routes[] | select(.receiver == "Health Bridge Webhook")'

Readiness probe returns “not ready”

The bridge couldn’t load project metadata from GitHub on startup. Check:

# Pod logs will show the error
kubectl logs -n monitoring -l app=health-bridge | head -5

# Common causes:
# - GITHUB_TOKEN expired or missing scopes (needs repo, project, read:org)
# - Project number wrong (check PROJECT_NUMBER in configmap)
# - GitHub API rate limit hit

Alerts skip the bridge (no github_issue label)

Bridge logs show Alert <name> has no github_issue label, skipping. Add the label to the alert rule — see “Managing Alert Rule Labels” above.
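To see which alerts are currently affected, the skip message quoted above can be grepped out of the recent logs (the message text is taken from the log line cited in this section):

```shell
# List distinct alerts the bridge skipped for lacking a github_issue label
kubectl logs -n monitoring -l app=health-bridge --tail=500 \
  | grep "has no github_issue label" \
  | sort -u
```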

Duplicate bug issues appearing

Symptom: Multiple identical [Bug] ... is dead issues created for the same alert.

Cause: Before v0.2.0, the bridge had no dedup logic. If you’re running any release earlier than v0.2.0, upgrade.

If running v0.2.0+: This can happen once after a pod restart (in-memory state is lost). The GitHub search safety net should prevent all but the first duplicate. If duplicates persist, check pod restart frequency.

Cleanup: Close duplicates with gh issue close <number> --repo derio-net/<repo> --comment "Duplicate", keeping the earliest one open.
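Before closing anything, the duplicates can be listed oldest-first so the earliest issue (the keeper) appears at the top. A sketch assuming GitHub's issue search syntax and the frank-ops repo; adjust the repo and the title query to match the actual bug title:

```shell
# List open "[Bug] ... is dead" issues, oldest first; the first row is the one to keep
gh issue list --repo derio-net/frank-ops --state open \
  --search '"is dead" in:title' \
  --json number,title,createdAt \
  --jq 'sort_by(.createdAt)[] | "\(.number)\t\(.createdAt)\t\(.title)"'
```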

GitHub API errors

# Check for GitHub API errors in logs
kubectl logs -n monitoring -l app=health-bridge | grep -i error

# Verify GitHub token scopes (from outside the cluster)
curl -sI -H "Authorization: Bearer $(kubectl get secret -n monitoring health-bridge-secrets \
  -o jsonpath='{.data.GITHUB_TOKEN}' | base64 -d)" \
  https://api.github.com/ | grep -i x-oauth-scopes

Updating the Bridge

# In the health-bridge repo:
# 1. Make changes, run tests
go test -v ./...

# 2. Tag and push
git tag v0.2.0
git push origin v0.2.0
# GitHub Actions builds and pushes to GHCR

# 3. Update the image tag in frank repo
# Edit apps/health-bridge/manifests/deployment.yaml
# Change: image: ghcr.io/derio-net/health-bridge:v0.2.0
# Commit and push — ArgoCD syncs automatically
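Step 3 can be scripted rather than hand-edited. A sketch assuming mikefarah yq v4, the manifest path above, and a single container in the pod spec:

```shell
# Bump the image tag in the frank repo and let ArgoCD roll it out
NEW_TAG="v0.2.0"
yq -i ".spec.template.spec.containers[0].image = \"ghcr.io/derio-net/health-bridge:${NEW_TAG}\"" \
  apps/health-bridge/manifests/deployment.yaml
git commit -am "chore: bump health-bridge to ${NEW_TAG}"
git push origin main
```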

Layer trackers (Pass 3)

As of 2026-04-20, the 20 Layer tracker Issues on the Derio Ops board were relocated from the public derio-net/frank to the private derio-net/frank-ops repo, with Issue numbers aligned 1:1 to Layer numbers (so frank-ops#13 is Layer 13 Authentik). Each Layer has one Grafana alert rule with github_issue: "frank-ops#<LAYER>" driving its Lifecycle field automatically.
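A quick way to eyeball the 1:1 issue-to-Layer numbering, assuming the frank-ops repo above (20 trackers, so --limit 25 gives headroom):

```shell
# List Layer tracker issues; issue number should match Layer number
gh issue list --repo derio-net/frank-ops --state all --limit 25 \
  --json number,title \
  --jq 'sort_by(.number)[] | "\(.number)\t\(.title)"'
```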

Smoke-testing a Layer via direct webhook

The direct-Bridge test bypasses Grafana’s rule evaluation, which is handy for verifying the Bridge + GitHub path without waiting for a real metric to dip:

export WEBHOOK_SECRET=$(kubectl get secret -n monitoring health-bridge-secrets \
  -o jsonpath='{.data.WEBHOOK_SECRET}' | base64 -d)
kubectl port-forward -n monitoring svc/health-bridge 8080:8080 &

# Fire a critical alert at Layer 13 (Authentik)
curl -s -X POST http://localhost:8080/webhook \
  -H "Authorization: Bearer $WEBHOOK_SECRET" -H "Content-Type: application/json" \
  -d '{"status":"firing","alerts":[{
    "status":"firing",
    "labels":{"alertname":"smoke","severity":"critical","github_issue":"frank-ops#13"},
    "annotations":{"summary":"Smoke test"},
    "startsAt":"2026-04-20T00:00:00Z"
  }]}'
# Response: {"processed": 1, "total": 1}

Checking a Layer’s current Lifecycle state

gh api graphql -f query='
{
  repository(owner:"derio-net", name:"frank-ops") {
    issue(number:13) {
      projectItems(first:5) {
        nodes {
          fieldValueByName(name:"Lifecycle") {
            ... on ProjectV2ItemFieldSingleSelectValue { name }
          }
        }
      }
    }
  }
}' --jq '.data.repository.issue.projectItems.nodes[].fieldValueByName.name'
# → healthy  (or degraded, dead, etc.)

Reloading rules after editing the ConfigMap

Grafana’s provisioning files are read at boot, not watched. After editing apps/grafana-alerting/manifests/alert-rules-cm.yaml:

git add apps/grafana-alerting/manifests/alert-rules-cm.yaml
git commit -m "feat(obs): ..."
git push origin main

# Wait for ArgoCD to sync the ConfigMap
kubectl annotate application -n argocd grafana-alerting \
  argocd.argoproj.io/refresh=hard --overwrite

# Restart Grafana to pick up the new ConfigMap
kubectl delete pod -n monitoring -l app.kubernetes.io/name=grafana

Two gotchas learned the hard way:

  1. RWO PVC + RollingUpdate deadlock. Grafana’s PVC is ReadWriteOnce. When the Deployment rolls due to a ConfigMap checksum change, the new pod can’t mount the volume while the old pod still holds it. If the rollout hangs, briefly scale the Deployment to 0 to force a detach, then scale it back up. A more durable fix (switching strategy.type to Recreate) is tracked as a follow-up.
  2. Check for parseError in the new pod’s logs before trusting that a rule change took effect:
    kubectl logs -n monitoring -l app.kubernetes.io/name=grafana --tail=200 | grep -iE 'parseError|provisioning.*error'
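The scale-to-zero workaround for gotcha 1 can be sketched as follows. The Deployment name victoria-metrics-grafana is an assumption inferred from the Secret name used elsewhere in this runbook; confirm it first with kubectl get deploy -n monitoring:

```shell
# Work around the RWO PVC deadlock: release the volume, then bring the pod back
# NOTE: deployment name is assumed; verify with `kubectl get deploy -n monitoring`
kubectl scale deployment -n monitoring victoria-metrics-grafana --replicas=0
kubectl rollout status deployment -n monitoring victoria-metrics-grafana
kubectl scale deployment -n monitoring victoria-metrics-grafana --replicas=1
kubectl rollout status deployment -n monitoring victoria-metrics-grafana
```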

Verifying a rule is loaded via the Grafana API

GRAFANA_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana \
  -o jsonpath='{.items[0].metadata.name}')
ADMIN_PASS=$(kubectl get secret -n monitoring victoria-metrics-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d)

kubectl exec -n monitoring "$GRAFANA_POD" -c grafana -- \
  curl -s -u admin:"$ADMIN_PASS" \
  http://localhost:3000/api/v1/provisioning/alert-rules/layer-13-auth-down \
  | jq '{uid, title, labels, annotations}'