Operating on Health Monitoring

Companion to Health Monitoring — Feature Probes, Heartbeats, and Telegram Alerts.

Quick Reference

| Component | Namespace | Port | Purpose |
|---|---|---|---|
| Blackbox Exporter | monitoring | 9115 | HTTP endpoint probing |
| Pushgateway | monitoring | 9091 | Heartbeat metric ingestion |
| Grafana | monitoring | 3000 (LB: 192.168.55.203) | Dashboards + alerting |
| Feature Health Dashboard | | | /d/fh-overview/feature-health |

Checking Probe Status

# Port-forward to Blackbox Exporter
kubectl port-forward -n monitoring svc/blackbox-exporter 9115:9115 &

# Probe a specific endpoint
curl -s "http://localhost:9115/probe?target=https://grafana.frank.derio.net&module=http_2xx" | grep probe_success
# Expected: probe_success 1

# Check all feature health probes via VictoriaMetrics
GRAFANA_AUTH="admin:$(kubectl get secret -n monitoring victoria-metrics-grafana -o jsonpath='{.data.admin-password}' | base64 -d)"
curl -sk -u "$GRAFANA_AUTH" "https://grafana.frank.derio.net/api/datasources/proxy/uid/P4169E866C3094E38/api/v1/query" \
  --data-urlencode 'query=probe_success{probe_group="feature_health"}'
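The single-endpoint check above can be wrapped in a small helper for spot-checking several targets. A sketch, assuming the Blackbox port-forward from the first command is still active and that `http_2xx` matches the module name in your exporter config:

```shell
# Minimal probe checker; assumes localhost:9115 port-forward is active.
check_probe() {
  local target="$1"
  local result
  result=$(curl -s "http://localhost:9115/probe?target=${target}&module=http_2xx" \
    | awk '/^probe_success/ {print $2}')
  if [ "$result" = "1" ]; then
    echo "PASS ${target}"
  else
    echo "FAIL ${target}"
    return 1
  fi
}

# check_probe https://grafana.frank.derio.net
```

The nonzero return code on failure makes it usable in loops or CI-style smoke checks.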

Checking Heartbeat Metrics

# Port-forward to Pushgateway
kubectl port-forward -n monitoring svc/pushgateway 9091:9091 &

# View all heartbeat metrics
curl -s http://localhost:9091/metrics | grep willikins_heartbeat

# Push a test heartbeat
echo "willikins_heartbeat_last_success_timestamp $(date +%s)" | \
  curl -s --data-binary @- http://localhost:9091/metrics/job/test_job

# Delete a test metric
curl -s -X DELETE http://localhost:9091/metrics/job/test_job
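When eyeballing staleness, raw epoch timestamps are hard to read. A sketch that converts each heartbeat into an age, assuming the Pushgateway port-forward above is active and each series carries a `job` label (the usual Pushgateway grouping):

```shell
# Print each heartbeat job and how long ago it last succeeded.
heartbeat_ages() {
  local now; now=$(date +%s)
  curl -s http://localhost:9091/metrics \
    | awk -v now="$now" '/^willikins_heartbeat_last_success_timestamp/ {
        job = $0
        sub(/.*job="/, "", job); sub(/".*/, "", job)
        printf "%s: %ds ago\n", job, now - $NF
      }'
}
```

Compare the printed ages against the thresholds in the alert rule table below to see which cron is closest to firing.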

File-Provisioned Alerting (as-code)

As of April 2026, all Grafana alerting configuration is file-provisioned via ConfigMaps in apps/grafana-alerting/manifests/:

| ConfigMap | Provisioning Path | Contents |
|---|---|---|
| grafana-alerting-rules | /etc/grafana/provisioning/alerting/alert-rules.yaml | 5 alert rules in 5 groups |
| grafana-alerting-contact-points | /etc/grafana/provisioning/alerting/contact-points.yaml | Telegram + Health Bridge webhook |
| grafana-alerting-notification-policy | /etc/grafana/provisioning/alerting/notification-policy.yaml | Severity-based routing tree |
| grafana-alerting-dashboard | /etc/grafana/provisioning/dashboards/ + /var/lib/grafana/dashboards/feature-health/ | Feature Health dashboard |

Editing Alert Rules

File-provisioned rules are read-only in the UI. To modify:

  1. Edit the ConfigMap YAML in apps/grafana-alerting/manifests/alert-rules-cm.yaml
  2. Commit and push — ArgoCD syncs the ConfigMap
  3. Restart Grafana pod to reload provisioning files:
    kubectl delete pod -n monitoring -l app.kubernetes.io/name=grafana

Editing the Dashboard

  1. Open the provisioned dashboard in Grafana UI, click “Save as” to create a scratch copy
  2. Edit the scratch copy freely in the UI
  3. Export the final JSON (Share → Export → Save to file)
  4. Replace the feature-health.json content in apps/grafana-alerting/manifests/dashboard-cm.yaml
  5. Commit, push, restart Grafana pod
  6. Delete the scratch dashboard
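Step 4 can be scripted rather than hand-pasting JSON into YAML. A sketch using kubectl's client-side dry-run; the export filename is hypothetical, and you will need to re-add any labels or annotations the original manifest carries:

```shell
# Regenerate the dashboard ConfigMap manifest from an exported JSON file.
regen_dashboard_cm() {
  local export_json="$1"   # e.g. the file saved in step 3
  kubectl create configmap grafana-alerting-dashboard -n monitoring \
    --from-file=feature-health.json="${export_json}" \
    --dry-run=client -o yaml
}

# regen_dashboard_cm /tmp/feature-health-export.json \
#   > apps/grafana-alerting/manifests/dashboard-cm.yaml
```

Printing to stdout (rather than writing the file directly) lets you diff against the committed manifest first.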

Grafana Alert Management

Historical: The curl commands below were used when alerts were API-provisioned. Since April 2026, alerting is file-provisioned via ConfigMaps. See File-Provisioned Alerting above. These commands still work for reading alert state but not for modifying rules.

GRAFANA_AUTH="admin:$(kubectl get secret -n monitoring victoria-metrics-grafana -o jsonpath='{.data.admin-password}' | base64 -d)"

# List all alert states
curl -sk -u "$GRAFANA_AUTH" \
  "https://grafana.frank.derio.net/api/prometheus/grafana/api/v1/alerts" | \
  python3 -c "import json,sys; [print(f'{a[\"state\"]}: {a[\"labels\"][\"alertname\"]}') for a in json.load(sys.stdin)['data']['alerts']]"

# Check alertmanager active alerts
curl -sk -u "$GRAFANA_AUTH" \
  "https://grafana.frank.derio.net/api/alertmanager/grafana/api/v2/alerts" | python3 -m json.tool

# Check notification policies
curl -sk -u "$GRAFANA_AUTH" \
  "https://grafana.frank.derio.net/api/v1/provisioning/policies" | python3 -m json.tool

# View a specific alert rule
curl -sk -u "$GRAFANA_AUTH" \
  "https://grafana.frank.derio.net/api/v1/provisioning/alert-rules/exercise-reminder-stale" | python3 -m json.tool

Alert Rule UIDs

| UID | What It Monitors |
|---|---|
| exercise-reminder-stale | Exercise reminder cron heartbeat (threshold: 3h) |
| session-manager-stale | Session manager cron heartbeat (threshold: 10m) |
| audit-digest-stale | Audit digest cron heartbeat (threshold: 26h) |
| endpoint-down | HTTP endpoint probes (any probe_success=0) |
| agent-pod-not-running | Secure agent pod not in Running phase |
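To audit all of the rules above in one pass, the read-only provisioning endpoint can be looped over. A sketch, assuming GRAFANA_AUTH is set as shown earlier; the `title` and `for` field names assume the provisioning API's JSON shape:

```shell
# Dump title and pending period for each provisioned rule UID.
dump_rules() {
  local uid
  for uid in "$@"; do
    echo "== ${uid} =="
    curl -sk -u "$GRAFANA_AUTH" \
      "https://grafana.frank.derio.net/api/v1/provisioning/alert-rules/${uid}" \
      | python3 -c 'import json,sys; r=json.load(sys.stdin); print(r.get("title"), "for:", r.get("for"))'
  done
}

# dump_rules exercise-reminder-stale session-manager-stale audit-digest-stale \
#            endpoint-down agent-pod-not-running
```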

Updating Alert Thresholds

Alert rules use the Grafana 12.x SSE 3-step format (A→B→C). To update a threshold:

# 1. GET the current rule
curl -sk -u "$GRAFANA_AUTH" \
  "https://grafana.frank.derio.net/api/v1/provisioning/alert-rules/<uid>" > /tmp/rule.json

# 2. Edit the threshold in the C refId's conditions[0].evaluator.params
#    (the value is in the model.conditions[0].evaluator.params array)

# 3. PUT it back
curl -sk -u "$GRAFANA_AUTH" -X PUT \
  "https://grafana.frank.derio.net/api/v1/provisioning/alert-rules/<uid>" \
  -H "Content-Type: application/json" \
  -H "X-Provision-Source: api" \
  -d @/tmp/rule.json
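Step 2 can be done non-interactively. A sketch of a hypothetical helper; it assumes the A→B→C rule shape described above, with the threshold living in the C stage's conditions[0].evaluator.params:

```shell
# Rewrite the C-stage threshold in a rule JSON, printing the result.
set_threshold() {
  python3 - "$1" "$2" <<'EOF'
import json, sys

path, value = sys.argv[1], int(sys.argv[2])
with open(path) as f:
    rule = json.load(f)
for stage in rule["data"]:
    if stage["refId"] == "C":
        stage["model"]["conditions"][0]["evaluator"]["params"] = [value]
print(json.dumps(rule, indent=2))
EOF
}

# set_threshold /tmp/rule.json 10800 > /tmp/rule-new.json   # 3h in seconds
```

Verify the path against your actual rule JSON before PUTting the result back.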

Telegram Contact Point

| Setting | Value |
|---|---|
| Contact point UID | efi04e0201jb4f |
| Bot | @agent_zero_cc_bot |
| Token secret | FRANK_C2_TELEGRAM_BOT_TOKEN (Infisical) |
| Chat ID | FRANK_C2_TELEGRAM_CHAT_ID (Infisical) |

# Update contact point (e.g., after bot token rotation)
curl -sk -u "$GRAFANA_AUTH" -X PUT \
  "https://grafana.frank.derio.net/api/v1/provisioning/contact-points/efi04e0201jb4f" \
  -H "Content-Type: application/json" \
  -H "X-Provision-Source: api" \
  -d '{
    "uid": "efi04e0201jb4f",
    "name": "Telegram - Willikins",
    "type": "telegram",
    "settings": {
      "bottoken": "<FRANK_C2_TELEGRAM_BOT_TOKEN>",
      "chatid": "<FRANK_C2_TELEGRAM_CHAT_ID>",
      "parse_mode": "Markdown"
    }
  }'
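After rotating the token, it is worth confirming the bot can actually reach the chat before waiting for an alert to fire. A sketch using the standard Telegram Bot API sendMessage call; the secret values come from Infisical as listed in the table above:

```shell
# Send a test message straight through the Telegram Bot API.
send_test_message() {
  local token="$1" chat_id="$2"
  curl -s "https://api.telegram.org/bot${token}/sendMessage" \
    --data-urlencode "chat_id=${chat_id}" \
    --data-urlencode "text=Willikins alert-path test"
}

# send_test_message "$FRANK_C2_TELEGRAM_BOT_TOKEN" "$FRANK_C2_TELEGRAM_CHAT_ID"
```

A successful call returns JSON with "ok": true; a 401 here means the token in the contact point is also stale.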

Notification Not Arriving?

If a firing alert isn’t reaching Telegram:

  1. Check repeat interval — default grouping suppresses re-notification for the configured repeat_interval
  2. Check contact point — token may have been lost after Grafana pod restart
  3. Nuclear option — restart Grafana pod to reset alertmanager notification dedup state:
    kubectl delete pod -n monitoring -l app.kubernetes.io/name=grafana

Cron Jobs (Supercronic)

The secure-agent-pod runs cron jobs via supercronic watching ~/.crontab:

# Check crontab contents
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -- cat /home/claude/.crontab

# Check supercronic process
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -- ps aux | grep supercronic

# Update crontab (supercronic auto-reloads on file change)
kubectl exec -n secure-agent-pod deploy/secure-agent-pod -- \
  cp /home/claude/repos/willikins/scripts/willikins-agent/crontab.txt /home/claude/.crontab
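For reference, an entry in crontab.txt pairing a cron with its heartbeat typically looks like the sketch below. The script path and in-cluster Pushgateway address are hypothetical; what matters is that the job label in the push URL matches what the corresponding stale alert queries, and that the push only happens on success:

```
# hypothetical example: run every 5 minutes, heartbeat only on success
*/5 * * * * /home/claude/repos/willikins/scripts/session-manager.sh && echo "willikins_heartbeat_last_success_timestamp $(date +%s)" | curl -s --data-binary @- http://pushgateway.monitoring.svc:9091/metrics/job/session_manager
```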

Pod Health

# Check all monitoring pods
kubectl get pods -n monitoring -l 'app in (blackbox-exporter,pushgateway)'

# Check Blackbox Exporter logs
kubectl logs -n monitoring -l app=blackbox-exporter --tail=20

# Check Pushgateway logs
kubectl logs -n monitoring -l app=pushgateway --tail=20

# Check Grafana logs for alert/notification issues
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana -c grafana --tail=50 | \
  grep -iE "error|warn|notify|telegram"

Troubleshooting

VMProbe/VMServiceScrape not applying

If kubectl apply fails with x509: certificate signed by unknown authority:

# The VictoriaMetrics Operator webhook caBundle is out of sync
# Check if ArgoCD overwrote it:
kubectl get validatingwebhookconfiguration -l app.kubernetes.io/instance=victoria-metrics -o yaml | grep caBundle | head -1

# Fix: ensure ignoreDifferences is set in apps/root/templates/victoria-metrics.yaml
# Then restart the operator to regenerate certs:
kubectl rollout restart deployment -n monitoring victoria-metrics-operator
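The ignoreDifferences stanza referenced above would look roughly like this in the ArgoCD Application spec. The group/kind and jsonPointers syntax are standard ArgoCD; the webhook index may differ in your chart:

```yaml
spec:
  ignoreDifferences:
    - group: admissionregistration.k8s.io
      kind: ValidatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/clientConfig/caBundle
```

This stops ArgoCD from reverting the caBundle the operator injects at runtime.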

Dashboard shows no data

  • Verify datasource UID is P4169E866C3094E38
  • Table panels require "format": "table" on targets
  • ALERTS{} metric doesn’t exist for Grafana-managed alerts — use alertlist panel type

References