
Operating on Observability
This is the operational companion to Building Observability. That post covers the architecture decisions and deployment gotchas. This one covers what you actually type when you need to find out why something is broken, slow, or eating memory.
Overview
Frank’s observability stack has four moving parts:
- VictoriaMetrics (VMSingle + vmagent) – time-series metrics database and scraping engine, running in the monitoring namespace with a 20Gi Longhorn PVC and 1-month retention
- Grafana at http://192.168.55.203 – dashboards and exploration, with OIDC auth via Authentik
- Fluent Bit – DaemonSet on all nodes (including tainted control-plane and GPU nodes), shipping container logs
- VictoriaLogs – log storage with 14-day retention, queryable through Grafana’s Explore tab
Supporting collectors: node-exporter (hardware metrics on all nodes) and kube-state-metrics (Kubernetes object metrics).
Observing State
Grafana Dashboards
Open http://192.168.55.203 in a browser. The stack ships with pre-built dashboards under the “VictoriaMetrics” folder:
- Node Exporter Full – per-node CPU, memory, disk I/O, network, filesystem
- Kubernetes / Compute Resources / Cluster – cluster-wide CPU and memory requests vs limits vs actual usage
- Kubernetes / Compute Resources / Namespace – the same, broken down by namespace
- VMAgent – scrape targets, samples/sec, queue depth
These dashboards are provisioned by the Helm chart and survive Grafana pod restarts.
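If you later add dashboards as code rather than through the UI, the same mechanism applies: the chart's Grafana runs a dashboard sidecar that loads any ConfigMap carrying a marker label. A hedged sketch, assuming the common sidecar defaults (the label key is configurable in the chart's values, and the dashboard JSON here is a placeholder):

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # sidecar picks up ConfigMaps with this label
data:
  my-dashboard.json: |
    { "title": "My Dashboard", "panels": [] }
```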
Querying Metrics with MetricsQL
For ad-hoc metric exploration, port-forward to VMSingle and use its built-in UI:
kubectl port-forward -n monitoring svc/vmsingle-victoria-metrics-victoria-metrics-k8s-stack 8429:8429

Then open http://localhost:8429/vmui in your browser. MetricsQL is a superset of PromQL – any PromQL query works, plus extensions like keep_metric_names and range_median.
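For instance (a sketch of those two extensions; consult the MetricsQL docs for exact semantics):

```
# rate() normally drops the metric name; keep_metric_names preserves it
rate(node_network_receive_bytes_total[5m]) keep_metric_names

# median value of each series over the selected time range
range_median(node_load1)
```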
Some useful starter queries:
# CPU usage by node (1m average)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
# Memory usage percentage by node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Pod restart counts in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0
# Disk usage on Longhorn volumes
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100

You can also query from the command line with curl:
kubectl port-forward -n monitoring svc/vmsingle-victoria-metrics-victoria-metrics-k8s-stack 8429:8429 &
curl -s 'http://localhost:8429/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, up: .value[1]}'

Querying Logs with VictoriaLogs
Logs are queryable through Grafana’s Explore tab – select the “VictoriaLogs” datasource and use LogsQL syntax:
# All logs from a namespace
{kubernetes_namespace_name="argocd"}
# Logs from a specific pod
{kubernetes_pod_name=~"victoria-metrics.*"}
# Error lines across the entire cluster
{kubernetes_namespace_name=~".+"} |= "error"
# Logs from the GPU node
{kubernetes_host="gpu-1"} | level:error

For CLI access, port-forward to VictoriaLogs directly:
kubectl port-forward -n monitoring svc/victoria-logs-victoria-logs-single-server 9428:9428
curl -s 'http://localhost:9428/select/logsql/query?query={kubernetes_namespace_name="monitoring"}&limit=10' | jq .

Checking Pipeline Health
Verify all pieces are running:
# vmagent is scraping
kubectl get pods -n monitoring -l app.kubernetes.io/name=vmagent
kubectl logs -n monitoring -l app.kubernetes.io/name=vmagent --tail=5
# Fluent Bit is running on all nodes
kubectl get ds -n monitoring fluent-bit
# DESIRED and READY counts should match (7 nodes)
# VictoriaLogs is accepting writes
kubectl logs -n monitoring -l app=victoria-logs-single-server --tail=5

Routine Operations
Creating and Importing Grafana Dashboards
To import a community dashboard (for example, dashboard ID 1860 for Node Exporter Full):
- Open Grafana at http://192.168.55.203
- Go to Dashboards > Import
- Enter the dashboard ID and click Load
- Select the VictoriaMetrics datasource and click Import
Imported dashboards persist across pod restarts automatically: dashboards saved in the UI are written to Grafana’s 1Gi Longhorn PVC and survive restarts.
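If you also want dashboards backed up outside the PVC (for example, in git), Grafana's HTTP API can export them. A sketch assuming an API token in `$GRAFANA_TOKEN` (the endpoints are standard Grafana API paths; the host is this cluster's Grafana):

```shell
# List every dashboard UID, then save each dashboard's JSON model to a file.
for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    'http://192.168.55.203/api/search?type=dash-db' | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "http://192.168.55.203/api/dashboards/uid/$uid" \
    | jq '.dashboard' > "dashboard-$uid.json"
done
```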
Adjusting Retention
Metrics retention is set in apps/victoria-metrics/values.yaml:
vmsingle:
  spec:
    retentionPeriod: "1"  # 1 month

Log retention is in apps/victoria-logs/values.yaml:
server:
  retentionPeriod: 14d

Change the value, commit, and let ArgoCD sync. The pods will restart with the new retention window. Existing data outside the new window is garbage-collected on the next retention pass.
Checking What vmagent Is Scraping
vmagent exposes its target list via its own UI:
kubectl port-forward -n monitoring svc/vmagent-victoria-metrics-victoria-metrics-k8s-stack 8429:8429

Open http://localhost:8429/targets to see every scrape target, its status (up/down), last scrape time, and error messages. This is the first place to look when a metric is missing.
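The same list is available as JSON, assuming vmagent serves the Prometheus-compatible `/api/v1/targets` endpoint (recent releases do). A filter that shows only unhealthy targets and their last scrape error:

```shell
# Show only targets that are not up, with their last error message.
curl -s 'http://localhost:8429/api/v1/targets' \
  | jq -r '.data.activeTargets[] | select(.health != "up")
           | "\(.scrapeUrl)\t\(.lastError)"'
```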
Exploring Available Metrics
To find what metrics exist:
# List all metric names
curl -s 'http://localhost:8429/api/v1/label/__name__/values' | jq '.data[:20]'
# Search for metrics by keyword
curl -s 'http://localhost:8429/api/v1/label/__name__/values' | jq '.data[] | select(test("gpu|nvidia"))'

Debugging
Missing Metrics
If a metric you expect is not showing up:
Check the scrape target – is the exporter pod running?
kubectl get pods -n monitoring -l app.kubernetes.io/name=node-exporter
kubectl get pods -n monitoring -l app.kubernetes.io/name=kube-state-metrics

Check vmagent targets – is it scraping the endpoint?
kubectl port-forward -n monitoring svc/vmagent-victoria-metrics-victoria-metrics-k8s-stack 8429:8429
# Open http://localhost:8429/targets and look for the target

Check VMServiceMonitor – does the CRD exist and match the service labels?
kubectl get vmservicemonitors -n monitoring
kubectl describe vmservicemonitor <name> -n monitoring

Check the exporter directly – does it actually expose the metric?
kubectl port-forward -n monitoring <exporter-pod> <port>:<port>
curl http://localhost:<port>/metrics | grep <metric-name>
Fluent Bit Not Shipping Logs
If logs are not appearing in VictoriaLogs:
Check Fluent Bit pods – are they running on all nodes?
kubectl get ds -n monitoring fluent-bit
kubectl get pods -n monitoring -l app.kubernetes.io/name=fluent-bit -o wide

Check Fluent Bit logs for output errors:
kubectl logs -n monitoring -l app.kubernetes.io/name=fluent-bit --tail=50

Look for retry lines. Silent retries with no error detail usually mean DNS resolution failure – the output hostname is wrong or the target service is down.

Verify the destination hostname resolves:
kubectl exec -n monitoring <fluent-bit-pod> -- nslookup \
  victoria-logs-victoria-logs-single-server.monitoring.svc.cluster.local

Check tail file positions – Fluent Bit tracks where it left off reading each log file. If positions are stale, it may be re-reading or skipping:
kubectl exec -n monitoring <fluent-bit-pod> -- ls -la /var/log/flb_kube.db
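If the positions database looks healthy, Fluent Bit's own counters can localize the failure. A sketch, assuming the chart enables Fluent Bit's built-in HTTP monitoring server (off by default in some configs; default port 2020) – each output plugin reports processed-record, error, and retry counts:

```shell
# Per-output error/retry counters from Fluent Bit's monitoring endpoint.
kubectl port-forward -n monitoring <fluent-bit-pod> 2020:2020 &
curl -s http://localhost:2020/api/v1/metrics | jq '.output'
```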
High Cardinality
If VMSingle memory usage is climbing or queries are slow, high cardinality labels are usually the cause:
# Check top series by cardinality
curl -s 'http://localhost:8429/api/v1/status/tsdb' | jq '.data.seriesCountByMetricName[:10]'

If a metric has an unbounded label (like a request ID or session token), either drop the label in vmagent’s relabeling config or exclude the metric entirely.
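A sketch of the first option, dropping the offending label on the relevant VMServiceMonitor (field names per the VictoriaMetrics operator API; session_id is a hypothetical unbounded label):

```
spec:
  endpoints:
    - port: http
      metricRelabelConfigs:
        - action: labeldrop   # strip the label before samples are ingested
          regex: session_id
```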
VictoriaLogs Query Returns No Results
Check VictoriaLogs is receiving data:
kubectl port-forward -n monitoring svc/victoria-logs-victoria-logs-single-server 9428:9428
curl -s 'http://localhost:9428/select/logsql/query?query=*&limit=5' | jq .

If this returns results, the problem is your query syntax, not the pipeline.
Check the Grafana datasource – the VictoriaLogs datasource must point to http://victoria-logs-victoria-logs-single-server.monitoring.svc.cluster.local:9428. Go to Grafana > Configuration > Data Sources and verify.

Check retention – if logs are older than 14 days, they have been garbage-collected.
Quick Reference
| Command | What It Does |
|---|---|
| `kubectl port-forward -n monitoring svc/vmsingle-... 8429:8429` | Access VMSingle UI and API |
| `kubectl port-forward -n monitoring svc/victoria-logs-...-server 9428:9428` | Access VictoriaLogs API |
| `kubectl get ds -n monitoring fluent-bit` | Check Fluent Bit DaemonSet status |
| `kubectl logs -n monitoring -l app.kubernetes.io/name=fluent-bit --tail=50` | Fluent Bit output logs |
| `kubectl logs -n monitoring -l app.kubernetes.io/name=vmagent --tail=50` | vmagent scrape logs |
| `kubectl get vmservicemonitors -n monitoring` | List all metric scrape configs |
| `curl localhost:8429/api/v1/query?query=up` | Query metrics via API |
| `curl 'localhost:9428/select/logsql/query?query=*&limit=10'` | Query logs via API |
| `curl localhost:8429/targets` | List vmagent scrape targets |
| `curl localhost:8429/api/v1/status/tsdb` | TSDB cardinality stats |
References
- VictoriaMetrics Documentation – MetricsQL reference, VMSingle operations, retention
- VictoriaLogs Documentation – LogsQL syntax, ingestion API
- Grafana Documentation – Dashboard management, datasource provisioning
- Fluent Bit Documentation – Pipeline debugging, tail input, HTTP output
- Building Observability – Architecture decisions and deployment gotchas