Operating on Observability

This is the operational companion to Building Observability. That post covers the architecture decisions and deployment gotchas. This one covers what you actually type when you need to find out why something is broken, slow, or eating memory.

Overview

Frank’s observability stack has four moving parts:

VictoriaMetrics (VMSingle + vmagent) – time-series metrics database and scraping engine, running in the monitoring namespace with a 20Gi Longhorn PVC and 1-month retention
Grafana at http://192.168.55.203 – dashboards and exploration, with OIDC auth via Authentik
Fluent Bit – DaemonSet on all nodes (including tainted control-plane and GPU nodes), shipping container logs
VictoriaLogs – log storage with 14-day retention, queryable through Grafana’s Explore tab

Supporting collectors: node-exporter (hardware metrics on all nodes) and kube-state-metrics (Kubernetes object metrics).

Observing State

Grafana Dashboards

Open http://192.168.55.203 in a browser. The stack ships with pre-built dashboards under the “VictoriaMetrics” folder:

Node Exporter Full – per-node CPU, memory, disk I/O, network, filesystem
Kubernetes / Compute Resources / Cluster – cluster-wide CPU and memory requests vs limits vs actual usage
Kubernetes / Compute Resources / Namespace – the same, broken down by namespace
VMAgent – scrape targets, samples/sec, queue depth

These dashboards are provisioned by the Helm chart and survive Grafana pod restarts.

Querying Metrics with MetricsQL

For ad-hoc metric exploration, port-forward to VMSingle and use its built-in UI:

kubectl port-forward -n monitoring svc/vmsingle-victoria-metrics-victoria-metrics-k8s-stack 8429:8429

Then open http://localhost:8429/vmui in your browser. MetricsQL is a superset of PromQL – any PromQL query works, plus extensions like keep_metric_names and range_median.

Some useful starter queries:

# CPU usage by node (1m average)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)

# Memory usage percentage by node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Pod restart counts in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

# Disk usage on Longhorn volumes
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100

You can also query from the command line with curl:

kubectl port-forward -n monitoring svc/vmsingle-victoria-metrics-victoria-metrics-k8s-stack 8429:8429 &
curl -s 'http://localhost:8429/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, up: .value[1]}'

Querying Logs with VictoriaLogs

Logs are queryable through Grafana’s Explore tab – select the “VictoriaLogs” datasource and use LogsQL syntax:

# All logs from a namespace
{kubernetes_namespace_name="argocd"}

# Logs from a specific pod
{kubernetes_pod_name=~"victoria-metrics.*"}

# Error lines across the entire cluster
{kubernetes_namespace_name=~".+"} |= "error"

# Logs from the GPU node
{kubernetes_host="gpu-1"} | level:error

For CLI access, port-forward to VictoriaLogs directly:

kubectl port-forward -n monitoring svc/victoria-logs-victoria-logs-single-server 9428:9428
curl -s 'http://localhost:9428/select/logsql/query?query={kubernetes_namespace_name="monitoring"}&limit=10' | jq .

Checking Pipeline Health

Verify all pieces are running:

# vmagent is scraping
kubectl get pods -n monitoring -l app.kubernetes.io/name=vmagent
kubectl logs -n monitoring -l app.kubernetes.io/name=vmagent --tail=5

# Fluent Bit is running on all nodes
kubectl get ds -n monitoring fluent-bit
# DESIRED and READY counts should match (7 nodes)

# VictoriaLogs is accepting writes
kubectl logs -n monitoring -l app=victoria-logs-single-server --tail=5

Routine Operations

Creating and Importing Grafana Dashboards

To import a community dashboard (for example, dashboard ID 1860 for Node Exporter Full):

Open Grafana at http://192.168.55.203
Go to Dashboards > Import
Enter the dashboard ID and click Load
Select the VictoriaMetrics datasource and click Import

To make an imported dashboard persistent across pod restarts, Grafana’s 1Gi Longhorn PVC handles that automatically. Dashboards saved in the UI are written to the PVC and survive restarts.

Adjusting Retention

Metrics retention is set in apps/victoria-metrics/values.yaml:

vmsingle:
  spec:
    retentionPeriod: "1"  # 1 month

Log retention is in apps/victoria-logs/values.yaml:

server:
  retentionPeriod: 14d

Change the value, commit, and let ArgoCD sync. The pods will restart with the new retention window. Existing data outside the new window is garbage-collected on the next retention pass.

Checking What vmagent Is Scraping

vmagent exposes its target list via its own UI:

kubectl port-forward -n monitoring svc/vmagent-victoria-metrics-victoria-metrics-k8s-stack 8429:8429

Open http://localhost:8429/targets to see every scrape target, its status (up/down), last scrape time, and error messages. This is the first place to look when a metric is missing.

Exploring Available Metrics

To find what metrics exist:

# List all metric names
curl -s 'http://localhost:8429/api/v1/label/__name__/values' | jq '.data[:20]'

# Search for metrics by keyword
curl -s 'http://localhost:8429/api/v1/label/__name__/values' | jq '.data[] | select(test("gpu|nvidia"))'

Debugging

Missing Metrics

If a metric you expect is not showing up:

Check the scrape target – is the exporter pod running?

kubectl get pods -n monitoring -l app.kubernetes.io/name=node-exporter
kubectl get pods -n monitoring -l app.kubernetes.io/name=kube-state-metrics

Check vmagent targets – is it scraping the endpoint?

kubectl port-forward -n monitoring svc/vmagent-victoria-metrics-victoria-metrics-k8s-stack 8429:8429
# Open http://localhost:8429/targets and look for the target

Check VMServiceMonitor – does the CRD exist and match the service labels?

kubectl get vmservicemonitors -n monitoring
kubectl describe vmservicemonitor <name> -n monitoring

Check the exporter directly – does it actually expose the metric?

kubectl port-forward -n monitoring <exporter-pod> <port>:<port>
curl http://localhost:<port>/metrics | grep <metric-name>

Fluent Bit Not Shipping Logs

If logs are not appearing in VictoriaLogs:

Check Fluent Bit pods – are they running on all nodes?

kubectl get ds -n monitoring fluent-bit
kubectl get pods -n monitoring -l app.kubernetes.io/name=fluent-bit -o wide

Check Fluent Bit logs for output errors:
```
kubectl logs -n monitoring -l app.kubernetes.io/name=fluent-bit --tail=50
```
Look for retry lines. Silent retries with no error detail usually mean DNS resolution failure – the output hostname is wrong or the target service is down.

Verify the destination hostname resolves:

kubectl exec -n monitoring <fluent-bit-pod> -- nslookup \
  victoria-logs-victoria-logs-single-server.monitoring.svc.cluster.local

Check tail file positions – Fluent Bit tracks where it left off reading each log file. If positions are stale, it may be re-reading or skipping:
```
kubectl exec -n monitoring <fluent-bit-pod> -- ls -la /var/log/flb_kube.db
```

High Cardinality

If VMSingle memory usage is climbing or queries are slow, high cardinality labels are usually the cause:

# Check top series by cardinality
curl -s 'http://localhost:8429/api/v1/status/tsdb' | jq '.data.seriesCountByMetricName[:10]'

If a metric has an unbounded label (like a request ID or session token), either drop the label in vmagent’s relabeling config or exclude the metric entirely.

VictoriaLogs Query Returns No Results

Check VictoriaLogs is receiving data:

kubectl port-forward -n monitoring svc/victoria-logs-victoria-logs-single-server 9428:9428
curl -s 'http://localhost:9428/select/logsql/query?query=*&limit=5' | jq .

If this returns results, the problem is your query syntax, not the pipeline.

Check the Grafana datasource – the VictoriaLogs datasource must point to http://victoria-logs-victoria-logs-single-server.monitoring.svc.cluster.local:9428. Go to Grafana > Configuration > Data Sources and verify.
Check retention – if logs are older than 14 days, they have been garbage-collected.

Quick Reference

Command	What It Does
`kubectl port-forward -n monitoring svc/vmsingle-... 8429:8429`	Access VMSingle UI and API
`kubectl port-forward -n monitoring svc/victoria-logs-...-server 9428:9428`	Access VictoriaLogs API
`kubectl get ds -n monitoring fluent-bit`	Check Fluent Bit DaemonSet status
`kubectl logs -n monitoring -l app.kubernetes.io/name=fluent-bit --tail=50`	Fluent Bit output logs
`kubectl logs -n monitoring -l app.kubernetes.io/name=vmagent --tail=50`	vmagent scrape logs
`kubectl get vmservicemonitors -n monitoring`	List all metric scrape configs
`curl localhost:8429/api/v1/query?query=up`	Query metrics via API
`curl localhost:9428/select/logsql/query?query=*&limit=10`	Query logs via API
`curl localhost:8429/targets`	List vmagent scrape targets
`curl localhost:8429/api/v1/status/tsdb`	TSDB cardinality stats

References

VictoriaMetrics Documentation – MetricsQL reference, VMSingle operations, retention
VictoriaLogs Documentation – LogsQL syntax, ingestion API
Grafana Documentation – Dashboard management, datasource provisioning
Fluent Bit Documentation – Pipeline debugging, tail input, HTTP output
Building Observability – Architecture decisions and deployment gotchas

Operating on GPU Compute Operating on Secrets