
Observability — Metrics and Logs with VictoriaMetrics
A Kubernetes cluster without observability is just a box of mystery. Pods crash silently. Memory leaks hide behind restart counts. Network blips become finger-pointing exercises. Layer 7 fixes that: a full metrics and logging stack built around VictoriaMetrics, VictoriaLogs, and Grafana, managed by ArgoCD, backed by Longhorn storage.
Three gotchas made the deployment more interesting than expected. They are documented in full below.
Why VictoriaMetrics Instead of Prometheus?
The standard choice is kube-prometheus-stack — Prometheus, Alertmanager, Grafana, and a bundle of exporters packaged together. It works well and has a huge ecosystem. But it is heavy, and on a homelab cluster where the control-plane nodes are also running workloads, baseline resource consumption matters.
The comparison is not subtle:
| Component | kube-prometheus-stack | victoria-metrics-k8s-stack |
|---|---|---|
| Prometheus / VMSingle | ~1–2 GB RAM | ~50–150 MB RAM |
| Full stack baseline | ~2–4 GB RAM | ~200–400 MB RAM |
| Storage format | TSDB (per-sample blocks) | custom compressed format |
| Ingestion throughput | good | 2–5x higher per benchmark |
| Long-term retention | needs Thanos / Cortex | built-in, single binary |
| PromQL compatibility | native | full (MetricsQL superset) |
VictoriaMetrics is not a drop-in replacement, in the sense that it requires rethinking the architecture, but the victoria-metrics-k8s-stack Helm chart is deliberately structured to be familiar to anyone who has used kube-prometheus-stack. It ships the same CRD patterns (VMServiceScrape and VMPodScrape, analogous to Prometheus’s ServiceMonitor and PodMonitor), the same Grafana dashboards, and the same node-exporter and kube-state-metrics exporters.
For Frank, the Talos Cluster, the math is easy: fewer wasted gigabytes on the control-plane nodes means more headroom for actual workloads. VictoriaMetrics wins.
The Stack
Three ArgoCD Applications make up the observability layer:
victoria-metrics — The core chart (victoria-metrics-k8s-stack v0.72.4). Deploys:
- VMSingle: single-node time-series database with a 20Gi Longhorn PVC and 1-month retention
- vmagent: metrics scraping engine; reads VMServiceScrape and VMPodScrape CRDs
- node-exporter: DaemonSet on all 6 nodes, exposing hardware and OS metrics
- kube-state-metrics: Kubernetes object metrics (pod state, deployment replicas, etc.)
- Grafana: with a Longhorn-backed 1Gi PVC for dashboard persistence, exposed at 192.168.55.203
victoria-logs — Separate chart (victoria-logs-single v0.11.28). Deploys a single-node log storage server with a 20Gi Longhorn PVC and 14-day retention, accessible within the cluster at port 9428.
fluent-bit — DaemonSet on all nodes (including GPU and control-plane). Ships container logs from every node to VictoriaLogs via HTTP jsonline.
Note: vmalert and alertmanager are disabled in this layer. Alerting is planned for Layer 9 after the alert rules have been properly tuned. Running alertmanager without tuned rules just produces noise.
ArgoCD Deployment
Three Application CRs live in apps/root/templates/. All three follow the same dual-source pattern: the upstream Helm chart from the VictoriaMetrics or Fluent chart repositories, plus this Git repo as the values reference.
victoria-metrics Application
# apps/root/templates/victoria-metrics.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: victoria-metrics
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: infrastructure
  sources:
    - repoURL: https://victoriametrics.github.io/helm-charts/
      chart: victoria-metrics-k8s-stack
      targetRevision: "0.72.4"
      helm:
        releaseName: victoria-metrics
        valueFiles:
          - $values/apps/victoria-metrics/values.yaml
    - repoURL: {{ .Values.repoURL }}
      targetRevision: {{ .Values.targetRevision }}
      ref: values
  destination:
    server: {{ .Values.destination.server }}
    namespace: monitoring
  syncPolicy:
    automated:
      prune: false
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
      - RespectIgnoreDifferences=true
  ignoreDifferences:
    - group: ""
      kind: Secret
      jsonPointers:
        - /data

Two sync options deserve attention. ServerSideApply=true avoids the annotation size limit that trips up large Helm charts: the victoria-metrics chart generates resources with enough metadata that client-side apply reliably hits the 256KB annotation limit. RespectIgnoreDifferences=true works in conjunction with the ignoreDifferences block, which tells ArgoCD to stop flagging Secret /data as drifted. The chart manages Grafana credentials in a Secret; ArgoCD should not fight it.
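To get a feel for how close a resource is to that limit, its serialized size can be compared against the cap. This is only a rough sketch: `fits_client_side_apply` is a hypothetical helper, and it assumes the last-applied-configuration annotation written by client-side apply is about as large as the manifest itself.

```shell
# Kubernetes rejects objects whose annotations total more than 262144 bytes (256 KiB).
ANNOTATION_LIMIT=262144

# Hypothetical helper: read one manifest on stdin, print its size, and
# succeed only if it would fit inside the annotation limit.
fits_client_side_apply() {
  bytes=$(wc -c | tr -d ' ')
  printf '%s bytes (limit %s)\n' "$bytes" "$ANNOTATION_LIMIT"
  [ "$bytes" -le "$ANNOTATION_LIMIT" ]
}

# Example against a live cluster (the CRD name is a placeholder):
#   kubectl get crd <crd-name> -o json | fits_client_side_apply
```

Pretty-printed JSON overstates the size somewhat, so a result near the limit is a hint rather than a verdict.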
victoria-metrics values highlights
# apps/victoria-metrics/values.yaml
vmalert:
  enabled: false
alertmanager:
  enabled: false
# Disabled: service name exceeds 63 chars with this release name
kubeControllerManager:
  enabled: false
vmsingle:
  spec:
    retentionPeriod: "1" # 1 month
    storage:
      storageClassName: longhorn
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
grafana:
  enabled: true
  service:
    type: LoadBalancer
    loadBalancerIP: 192.168.55.203
  persistence:
    enabled: true
    storageClassName: longhorn
    size: 1Gi
  plugins:
    - victoriametrics-logs-datasource

The kubeControllerManager scrape is disabled for an unglamorous reason: the generated service name (victoria-metrics-victoria-metrics-k8s-stack-kube-controller-manager) is 67 characters long, and Kubernetes service names must be 63 characters or fewer. The chart does not expose a way to override the release name in that specific component, so it goes off for now.
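The length limit is easy to check before syncing. A minimal sketch in POSIX shell, assuming the generated name quoted above is exactly what the chart renders; `check_name_length` is a hypothetical helper, not part of any tool:

```shell
# Hypothetical helper: print a name's length and succeed only if it fits
# in a 63-character DNS label, the limit for Service names.
check_name_length() {
  name="$1"
  len=${#name}
  printf '%s: %d characters\n' "$name" "$len"
  [ "$len" -le 63 ]
}

# Name quoted in the text above; assumed to match what the chart renders.
svc="victoria-metrics-victoria-metrics-k8s-stack-kube-controller-manager"
check_name_length "$svc" || echo "too long for a Service name"
```

A shorter release name would bring the generated name under the limit, at the cost of renaming every other resource in the release.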
victoria-logs Application
# apps/root/templates/victoria-logs.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: victoria-logs
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: infrastructure
  sources:
    - repoURL: https://victoriametrics.github.io/helm-charts/
      chart: victoria-logs-single
      targetRevision: "0.11.28"
      helm:
        releaseName: victoria-logs
        valueFiles:
          - $values/apps/victoria-logs/values.yaml
    - repoURL: {{ .Values.repoURL }}
      targetRevision: {{ .Values.targetRevision }}
      ref: values
  destination:
    server: {{ .Values.destination.server }}
    namespace: monitoring
  syncPolicy:
    automated:
      prune: false
      selfHeal: true
    syncOptions:
      - ServerSideApply=true

The values for VictoriaLogs are deliberately minimal:
# apps/victoria-logs/values.yaml
server:
  retentionPeriod: 14d
  persistentVolume:
    enabled: true
    size: 20Gi
    storageClass: longhorn
  service:
    type: ClusterIP
    port: 9428

VictoriaLogs does not need a LoadBalancer service: nothing outside the cluster talks to it directly. Fluent Bit writes to it, and Grafana queries it, both from inside the cluster.
Gotcha 1: The Stale ValidatingWebhookConfiguration
This one cost about an hour.
Symptom
After the initial sync, VMSingle came up healthy. But vmagent never appeared — no Deployment, no pods, no events. The ArgoCD UI showed the Application as Synced/Degraded with a cryptic error about the VMAgent custom resource not reconciling. Running kubectl get vmagent -n monitoring showed the object existed. Running kubectl describe vmagent -n monitoring victoria-metrics-victoria-metrics-agent showed… nothing obviously wrong.
Diagnosis
The victoria-metrics operator uses a ValidatingWebhookConfiguration to validate its custom resources on creation and update. The sequence of events was:
- First install: operator deployed, registered webhook with its own CA bundle
- Something went wrong mid-install (likely the 63-char service name issue above)
- The Application was deleted and re-installed to start clean
- On reinstall, the old ValidatingWebhookConfiguration remained: it was not owned by the Helm release, so Helm did not delete it
- The stale webhook pointed at the old operator’s TLS certificate, which no longer matched the new pod’s certificate
- Every time the operator tried to reconcile the VMAgent resource, the API server called the webhook, the TLS handshake failed, and the reconciliation was silently dropped
The giveaway was in kubectl get events -n monitoring --sort-by='.lastTimestamp':
Warning FailedCreate validatingwebhookconfiguration/victoria-metrics-victoria-metrics-operator-admission
x509: certificate signed by unknown authority
Fix
Delete the stale webhook configuration. The operator re-registers it within seconds:
kubectl delete validatingwebhookconfiguration \
  victoria-metrics-victoria-metrics-operator-admission

Within about 30 seconds, the operator re-created the webhook with a fresh CA bundle matching the current pod’s certificate. The VMAgent Deployment appeared immediately, pods started, and scraping began.
The lesson: When a Kubernetes operator uses admission webhooks and the install/reinstall cycle is not clean, always check for stale ValidatingWebhookConfiguration or MutatingWebhookConfiguration objects. They survive Helm releases and cause exactly this kind of ghost-in-the-machine behavior where objects exist but nothing happens to them.
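A quick way to spot leftovers is to list both kinds of webhook configuration and filter by the release name. The sketch below assumes the release prefix `victoria-metrics`; `filter_webhooks` is a hypothetical helper that only filters names passed on stdin, so it can be tried without a cluster:

```shell
# Hypothetical helper: keep only webhook configurations whose name starts
# with the given release prefix. Expects `kubectl get ... -o name` output on stdin.
filter_webhooks() {
  prefix="$1"
  grep -E "^(validating|mutating)webhookconfiguration\.admissionregistration\.k8s\.io/${prefix}" || true
}

# Against a live cluster, after uninstalling and before reinstalling:
#   kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o name \
#     | filter_webhooks victoria-metrics
```

Anything this prints while the release is uninstalled is a leftover, and can be removed with kubectl delete as shown in the fix above.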
Gotcha 2: The Fluent Bit Hostname
Symptom
Fluent Bit DaemonSet was running on all nodes. No errors in kubectl logs. But querying VictoriaLogs showed zero documents. The Fluent Bit logs showed continuous retry loops with no error messages — just [engine] flush chunk ... retry=true.
Diagnosis
Fluent Bit’s [OUTPUT] block uses an HTTP plugin to forward logs. The Host field must resolve to a valid in-cluster DNS name. The initial configuration used:
Host victoria-logs-victoria-logs-single.monitoring.svc.cluster.local

That hostname does not exist. The victoria-logs-single chart names its Service with a -server suffix: the actual Service name is victoria-logs-victoria-logs-single-server. The chart does not document this clearly, and the default Helm release name (victoria-logs) combined with the chart name (victoria-logs-single) produces a long, non-obvious service name.
To find the correct service name:
kubectl get svc -n monitoring | grep victoria-logs
victoria-logs-victoria-logs-single-server ClusterIP 10.96.x.x <none> 9428/TCP
Fix
Update the Host in the Fluent Bit output config to use the correct service name:
# apps/fluent-bit/values.yaml (correct)
config:
  outputs: |
    [OUTPUT]
        Name http
        Match kube.*
        Host victoria-logs-victoria-logs-single-server.monitoring.svc.cluster.local
        Port 9428
        URI /insert/jsonline?_stream_fields=stream,kubernetes_pod_name,kubernetes_namespace_name,kubernetes_container_name&_msg_field=log&_time_field=time
        Format json_lines
        Json_Date_Key time
        Json_Date_Format iso8601
        Retry_Limit False

After the fix was committed and synced, log data started appearing in VictoriaLogs within the next Fluent Bit flush cycle (a few seconds).
The lesson: DNS failures in Kubernetes are silent killers. Fluent Bit does not differentiate between “server returned an error” and “DNS lookup failed” in its retry log output. When a log shipper shows retries with no error detail, the first check should always be whether the destination hostname resolves at all — a one-liner from any pod in the namespace confirms it: kubectl exec -n monitoring <any-pod> -- nslookup <hostname>.
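The in-cluster name can be assembled mechanically before it goes into any config. A sketch, assuming the default cluster.local cluster domain and the Service name reported by kubectl above; `svc_fqdn` is a hypothetical helper:

```shell
# Hypothetical helper: build the in-cluster DNS name of a Service.
# $1 = Service name, $2 = namespace. Assumes the default cluster domain.
svc_fqdn() {
  printf '%s.%s.svc.cluster.local\n' "$1" "$2"
}

# Service name as reported by `kubectl get svc` in the text above.
host=$(svc_fqdn victoria-logs-victoria-logs-single-server monitoring)
echo "$host"

# Resolution check from a pod that ships nslookup (pod name is a placeholder):
#   kubectl exec -n monitoring <any-pod> -- nslookup "$host"
```

Deriving the hostname from the Service name that kubectl actually reports, rather than from the release and chart names, avoids guessing at suffixes such as -server.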
VictoriaLogs + Fluent Bit: The Pipeline
With the hostname corrected, the pipeline is clean. Fluent Bit runs as a DaemonSet across all nodes — including the GPU node and control-plane nodes, which carry taints that would normally prevent scheduling:
# apps/fluent-bit/values.yaml (tolerations)
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Without these tolerations, the control-plane nodes (mini-1, mini-2, mini-3) and the GPU node (gpu-1) would not get a Fluent Bit pod, leaving their container logs uncollected.
The full pipeline is three stages:
INPUT (tail): Reads from /var/log/containers/*.log on the host filesystem. The multiline.parser docker, cri setting handles both Docker-format logs (from older container runtimes) and CRI-format logs (what Talos uses with containerd). Mem_Buf_Limit 5MB prevents the buffer from consuming unbounded memory on a busy node.
FILTER (kubernetes): Enriches each log line with Kubernetes metadata (pod name, namespace, container name, labels, and annotations) by querying the Kubernetes API. Merge_Log On flattens JSON logs written by applications into the top-level log record rather than nesting them under a log key. K8S-Logging.Exclude On respects the fluentbit.io/exclude: "true" pod annotation, giving individual workloads an opt-out.
OUTPUT (http, VictoriaLogs jsonline): Sends the enriched log lines to VictoriaLogs via HTTP POST. The URI carries three important parameters:
/insert/jsonline
  ?_stream_fields=stream,kubernetes_pod_name,kubernetes_namespace_name,kubernetes_container_name
  &_msg_field=log
  &_time_field=time
- _stream_fields tells VictoriaLogs which fields define a log stream (equivalent to Loki’s labels). Choosing pod name, namespace, and container name gives per-container granularity without over-cardinality.
- _msg_field=log maps Fluent Bit’s log field to VictoriaLogs’ message field.
- _time_field=time tells VictoriaLogs to use the log’s original timestamp rather than the ingestion time.
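Because the URI must end up on a single line in the Fluent Bit config, it can help to assemble it from its parts and paste the result. A small sketch using the field names from the output block above; `jsonline_uri` is a hypothetical helper:

```shell
# Hypothetical helper: assemble the jsonline ingestion URI from its parts.
# $1 = comma-separated stream fields, $2 = message field, $3 = time field
jsonline_uri() {
  printf '/insert/jsonline?_stream_fields=%s&_msg_field=%s&_time_field=%s\n' "$1" "$2" "$3"
}

uri=$(jsonline_uri \
  "stream,kubernetes_pod_name,kubernetes_namespace_name,kubernetes_container_name" \
  "log" "time")
echo "$uri"
```

The printed value is the single-line form to use in the URI setting of the [OUTPUT] block.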
Retry_Limit False means Fluent Bit will retry indefinitely on failure. This is appropriate for a homelab — we would rather have Fluent Bit buffering logs and retrying than dropping them silently when VictoriaLogs restarts for maintenance.
Gotcha 3: additionalDataSources Does Not Work
Symptom
After deploying both VictoriaMetrics and VictoriaLogs, Grafana had the VictoriaMetrics datasource pre-configured (handled by the chart), but VictoriaLogs was absent. Adding it via grafana.additionalDataSources in the victoria-metrics values had no effect regardless of how many times the Application was synced.
Diagnosis
The victoria-metrics-k8s-stack chart manages Grafana datasource provisioning through its own ConfigMap — victoria-metrics-victoria-metrics-k8s-stack-grafana-ds — rather than delegating to the Grafana subchart’s standard provisioning mechanism. This ConfigMap is templated and controlled by the VictoriaMetrics chart directly.
The consequence: grafana.additionalDataSources, which works by adding entries to the Grafana subchart’s own datasource provisioning ConfigMap, is never consulted. The chart simply does not pass that value through. The VictoriaMetrics chart’s own datasource ConfigMap overwrites whatever the subchart would have generated.
Fix
The solution is a standalone provisioning ConfigMap mounted into Grafana via extraConfigmapMounts — bypassing the chart’s own provisioning entirely:
# apps/victoria-metrics/manifests/grafana-victorialogs-ds.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-victorialogs-datasource
  namespace: monitoring
data:
  victorialogs-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: VictoriaLogs
        type: victoriametrics-logs-datasource
        access: proxy
        url: http://victoria-logs-victoria-logs-single-server.monitoring.svc.cluster.local:9428
        isDefault: false
        editable: false

This ConfigMap is deployed as a third source in the victoria-metrics ArgoCD Application and mounted into Grafana at /etc/grafana/provisioning/datasources/victorialogs.yaml via:
grafana:
  extraConfigmapMounts:
    - name: victorialogs-datasource
      mountPath: /etc/grafana/provisioning/datasources/victorialogs.yaml
      subPath: victorialogs-datasource.yaml
      configMap: grafana-victorialogs-datasource
      readOnly: true

On pod restart, Grafana reads the provisioning file and adopts the datasource, marking it read-only (non-editable in the UI). The datasource is now fully declarative: it will be recreated correctly on any Grafana redeploy, regardless of PVC state.
The lesson: Helm chart composition is leaky. When chart A embeds chart B as a subchart, chart A can intercept and override anything chart B would have done. Relying on subchart values working end-to-end is not safe without reading the parent chart’s templates. The escape hatch is extraConfigmapMounts — it operates at the Pod level and is independent of the chart’s own provisioning logic.
What Is Visible Now
With all three Applications healthy, the cluster has full observability:
Grafana at http://192.168.55.203 ships pre-built dashboards:
- Node Exporter Full — per-node CPU, memory, disk I/O, network throughput, filesystem usage. The mini nodes’ iGPU and the RTX 5070 show up in system metrics.
- Kubernetes / Compute Resources / Cluster — cluster-wide CPU and memory requests vs limits vs usage.
- Kubernetes / Compute Resources / Namespace — same broken down per namespace.
- Kubernetes / Networking — pod-to-pod traffic, DNS query rates, connection counts.
- VMAgent — internal metrics for the scraping engine: targets scraped, samples/sec, queue depth.
VictoriaLogs is queryable via Grafana’s Explore tab using LogQL-like syntax. Useful starting queries:
# All logs from the argocd namespace
{kubernetes_namespace_name="argocd"}
# Logs from a specific pod
{kubernetes_pod_name=~"victoria-metrics.*"}
# Error lines across the cluster
{kubernetes_namespace_name=~".+"} |= "error"

node-exporter is running on all six nodes (the Raspberry Pis count: raspi-1 and raspi-2 each contribute their ARM metrics to the same dashboards).
What is not yet visible: alerting. VMAlert and Alertmanager are disabled pending alert rule tuning. That is Layer 9.
References
- VictoriaMetrics Helm Charts: source for both victoria-metrics-k8s-stack and victoria-logs-single
- VictoriaLogs Documentation: ingestion API, query language, retention configuration
- Fluent Bit Documentation: input/filter/output plugin reference, multiline parsing, Kubernetes filter
- victoria-metrics-k8s-stack Chart: chart source, values reference, CRD documentation