
# Operating on ArgoCD Drift
This is a debugging-focused companion to Operating on GitOps. That post covers the day-to-day ArgoCD commands. This one is about what happens when the OutOfSync column stops being useful.
## The Problem
When I last ran `argocd app list`, I got this:

```
NAME              SYNC STATUS  HEALTH STATUS
argo-rollouts     OutOfSync    Progressing
gitea             OutOfSync    Progressing
gitea-extras      OutOfSync    Healthy
gpu-operator      OutOfSync    Healthy
grafana-alerting  OutOfSync    Healthy
...
```

Twenty apps. One third of the cluster. All permanently out of sync.
I’d been ignoring it. Everything worked. The dashboards were green. The workloads ran. “Out of sync” had become part of the scenery — the ArgoCD column equivalent of an unread count you stop looking at.
Then I decided to investigate. Not one bug. Seven. And one of them was hiding a twenty-one-day crashloop.
## How to Actually Diagnose Drift
The official diagnosis command is `argocd app diff <app>`. In principle it shows you exactly what ArgoCD thinks is different between git and the cluster. In practice, when you have twenty drifting apps, you want a bird’s-eye view first.
Start with per-app resource counts:
```sh
kubectl -n argocd get applications -o json \
  | jq -r '.items[] | .metadata.name as $app
      | .status.resources[]?
      | select(.status != "Synced")
      | "\($app)\t\(.kind)/\(.name)\t\(.namespace // "cluster")"' \
  | sort
```

That one pipe gives you the shape of the drift: which app has which kind drifting, at which scope. Patterns jump out immediately.
```
$ kubectl -n argocd get applications -o json | jq -r -f /tmp/drift.jq | sort | head -30
argocd                Role/argocd-redis-secret-init            argocd
argocd                RoleBinding/argocd-redis-secret-init     argocd
argocd                ServiceAccount/argocd-redis-secret-init  argocd
root                  Application/gpu-operator                 argocd
root                  Application/sympozium                    argocd
root                  Application/victoria-metrics             argocd
sympozium-extras      PersonaPack/developer-team               default
sympozium-extras      PersonaPack/devops-essentials            default
sympozium-extras      PersonaPack/platform-team                default
tekton-extras         EventListener/gitea-listener             tekton-pipelines
tekton-extras         Pipeline/gitea-ci                        tekton-pipelines
tekton-extras         Task/build-push                          tekton-pipelines
tekton-extras         Task/cosign-sign                         tekton-pipelines
tekton-extras         Task/git-clone                           tekton-pipelines
vcluster-experiments  StatefulSet/experiments                  vcluster-experiments
vk-remote             Job/postgres-vk-init-electric            agents
```
On my cluster the output was dominated by three kinds:
- `ExternalSecret/*` — ten different apps
- `Application/*` — twelve entries, all listed under the `root` app
- `CustomResourceDefinition/*` — twelve entries across argo-rollouts, tekton-pipelines, tekton-dashboard
Those aren’t random. Each cluster is its own drift class with its own fix.
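The classification starts with counting. Here is a minimal Python sketch of that grouping step; the inline JSON is sample data standing in for a real `kubectl -n argocd get applications -o json` dump:

```python
import json
from collections import Counter

# Stand-in for `kubectl -n argocd get applications -o json` output.
apps_json = """
{"items": [
  {"metadata": {"name": "gitea"},
   "status": {"resources": [
     {"kind": "ExternalSecret", "name": "gitea-admin", "status": "OutOfSync"},
     {"kind": "Deployment",     "name": "gitea",       "status": "Synced"}]}},
  {"metadata": {"name": "root"},
   "status": {"resources": [
     {"kind": "Application", "name": "gpu-operator", "status": "OutOfSync"}]}}
]}
"""

apps = json.loads(apps_json)
drift_by_kind = Counter(
    res["kind"]
    for app in apps["items"]
    for res in app.get("status", {}).get("resources", [])
    if res.get("status") != "Synced"
)
for kind, count in drift_by_kind.most_common():
    print(f"{kind}: {count}")
```

Synced resources drop out; what remains is the per-kind histogram that makes the drift classes visible.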
## Class A: CRD Schema Defaults
Every ExternalSecret in git looked like this:
```yaml
spec:
  target:
    name: paperclip-anthropic
    creationPolicy: Owner
  data:
    - secretKey: ANTHROPIC_API_KEY
      remoteRef:
        key: ANTHROPIC_API_KEY
```

The live object looked like this:
```yaml
spec:
  target:
    name: paperclip-anthropic
    creationPolicy: Owner
    deletionPolicy: Retain           # <-- defaulted by CRD schema
  data:
    - secretKey: ANTHROPIC_API_KEY
      remoteRef:
        key: ANTHROPIC_API_KEY
        conversionStrategy: Default  # <-- defaulted
        decodingStrategy: None       # <-- defaulted
        metadataPolicy: None         # <-- defaulted
```

The External Secrets CRD has default values baked into its OpenAPI schema. The API server injects them on `kubectl apply`. Git doesn’t have them. Three-way diff flags the gap. Forever.
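The mechanism is easy to reproduce without a cluster. A hedged Python sketch: treat git as one dict and the live object as another, with the schema default injected by hand to stand in for the API server, then list the fields that exist only on the live side:

```python
# What git declares for the ExternalSecret's spec.target.
desired = {"name": "paperclip-anthropic", "creationPolicy": "Owner"}

# What the API server stores: the CRD's OpenAPI schema injects
# defaults on admission (simulated by hand here, no real API call).
live = dict(desired, deletionPolicy="Retain")

# Fields present live but absent in git: re-flagged on every reconcile.
phantom = {k: v for k, v in live.items() if k not in desired}
print(phantom)  # {'deletionPolicy': 'Retain'}

# Pinning the default in git closes the diff permanently.
desired["deletionPolicy"] = "Retain"
print(live == desired)  # True
```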
Fix: pin the defaults in git so it matches what the CRD writes.
```yaml
spec:
  target:
    name: paperclip-anthropic
    creationPolicy: Owner
    deletionPolicy: Retain
  data:
    - secretKey: ANTHROPIC_API_KEY
      remoteRef:
        key: ANTHROPIC_API_KEY
        conversionStrategy: Default
        decodingStrategy: None
        metadataPolicy: None
```

Sixteen manifests, four lines each. A mechanical edit — but it’s the cheapest way to close the diff. The alternative (`ignoreDifferences`) hides real changes to those same fields if we ever start setting them intentionally, so I prefer pinning.
After pinning: ten apps moved from OutOfSync to Synced in two minutes.
## Class B: The Default-Value Phantom Diff
The root Application listed twelve child Applications as drifting. All of them. And every one of those templates had this block:
```yaml
syncPolicy:
  automated:
    prune: false
    selfHeal: true
```

The live Application CR had this:
```yaml
syncPolicy:
  automated:
    selfHeal: true
```

That’s it. `prune: false` isn’t present.
This is the mirror of Class A. ArgoCD’s Application CRD has prune: false as its schema default. When ArgoCD’s own controller writes the CR, it normalises the default away — absent and false are semantically identical. Git still has the explicit line. Three-way diff flags the gap on every one of the twelve child templates.
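The mirror-image mechanism can be sketched the same way. The defaults table below is illustrative, not ArgoCD’s actual code; the point is that a controller stripping default-valued fields makes `prune: false` and an absent `prune` key store identically:

```python
# Illustrative subset of the Application CRD's syncPolicy defaults;
# not ArgoCD's real tables.
SCHEMA_DEFAULTS = {"prune": False}

def normalise(sync_policy: dict) -> dict:
    """Drop any field whose value equals its schema default,
    the way a normalising controller stores the object."""
    return {k: v for k, v in sync_policy.items()
            if SCHEMA_DEFAULTS.get(k) != v}

git = {"prune": False, "selfHeal": True}   # explicit default in git
live = normalise(git)                      # what ends up stored

print(live)         # {'selfHeal': True}
print(git == live)  # False: a naive diff flags this forever

git.pop("prune")    # drop the explicit line, as the fix does
print(git == live)  # True
```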
Fix: drop the explicit line.
```sh
# Indentation in the sed pattern was lost in transcription; match any
# leading whitespace before the key.
for f in apps/root/templates/*.yaml; do
  sed -i '/^[[:space:]]*prune: false$/d' "$f"
done
```

Fifty-one templates. One commit. The root app went from permanently OutOfSync to Synced.
I also left a comment in apps/root/values.yaml explaining why:
```yaml
# Note: Application templates do NOT set automated.prune explicitly — the
# schema default (false) is our project-wide convention (manual pruning only).
# ArgoCD normalizes explicit `prune: false` to absent, which caused permanent
# drift on the root Application until we dropped the line from the templates.
```

There’s a sibling of this bug: `group: ""` in `ignoreDifferences` blocks. Same shape — ArgoCD treats empty-string groups as unset, strips them, and the three-way diff fires. Twenty-one more templates edited.
## Class C: Orphan CRDs
All twelve drifting CRDs were from Tekton and Argo Rollouts. The live CRDs carried no ArgoCD tracking-id annotation; the manifests were checked into git, but ArgoCD didn’t believe it owned the objects:
```sh
kubectl get crd rollouts.argoproj.io -o jsonpath='{.metadata.annotations}'
```

Output:
```json
{"controller-gen.kubebuilder.io/version":"v0.14.0","helm.sh/resource-policy":"keep"}
```

No `argocd.argoproj.io/tracking-id`. Those CRDs were installed by a pre-ArgoCD bootstrap — the cluster created them before the Application existed. ArgoCD won’t silently adopt strangers. So every reconcile it said “these aren’t mine, OutOfSync.”
Fix: explicitly annotate them:
```sh
for crd in analysisruns.argoproj.io analysistemplates.argoproj.io \
           clusteranalysistemplates.argoproj.io experiments.argoproj.io \
           rollouts.argoproj.io; do
  kubectl annotate crd "$crd" \
    "argocd.argoproj.io/tracking-id=argo-rollouts:apiextensions.k8s.io/CustomResourceDefinition:/$crd" \
    --overwrite
done
```

But adoption alone wasn’t enough. Even after annotation, ArgoCD still reported OutOfSync. The chart renders the CRDs without the `kubectl.kubernetes.io/last-applied-configuration` annotation that `kubectl apply` writes, and the three-way diff keeps flagging that mismatch. Apply succeeds every sync (`serverside-applied`), then the next comparison re-flags.
For CRDs specifically, I gave up fighting and used `ignoreDifferences`:

```yaml
ignoreDifferences:
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    jsonPointers:
      - /metadata/labels
      - /metadata/annotations
      - /spec/preserveUnknownFields
```

The schema still gets synced. Only the metadata noise is silenced.
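Conceptually, `ignoreDifferences` with `jsonPointers` deletes the pointed-at paths from both objects before comparing. A simplified Python sketch of that idea (map keys only, no array indices — not ArgoCD’s real normaliser):

```python
def drop_pointer(obj: dict, pointer: str) -> None:
    """Remove the field addressed by a JSON pointer like
    /metadata/annotations (map keys only, no array indices)."""
    parts = pointer.strip("/").split("/")
    for key in parts[:-1]:
        obj = obj.get(key, {})
    obj.pop(parts[-1], None)

live = {"metadata": {"annotations": {"helm.sh/resource-policy": "keep"},
                     "name": "rollouts.argoproj.io"},
        "spec": {"group": "argoproj.io"}}
desired = {"metadata": {"name": "rollouts.argoproj.io"},
           "spec": {"group": "argoproj.io"}}

# Strip the ignored paths from both sides, then compare what's left.
for ptr in ["/metadata/annotations", "/metadata/labels"]:
    drop_pointer(live, ptr)
    drop_pointer(desired, ptr)

print(live == desired)  # True: metadata noise silenced, spec still compared
```

A change to `/spec` would still show up, which is why this is safer than ignoring the whole object.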
## Class D: Zombie Sub-charts
Two Helm charts had the same smell: values set `enabled: false`, but the cluster still had resources.
```sh
grep -E "redis-cluster|mongodb|ingress" apps/gitea/values.yaml apps/infisical/values.yaml
```

```
apps/gitea/values.yaml:redis-cluster:
apps/gitea/values.yaml:  enabled: false
apps/infisical/values.yaml:ingress:
apps/infisical/values.yaml:  nginx:
```

But the cluster had `Service/gitea-redis-cluster`, `ConfigMap/infisical-ingress-nginx-controller`, and a dozen other orphans — ServiceAccounts, Roles, ValidatingWebhookConfigurations, an IngressClass. Subchart resources from when those features were enabled, kept alive by `prune: false`.
These aren’t reclaimable without confirming nothing uses them. The bitnami nginx orphans even included cluster-scoped resources that could break other apps if I got it wrong.
Pre-delete verification:
```sh
# Are any Ingress resources still using the nginx IngressClass?
kubectl get ingress -A -o json \
  | jq -r '.items[] | select(.spec.ingressClassName=="nginx")
      | "\(.metadata.namespace)/\(.metadata.name)"'
```

Empty output = safe. Everything on this cluster uses Traefik now, not nginx.
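If jq isn’t handy, the same verification works offline in Python against a saved dump. The inline JSON here is sample data standing in for real `kubectl get ingress -A -o json` output:

```python
import json

# Stand-in for a saved `kubectl get ingress -A -o json` dump.
ingresses = json.loads("""
{"items": [
  {"metadata": {"namespace": "gitea", "name": "gitea-web"},
   "spec": {"ingressClassName": "traefik"}}
]}
""")

still_on_nginx = [
    f'{i["metadata"]["namespace"]}/{i["metadata"]["name"]}'
    for i in ingresses["items"]
    if i["spec"].get("ingressClassName") == "nginx"
]
print(still_on_nginx)  # [] -> nothing references the nginx IngressClass
assert not still_on_nginx, "unsafe to delete the nginx orphans"
```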
Rollback dump (for every resource, before deleting):
```sh
mkdir -p /tmp/argocd-drift
kubectl get clusterrole infisical-ingress-nginx -o yaml \
  > /tmp/argocd-drift/rollback-infisical-clusterrole.yaml
# ...repeated for each object
```

Then delete, dependency-safe, one at a time:
```sh
kubectl -n infisical delete rolebinding infisical-ingress-nginx
kubectl -n infisical delete role infisical-ingress-nginx
kubectl delete clusterrolebinding infisical-ingress-nginx
kubectl delete clusterrole infisical-ingress-nginx
kubectl delete validatingwebhookconfiguration infisical-ingress-nginx-admission
kubectl delete ingressclass nginx
kubectl -n infisical delete cm infisical-ingress-nginx-controller mongodb-common-scripts
kubectl -n infisical delete sa infisical-ingress-nginx mongodb redis
```

If anything breaks: `kubectl apply -f /tmp/argocd-drift/rollback-*.yaml`.
Nothing broke. Infisical stayed Synced/Healthy through the whole deletion.
## Class E: Two Apps Fighting for One Namespace
`sympozium-extras` had a `namespace.yaml` adding pod-security labels. `sympozium` (the chart) also rendered a namespace. Both apps tried to own the `argocd.argoproj.io/tracking-id` annotation. Every sync, whoever got there second flagged OutOfSync.
The fix I tried first — `managedNamespaceMetadata` on the sympozium Application — didn’t work. That feature only applies to namespaces ArgoCD auto-creates; it can’t override a chart-rendered Namespace object.
The chart doesn’t expose `namespace.labels` in values. Forking the chart wasn’t worth it for three sticky labels.
Fix: apply the labels out-of-band, document as a manual op, delete the duplicate manifest.
```sh
kubectl label ns sympozium-system \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/warn=privileged \
  --overwrite
```

Labels are sticky — once applied, they survive chart re-renders. The plan captures this as a `# manual-operation` YAML block so the runbook registry knows about it.
## Class F: Terminal Hook Noise
`Job/postgres-vk-init-electric` and `PipelineRun/test-build-sign-5qtn4`. Both Complete. Both showing up in app status as OutOfSync because `requiresPruning: true` and the project-wide `prune: false` refused to clean them up.
The Job is an ArgoCD PostSync hook (`argocd.argoproj.io/hook: PostSync`). It creates a Postgres role idempotently. The chart recreates it on every sync; `hook-delete-policy: BeforeHookCreation` deletes the previous one before the next run, leaving a Completed job in the window between.
A single delete looked like it would make this a clean class — but the Job came back three minutes later, because it’s supposed to. That’s hook behaviour, not drift.
Disposition: a one-shot delete is fine; the permanent fix would be `hook-delete-policy: BeforeHookCreation,HookSucceeded` on the hook definition. I left that for a follow-up — it’s a genuine config improvement, not a drift fix.
## Class G: When the Chart and the Cluster Disagree
Four apps had drift that didn’t fit any of the above: victoria-metrics, gpu-operator, vcluster-experiments, infisical (its own Deployment), and the infisical-postgresql PDB.
Each one needed per-resource investigation. The pattern turned out to be the same: charts and operators inject fields the git source doesn’t specify, often timestamps or hashes that change between renders.
- `victoria-metrics-grafana` Deployment: `checksum/config`, `checksum/sc-dashboard-provider-config`, `checksum/secret` annotations on the pod template. The chart rotates them when ConfigMaps change. Narrow `ignoreDifferences` on those pointers.
- `gpu-operator` ClusterPolicy: the NVIDIA operator webhook defaults dozens of sub-fields we intentionally leave unset (driver off, toolkit off, CDI off — Talos handles the driver stack). Fighting field-by-field isn’t worth it. `ignoreDifferences` on `/spec` wholesale.
- `vcluster-experiments` StatefulSet: a `vClusterConfigHash` annotation plus Kubernetes-defaulted fields (`whenScaled`, `revisionHistoryLimit`, `updateStrategy`) that the chart doesn’t render.
- `infisical` Deployment: the chart stamps `updatedAt: "2026-04-04 UTC 21:31:24"` on every render. Nothing to do except ignore.
- `infisical-postgresql` PDB: standalone single-replica Postgres — the PDB provides zero protection, and the chart-rendered PDB (`maxUnavailable: ""`) diverges from what Kubernetes defaults the empty string to. Easiest fix: `pdb.create: false` in values. Delete the PDB entirely.
Each of these got a narrow ignoreDifferences in the Application CR, scoped to a specific JSON pointer. The full working object spec stays under GitOps control.
## The Unmasked Bug
The most important thing I learned wasn’t about ArgoCD normalisation or SSA three-way diffs.
Four of the twenty drifting apps had health Progressing. They’d been Progressing for weeks. I’d stopped looking at the column. In my head, Progressing meant “mid-reconcile, probably fine.”
When I resolved argo-rollouts’s drift, the controller logs became readable for the first time. The pods looked like this:
```
NAME                            READY  STATUS            RESTARTS          AGE
argo-rollouts-6b4c4dfbd9-ghl9c  0/1    CrashLoopBackOff  1154 (2m18s ago)  21d
```

One thousand one hundred and fifty-four restarts. Twenty-one days.
The pod log:
```
time="2026-04-16T21:10:14Z" level=info msg="Argo Rollouts starting" version=v1.8.4
time="2026-04-16T21:10:14Z" level=info msg="Downloading plugin argoproj-labs/cilium
  from: https://github.com/argoproj-labs/rollouts-plugin-trafficrouter-cilium/releases
  /download/v0.4.1/rollouts-plugin-trafficrouter-cilium-linux-amd64"
time="2026-04-16T21:10:14Z" level=fatal msg="Failed to download plugins: ...
  response code Not Found"
```

I had a `trafficRouterPlugins` entry in `apps/argo-rollouts/values.yaml` pointing at a Cilium traffic router plugin URL. The URL 404s. The plugin was never published — the argoproj-labs organisation has no such repo:
```sh
curl -s "https://api.github.com/orgs/argoproj-labs/repos?per_page=100" \
  | jq -r '.[] | select(.name | test("cilium|trafficrouter")) | .name'
```

Returns seven other traffic-router plugins (nginx, gatewayapi, contour, glooplatform, glooedge, openshift, consul). No cilium.
I must have added the config hoping the plugin existed, or copied it from stale documentation. Either way: twenty-one days of a crashlooping controller that every other monitoring signal masked.
Fix: delete the plugin config, add a comment, let the controller start up clean:
```yaml
controller:
  replicas: 1
  # Note: the Cilium traffic router plugin referenced in the original extras
  # ConfigMap points to a release URL that 404s — the plugin was never
  # published on GitHub. The controller crash-looped for 21 days because of
  # this. Leaving plugin config unset until a real Cilium traffic router
  # plugin exists. Canary/blueGreen on Deployments still work without it.
```

No live Rollout was using the plugin. Nothing actually depended on it. The whole thing was aspirational config that broke the controller on startup and was invisible because the noise drowned it out.
This is the argument for taking drift seriously. A healthy ArgoCD install isn’t one where every app is Synced — it’s one where OutOfSync actually means something is wrong. When the column is always red, you stop reading it.
## The Final Tally
Starting point: 20 of 52 apps OutOfSync. End state: 2 of 52 OutOfSync. Both Healthy. Both functionally fine.
The two residuals:
- `tekton-extras` — Task/Pipeline/EventListener resources report OutOfSync. `kubectl apply --dry-run=server -f` shows no delta. Every sync logs `serverside-applied`. Then the next comparison re-flags. An ArgoCD-Tekton SSA quirk I couldn’t fully run to ground in this pass.
- `vcluster-experiments` StatefulSet — every field I could identify as drifted got an `ignoreDifferences` pointer. It still flags. Something in how the chart normalises the StatefulSet spec that I haven’t pinpointed.
I’m keeping both as named residuals instead of blanket-ignoring the resources wholesale, because narrow-then-wider escalation is the right path if either turns into a real problem. If you can’t explain the drift, at least name it so future-you can tell the noise from the signal.
## Takeaways
- `OutOfSync` is a signal, not decoration. If most of your apps are always OutOfSync, the column is broken. Fix it, or mute it deliberately — don’t normalise a partially red dashboard.
- Classify before you fix. Seven drift classes. Each needed a different fix. Blanket-ignoring everything would have worked on the dashboard and hidden the 21-day crashloop.
- Pin schema defaults in git where possible. Preferred over `ignoreDifferences` because real changes still flag. Only reach for `ignoreDifferences` when the mutator is a controller/webhook you don’t own.
- Dump before you delete. `kubectl get -o yaml > rollback.yaml` takes two seconds and saves you from a bad Monday.
- Read the logs of every app that moves from `Progressing` to a new state. That’s where things hide.
## References
- Plan: `docs/superpowers/plans/2026-04-15--gitops--argocd-drift-cleanup.md`
- Spec: `docs/superpowers/specs/2026-04-15--gitops--argocd-drift-cleanup-design.md`
- Related: Operating on GitOps — day-to-day ArgoCD CLI commands
- ArgoCD `ignoreDifferences`
- ArgoCD `managedNamespaceMetadata`
- Kubernetes Server-Side Apply field management