
Running occ Commands (NextcloudCommand)

Sometimes you need to run a one-off occ command against a running Nextcloud instance: install an app, toggle a setting, kick off a repair. Historically, that meant running kubectl exec against the pod or asking an operator maintainer to bake a new canned flow into the codebase.

The NextcloudCommand CRD exposes the operator's internal occ runner as a kubectl-native API:

  • Declarative — apply a YAML, the operator runs the commands.
  • Auditable — per-command exit codes, stdout/stderr snippets, and timestamps land in .status.
  • Self-cleaning — objects are garbage-collected after spec.ttlSecondsAfterFinished (default 7 days; set 0 to keep forever).
  • Safe argv — commands are argv arrays, not shell strings; there is no shell interpolation in the target pod.

Quick start

apiVersion: k8s.bnerd.com/v1alpha1
kind: NextcloudCommand
metadata:
  name: occ-status
  namespace: default
spec:
  targetRef:
    kind: NextcloudInstance
    name: my-nextcloud
  commands:
    - ["status", "--output=json"]

Save the manifest as command-occ-status.yaml, then apply it and follow its progress:

kubectl apply -f command-occ-status.yaml
kubectl get nccmd -w            # watch Pending → Running → Succeeded
kubectl get nccmd occ-status -o jsonpath='{.status.results[0].stdoutSnippet}'
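If you generate commands from a script rather than writing YAML by hand, a minimal Python sketch like the one below can build the same object. It relies only on the fields shown above; the helper name occ_command_manifest is hypothetical, and the argv check is client-side only (it mirrors the "argv arrays, not shell strings" rule but is not the operator's own validation).

```python
import json
from typing import Any, Dict, List


def occ_command_manifest(name: str, namespace: str, target_name: str,
                         commands: List[List[str]],
                         target_kind: str = "NextcloudInstance",
                         **spec_extra: Any) -> Dict[str, Any]:
    """Build a NextcloudCommand as a plain, JSON-serializable dict."""
    # Client-side check: each command must be a non-empty argv list of strings.
    # A shell-style string such as "status --output=json" is rejected.
    for argv in commands:
        if (not isinstance(argv, list) or not argv
                or not all(isinstance(arg, str) for arg in argv)):
            raise ValueError(f"command must be a non-empty list of strings: {argv!r}")
    spec: Dict[str, Any] = {
        "targetRef": {"kind": target_kind, "name": target_name},
        "commands": [list(argv) for argv in commands],
    }
    # Optional fields such as haltOnError=True or timeoutSeconds=600.
    spec.update(spec_extra)
    return {
        "apiVersion": "k8s.bnerd.com/v1alpha1",
        "kind": "NextcloudCommand",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = occ_command_manifest("occ-status", "default", "my-nextcloud",
                                    [["status", "--output=json"]])
    print(json.dumps(manifest, indent=2))
```

Because kubectl apply accepts JSON as well as YAML, the output can be piped straight in, e.g. python build_cmd.py | kubectl apply -f - (the script name is hypothetical).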

Multi-step example

apiVersion: k8s.bnerd.com/v1alpha1
kind: NextcloudCommand
metadata:
  name: enable-richdocuments
  namespace: default
spec:
  targetRef:
    kind: NextcloudInstance
    name: my-nextcloud
  commands:
    - ["app:install", "richdocuments"]
    - ["app:enable", "richdocuments"]
    - ["config:app:set", "richdocuments", "wopi_url", "--value=https://office.example.com"]
  timeoutSeconds: 600
  perCommandTimeoutSeconds: 180
  haltOnError: true
  ttlSecondsAfterFinished: 604800   # keep for 7 days (this is the default)

Commands run serially. With haltOnError: true (the default), the first non-zero exit stops the run and marks the CR Failed.
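The serial, halt-on-first-failure behaviour can be modelled in a few lines. This is a simplified sketch, not the operator's code: run_occ is a hypothetical stand-in for the pod-exec call, and the phase returned when haltOnError is false (Failed if any command failed) is an assumption that the text above does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class CommandResult:
    command: List[str]
    exit_code: int


def run_serially(commands: Sequence[Sequence[str]],
                 run_occ: Callable[[List[str]], int],
                 halt_on_error: bool = True) -> Tuple[List[CommandResult], str]:
    """Run each argv in order and return (results, phase)."""
    results: List[CommandResult] = []
    for argv in commands:
        code = run_occ(list(argv))
        results.append(CommandResult(list(argv), code))
        if code != 0 and halt_on_error:
            # First non-zero exit stops the run; later commands never start.
            return results, "Failed"
    # Assumption: without haltOnError, any non-zero exit still marks the CR Failed.
    failed = any(r.exit_code != 0 for r in results)
    return results, "Failed" if failed else "Succeeded"
```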

Targeting

spec.targetRef.kind may be:

  • NextcloudInstance — direct reference to the physical instance in the same namespace.
  • Nextcloud — the operator reads the Nextcloud's .status.instanceRef to find the assigned NextcloudInstance (which may live in a different namespace, e.g. for pool-assigned instances). If the Nextcloud isn't assigned yet, the run is deferred with a TemporaryError and retried.
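The resolution rule can be sketched as a small function. Assumptions: get_nextcloud is a hypothetical lookup returning the Nextcloud object as a dict, .status.instanceRef is assumed to carry a name and (for pool instances) a namespace, and TemporaryError is a local stand-in for the retryable error the operator raises (kopf, which the operator uses, provides kopf.TemporaryError).

```python
from typing import Any, Callable, Dict, Tuple


class TemporaryError(Exception):
    """Local stand-in for a retryable error such as kopf.TemporaryError."""


def resolve_target(target_ref: Dict[str, str], namespace: str,
                   get_nextcloud: Callable[[str, str], Dict[str, Any]]) -> Tuple[str, str]:
    """Return (namespace, name) of the NextcloudInstance to run against."""
    kind, name = target_ref["kind"], target_ref["name"]
    if kind == "NextcloudInstance":
        # Direct reference, always in the command's own namespace.
        return namespace, name
    if kind == "Nextcloud":
        nextcloud = get_nextcloud(namespace, name)
        ref = (nextcloud.get("status") or {}).get("instanceRef")
        if not ref:
            # Not assigned yet: defer and let the retry policy try again.
            raise TemporaryError(f"Nextcloud {namespace}/{name} is not assigned yet")
        # Assumption: instanceRef has a name and, for pool instances, a namespace.
        return ref.get("namespace", namespace), ref["name"]
    # Anything else is a permanent error: no retry.
    raise ValueError(f"unsupported targetRef.kind: {kind!r}")
```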

Result reporting

Entries are appended to .status.results[] in the order of spec.commands. Each entry contains:

  • command: the argv that was run.
  • exitCode: the occ exit code; -1 means the command did not run because the overall timeoutSeconds budget was exceeded.
  • stdoutSnippet / stderrSnippet: the first 4 KiB and the last 4 KiB of the output, joined with \n...<truncated>...\n when the output is longer than that. This is enough to diagnose failures without hitting etcd object-size limits.
  • startedAt / finishedAt: per-command timestamps.
  • timedOut: true when the overall timeoutSeconds cap fired before this command could run.

Use kubectl get nccmd <name> -o yaml for full details; plain kubectl get nccmd shows the printer columns (Phase, Target, Kind, Message) as a summary.
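The snippet rule is easy to reproduce if you post-process output yourself. A minimal sketch, assuming the limits are counted in characters of decoded text; the operator may count bytes instead, which differs for non-ASCII output. The names snippet, HEAD_CHARS, and TAIL_CHARS are hypothetical.

```python
HEAD_CHARS = 4 * 1024  # keep the first 4 KiB
TAIL_CHARS = 4 * 1024  # and the last 4 KiB
MARKER = "\n...<truncated>...\n"


def snippet(output: str, head: int = HEAD_CHARS, tail: int = TAIL_CHARS) -> str:
    """Shorten long command output to its head and tail around a marker."""
    if len(output) <= head + tail:
        # Short output is stored verbatim.
        return output
    return output[:head] + MARKER + output[-tail:]
```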

RBAC

NextcloudCommand is powerful — it lets the holder run arbitrary occ subcommands, which can modify config, users, and app state. Treat creation permissions as equivalent to admin access on the target instance. Grant create/get/list/watch on nextcloudcommands only to subjects that should have operator-level control, typically cluster-admins or a dedicated automation ServiceAccount.

The operator itself already has pods/exec and NextcloudCommand CRUD permissions via the cluster role shipped with the chart.

Secret handling

The operator sanitizes argument values that follow flags matching password|secret|key|token|credential (case-insensitive) in its debug logs — both --password=value and --password value shapes are redacted. That covers the common occ flag conventions. Note: the spec.commands themselves are readable by anyone with get on the CR, so don't paste production secrets into the spec if you can avoid it — use config:system:set --value=@/path/to/file patterns or pre-populated Nextcloud secrets where possible.
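The redaction rule can be approximated with one case-insensitive regular expression. This is a sketch of the behaviour described above, not the operator's implementation; redact_argv is a hypothetical name, and a sensitive flag followed directly by another flag (for example a boolean switch) is assumed to have no value to hide.

```python
import re
from typing import List

SENSITIVE_FLAG = re.compile(r"password|secret|key|token|credential", re.IGNORECASE)


def redact_argv(argv: List[str]) -> List[str]:
    """Return a copy of argv that is safe to write to debug logs."""
    out: List[str] = []
    redact_next = False
    for arg in argv:
        if redact_next:
            redact_next = False
            if not arg.startswith("-"):
                out.append("***")  # value of a preceding "--password value" pair
                continue
        if arg.startswith("-"):
            flag, has_value, _ = arg.partition("=")
            if SENSITIVE_FLAG.search(flag):
                if has_value:
                    out.append(f"{flag}=***")  # "--password=value" shape
                else:
                    out.append(arg)
                    redact_next = True  # the value, if any, is the next argument
                continue
        out.append(arg)
    return out
```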

Retry policy

Transient failures from the underlying pod-exec primitive (TemporaryError — pod not found, transport error, handshake failure) are retried automatically with exponential backoff:

  • Max attempts: 10
  • Backoff: 15 s → 30 s → 60 s → 120 s → 240 s, capped at 300 s thereafter
  • Worst-case total wait: about 25 minutes of backoff before the CR is marked Failed

While retrying, .status.conditions[] carries a Progressing/Retrying condition whose message includes the attempt number and the underlying error, and an event is posted on the CR for every retry. When retries exhaust, the CR transitions to phase: Failed with reason RetriesExhausted — it never hangs at Running.

Permanent failures (validation errors, invalid targetRef) fail immediately with no retry.
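The documented schedule is a doubling backoff with a ceiling. The sketch below reproduces it under two assumptions: the first attempt runs immediately, and no wait follows the final failure, so ten attempts involve nine waits. Under that reading the waits add up to 1,665 s, roughly 28 minutes; the exact figure depends on how the operator counts attempts. The constant and function names are hypothetical.

```python
from typing import List

BASE_DELAY_S = 15
MAX_DELAY_S = 300
MAX_ATTEMPTS = 10


def backoff_delay(retry: int) -> int:
    """Wait before retry number `retry` (1-based): 15 s doubled each time, capped."""
    return min(BASE_DELAY_S * 2 ** (retry - 1), MAX_DELAY_S)


def wait_schedule(max_attempts: int = MAX_ATTEMPTS) -> List[int]:
    # Attempt 1 runs immediately, so there is one wait before each later attempt.
    return [backoff_delay(n) for n in range(1, max_attempts)]


if __name__ == "__main__":
    waits = wait_schedule()
    print(waits, sum(waits))
```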

Lifecycle hooks

Beyond the on-demand pattern shown above, a NextcloudCommand (NCC) can be gated on the target instance being Ready and combined with a small bash beforeScript for setup. Operator-managed lifecycle hooks, declared on NextcloudInstance.spec.hooks or inherited from a Nextcloud, NextcloudProfile, or NextcloudPool, materialize these NCCs automatically at well-defined moments.

Triggers

  • onFirstReady: fires when the instance reaches Ready for the first time in its lifetime. Cardinality: once per instance. Settable on: Profile, Pool template, Nextcloud, NCI.
  • onAssignmentReady: fires when a pool instance reaches Ready after being assigned to a Nextcloud. Cardinality: once per assignment event. Settable on: Profile, Nextcloud, NCI.
  • onEveryReady: fires every time the instance reaches Ready, including after upgrades and recovery. Cardinality: re-fires, with a fresh NCC per Ready edge. Settable on: Profile, Pool template, Nextcloud, NCI.

A single wall-clock Ready edge can fire multiple triggers. For example, a fresh NCI's first Ready fires onFirstReady and onEveryReady. A pool instance's first Ready after assignment fires onAssignmentReady and onEveryReady.
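How one Ready edge fans out to several triggers can be modelled as a small state machine. A sketch under stated assumptions: ReadyEdgeTracker is hypothetical, assignment is modelled as a plain assign() call, and onAssignmentReady fires on the first Ready edge after each new assignment.

```python
from typing import List, Optional


class ReadyEdgeTracker:
    """Decide which lifecycle triggers fire on each Ready edge of one instance."""

    def __init__(self) -> None:
        self.ever_ready = False
        self.assigned_to: Optional[str] = None
        self.assignment_fired_for: Optional[str] = None
        self.edges = 0  # Ready-edge count for this instance

    def assign(self, tenant: str) -> None:
        self.assigned_to = tenant

    def on_ready_edge(self) -> List[str]:
        self.edges += 1
        fired = ["onEveryReady"]  # every edge, including upgrades and recovery
        if not self.ever_ready:
            self.ever_ready = True
            fired.append("onFirstReady")  # once in the instance's lifetime
        if self.assigned_to is not None and self.assignment_fired_for != self.assigned_to:
            self.assignment_fired_for = self.assigned_to
            fired.append("onAssignmentReady")  # once per assignment
        return sorted(fired)
```

A fresh pool instance first returns ["onEveryReady", "onFirstReady"]; after assign("tenant-a"), its next Ready edge returns ["onAssignmentReady", "onEveryReady"].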

Declaring hooks

apiVersion: k8s.bnerd.com/v1alpha1
kind: NextcloudInstance
metadata:
  name: my-tenant
spec:
  hooks:
    onFirstReady:
    - name: baseline-apps
      commands:
      - ["app:install", "richdocuments"]
      - ["app:enable", "richdocuments"]
    onAssignmentReady:
    - name: brand
      beforeScript: |
        set -e
        mkdir -p /var/www/html/themes/custom
        curl -fsSL https://assets.example.com/logo.png \
          -o /var/www/html/themes/custom/logo.png
      commands:
      - ["theming:config", "logo", "/var/www/html/themes/custom/logo.png"]
    onEveryReady:
    - name: warm-cache
      commands:
      - ["files:scan", "--all"]

Cascade

Hooks set on NextcloudProfile.spec.defaults, NextcloudPool.spec.template.spec, Nextcloud.spec, or NextcloudInstance.spec cascade through the standard order: Profile → Pool → Nextcloud → NCI. The last source wins (replace, not concatenate).

The pool template intentionally does not accept onAssignmentReady — pool instances don't yet know their tenant. Set it on the Nextcloud or NCI instead.
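A minimal sketch of the cascade. It assumes hooks are merged per trigger: a trigger list from a later source replaces the earlier list for that trigger, while triggers the later source does not set are inherited. The text above says the last source wins but does not spell out the granularity. cascade_hooks is a hypothetical name, and each source is a dict shaped like spec.hooks.

```python
from typing import Any, Dict, List, Optional

TRIGGERS = ("onFirstReady", "onAssignmentReady", "onEveryReady")
Hooks = Dict[str, List[Dict[str, Any]]]


def cascade_hooks(profile: Optional[Hooks] = None, pool: Optional[Hooks] = None,
                  nextcloud: Optional[Hooks] = None, nci: Optional[Hooks] = None) -> Hooks:
    """Merge hooks in Profile -> Pool -> Nextcloud -> NCI order; last source wins."""
    if pool and "onAssignmentReady" in pool:
        # Pool instances don't know their tenant yet.
        raise ValueError("pool template does not accept onAssignmentReady")
    effective: Hooks = {}
    for source in (profile, pool, nextcloud, nci):
        for trigger in TRIGGERS:
            if source and trigger in source:
                # Replace, not concatenate: the later list overrides the earlier one.
                effective[trigger] = list(source[trigger])
    return effective
```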

Generated NCC names

Operator-materialized NCCs are deterministically named:

  • {instance}-hk-fr-{name}-{8char-hash} for onFirstReady
  • {instance}-hk-ar-{name}-{8char-hash} for onAssignmentReady
  • {instance}-hk-er-{name}-e{edge}-{8char-hash} for onEveryReady (the edge counter ensures a fresh NCC per Ready edge)

Editing a hook entry changes its hash, so a new NCC is created and fires. Old NCCs remain as audit records and are TTL-cleaned after 7 days (override per NCC with ttlSecondsAfterFinished).
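The naming scheme can be sketched as below. The operator's actual hash input and algorithm are not documented here; this sketch assumes the first 8 hex characters of a SHA-256 over the hook entry's canonical JSON, which is enough to show why editing an entry yields a new name while re-reconciling an unchanged entry does not. hook_ncc_name and TRIGGER_CODES are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional

TRIGGER_CODES = {"onFirstReady": "fr", "onAssignmentReady": "ar", "onEveryReady": "er"}


def hook_ncc_name(instance: str, trigger: str, entry: Dict[str, Any],
                  edge: Optional[int] = None) -> str:
    """Deterministic NCC name for one hook entry (hash scheme is hypothetical)."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    parts = [instance, "hk", TRIGGER_CODES[trigger], entry["name"]]
    if trigger == "onEveryReady":
        if edge is None:
            raise ValueError("onEveryReady names need a Ready-edge counter")
        parts.append(f"e{edge}")  # fresh name per Ready edge
    parts.append(digest)
    return "-".join(parts)
```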

Operator-wide TTL cap

Set the operator env var NCC_GC_MAX_TTL_SECONDS to cap the effective GC TTL of every finished NextcloudCommand: any per-NCC ttlSecondsAfterFinished larger than the cap is lowered to it. Unset or 0 means no cap. This is useful to drain a large backlog of finished commands (e.g. from a high-frequency onEveryReady hook) without patching or mass-deleting them: set it low, let the operator's own garbage collector clean up, then raise or remove it. A per-NCC TTL of 0 (keep forever) is never overridden.
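The interaction between the per-NCC TTL, the 7-day default, and the operator-wide cap fits in one function. A sketch of the rules stated above, with hypothetical function names; the real garbage collector may compute expiry from different timestamps.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional

DEFAULT_TTL_S = 7 * 24 * 3600  # 604800 s, the documented default


def effective_ttl(spec_ttl: Optional[int] = None,
                  env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Seconds to keep a finished NCC, or None for 'keep forever'."""
    ttl = DEFAULT_TTL_S if spec_ttl is None else spec_ttl
    if ttl == 0:
        return None  # a per-NCC 0 is never overridden by the cap
    cap = int(env.get("NCC_GC_MAX_TTL_SECONDS") or 0)
    # Unset or 0 means no cap; otherwise the cap can only lower the TTL.
    return min(ttl, cap) if cap > 0 else ttl


def is_expired(finished_at: datetime, spec_ttl: Optional[int] = None,
               env: Mapping[str, str] = os.environ,
               now: Optional[datetime] = None) -> bool:
    ttl = effective_ttl(spec_ttl, env)
    if ttl is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now - finished_at >= timedelta(seconds=ttl)
```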

Self-heal for stranded Pending OnReady commands

A Pending OnReady NCC whose on_create retry chain was abandoned (e.g. after a long string of operator restarts) would otherwise stay stuck forever — kopf treats its create handler as already-handled and won't re-gate. The operator runs a periodic timer that re-invokes the gate for such stranded commands; once the target is Ready, the hook runs and transitions to Succeeded. Tunable via env vars:

  • TIMER_COMMAND_REGATE_INTERVAL — timer interval in seconds (default 300).
  • NCC_REGATE_STALE_SECONDS — a Pending OnReady command must have had no status transition within this window before the timer treats it as stranded. Default 600. Lower for faster recovery, higher to avoid racing on_create's own active retries.
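The staleness test the timer applies can be sketched as a predicate. Assumption: the caller supplies the time of the command's last status transition; which status field the operator actually reads for this is not documented here. stale_window and is_stranded are hypothetical names.

```python
import os
from datetime import datetime, timedelta
from typing import Mapping


def stale_window(env: Mapping[str, str] = os.environ) -> int:
    """NCC_REGATE_STALE_SECONDS, defaulting to 600."""
    return int(env.get("NCC_REGATE_STALE_SECONDS") or 600)


def is_stranded(lifecycle: str, phase: str, last_transition: datetime,
                now: datetime, stale_seconds: int = 600) -> bool:
    """True for a Pending OnReady command with no status change inside the window."""
    return (lifecycle == "OnReady"
            and phase == "Pending"
            and now - last_transition >= timedelta(seconds=stale_seconds))
```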

Standalone OnReady NCCs

You can also create NCCs directly with spec.lifecycle: OnReady to gate a one-off command on Ready, without using spec.hooks. The same gate logic applies (wait for phase=Ready AND a Running+Ready pod). The retry budget for OnReady is generous (~200 minutes) so first reconciles that include slow database provisioning can still converge before the CR self-fails.
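The Ready gate can be expressed as a predicate over the instance and its pods. Sketch assumptions: the instance phase is read from .status.phase, and pods are plain dicts shaped like the Kubernetes Pod API (status.phase and status.conditions); pod_running_and_ready and target_is_ready are hypothetical names.

```python
from typing import Any, Dict, Iterable


def pod_running_and_ready(pod: Dict[str, Any]) -> bool:
    """Pod phase is Running and its Ready condition is True."""
    status = pod.get("status") or {}
    if status.get("phase") != "Running":
        return False
    return any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in status.get("conditions") or [])


def target_is_ready(instance: Dict[str, Any], pods: Iterable[Dict[str, Any]]) -> bool:
    """phase=Ready on the instance AND at least one Running+Ready pod."""
    if (instance.get("status") or {}).get("phase") != "Ready":
        return False
    return any(pod_running_and_ready(p) for p in pods)
```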

Failure surface

If a hook NCC ends in phase: Failed, the NCI gains a LifecycleHookFailed=True condition with the failing NCC name in the message. The NCI's own phase stays Ready — the instance is serving; only the hook failed. Inspect the failing NCC's .status for the exit codes and snippets.

beforeScript notes

  • Runs in the nextcloud container via /bin/bash -c <script> (same container as occ).
  • No environment-variable injection in v1; reference values via filesystem mounts (e.g. existing chart-mounted secrets) or hard-code in the script.
  • A non-zero exit code marks the NCC Failed with condition BeforeScriptFailed and skips the commands array. The exit code, stdout, and stderr land in .status.beforeScriptResult.
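The gating between beforeScript and the commands array can be sketched as follows. Assumptions: run_commands is a hypothetical callable that runs the commands array and returns the resulting phase, the result keys exitCode/stdout/stderr are illustrative, and the default executor runs bash locally purely for illustration, whereas the operator runs the script inside the nextcloud container.

```python
import subprocess
from typing import Callable, Dict, List, Optional, Tuple

Executor = Callable[[List[str]], Tuple[int, str, str]]


def local_bash(argv: List[str]) -> Tuple[int, str, str]:
    """Illustration only: runs locally, whereas the operator execs into the pod."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def run_hook(before_script: Optional[str], run_commands: Callable[[], str],
             execute: Executor = local_bash) -> Dict[str, object]:
    """Run the optional beforeScript, then the commands array if it succeeded."""
    before = None
    if before_script:
        code, out, err = execute(["/bin/bash", "-c", before_script])
        before = {"exitCode": code, "stdout": out, "stderr": err}
        if code != 0:
            # Non-zero exit: fail with BeforeScriptFailed and skip the commands.
            return {"phase": "Failed", "condition": "BeforeScriptFailed",
                    "beforeScriptResult": before}
    return {"phase": run_commands(), "condition": None,
            "beforeScriptResult": before}
```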

Limitations (v1)

  • Spec is immutable. To re-run, apply a new CR. This keeps .status a faithful audit trail of a single run.
  • No live streaming. Output lands in .status after each command finishes. For very long commands, consider splitting them or polling .status with kubectl get -w.
  • No per-instance serialization. The operator does not serialize NextcloudCommand CRs that target the same instance, so preventing concurrent runs is the caller's responsibility. Nextcloud is not always safe for parallel occ execution (e.g. during an upgrade), so serialize in the caller.
  • No --no-interaction injected. You supply the literal argv. Add -n / --no-interaction yourself when running commands that might prompt; otherwise they will block and hit the per-command timeout.
  • No webhook callbacks / synchronous HTTP API — deferred to a later iteration. Poll .status.phase to detect completion.
  • Non-idempotent commands. If the operator retries the handler (e.g. because the pod was transiently unavailable), commands run again from the start. Commands like config:app:set, which write the same value each time, are safe to repeat; commands like user:add are not. Prefer idempotent occ subcommands when possible.
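Two of these limitations are easy to handle on the caller side. The sketch below appends --no-interaction to an argv when neither -n nor --no-interaction is present, and polls .status.phase until the command reaches Succeeded or Failed. The helper names and the 900 s default are hypothetical; kubectl_phase shells out to kubectl with the nccmd short name used earlier on this page.

```python
import subprocess
import time
from typing import Callable, List, Sequence

TERMINAL_PHASES = {"Succeeded", "Failed"}


def with_no_interaction(argv: Sequence[str]) -> List[str]:
    """Append --no-interaction unless -n or --no-interaction is already there."""
    argv = list(argv)
    if "-n" in argv or "--no-interaction" in argv:
        return argv
    return argv + ["--no-interaction"]


def kubectl_phase(name: str, namespace: str = "default") -> str:
    """Read .status.phase of a NextcloudCommand via kubectl."""
    proc = subprocess.run(
        ["kubectl", "get", "nccmd", name, "-n", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True)
    return proc.stdout.strip()


def wait_for_phase(get_phase: Callable[[], str], timeout_s: float = 900,
                   poll_s: float = 5, sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> str:
    """Poll until the phase is terminal; raise TimeoutError after timeout_s."""
    deadline = clock() + timeout_s
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if clock() >= deadline:
            raise TimeoutError(f"phase still {phase!r} after {timeout_s}s")
        sleep(poll_s)
```

For example, wait_for_phase(lambda: kubectl_phase("enable-richdocuments")) blocks until the multi-step example above finishes.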

See also