Troubleshooting

Day-2 reference for diagnosing a NextcloudInstance, Nextcloud, or NextcloudPool that isn't behaving as expected. Start with State inspection, then check the logs, then match your symptom under Common errors.

State inspection: always start here

NS=nextcloud-demo
NAME=demo

# 1. CRD status + events (events often contain the real reason)
kubectl describe nci $NAME -n $NS

# 2. Conditions (ready / failed / why)
kubectl get nci $NAME -n $NS -o jsonpath='{.status.conditions}' | jq

# 3. Resolved Helm chart version
kubectl get nci $NAME -n $NS -o jsonpath='{.status.versionResolution}' | jq

# 4. Downstream HelmRelease
kubectl describe helmrelease -n $NS

# 5. Pods (anything CrashLoopBackOff / ImagePullBackOff / Pending?)
kubectl get pods -n $NS
kubectl describe pod -n $NS <pod-name>

# 6. Managed database (only if spec.database.managed=true)
kubectl get perconapgcluster -n $NS
kubectl describe perconapgcluster -n $NS

# 7. Secrets created by the operator
kubectl get secret -n $NS -l app.kubernetes.io/managed-by=nextcloud-operator

For the logical Nextcloud tenant resource:

kubectl describe nc $NAME -n $NS
kubectl get nc $NAME -n $NS -o jsonpath='{.status.instanceRef}' | jq
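
When the output needs to be compared over time or attached to a report, the numbered steps above can be captured in one pass. This is a minimal sketch, assuming kubectl is on the PATH and already points at the right cluster; the function name `nci_snapshot` and the output file names are illustrative and not part of the operator.

```shell
# Hypothetical helper (not shipped with the operator): write each inspection
# step to its own file so the results can be reviewed or attached to a report.
# Usage: nci_snapshot <namespace> <instance-name> <output-dir>
nci_snapshot() (
  ns=$1; name=$2; out=$3
  mkdir -p "$out" || exit 1
  # Failures are captured in the file rather than aborting, so a missing CRD
  # (for example no Percona operator) still leaves the other steps recorded.
  kubectl describe nci "$name" -n "$ns" > "$out/1-describe.txt" 2>&1
  kubectl get nci "$name" -n "$ns" -o jsonpath='{.status.conditions}' > "$out/2-conditions.json" 2>&1
  kubectl get nci "$name" -n "$ns" -o jsonpath='{.status.versionResolution}' > "$out/3-version.json" 2>&1
  kubectl describe helmrelease -n "$ns" > "$out/4-helmrelease.txt" 2>&1
  kubectl get pods -n "$ns" -o wide > "$out/5-pods.txt" 2>&1
  kubectl describe perconapgcluster -n "$ns" > "$out/6-database.txt" 2>&1
  # Lists secret names only; no secret values are written to disk.
  kubectl get secret -n "$ns" -l app.kubernetes.io/managed-by=nextcloud-operator > "$out/7-secrets.txt" 2>&1
  echo "snapshot written to $out"
)
```

For example, `nci_snapshot "$NS" "$NAME" ./snapshot-$(date +%s)` leaves one file per step in a fresh directory.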

Where are the logs?

Component	Command
Operator (kopf handlers)	`kubectl logs -n nextcloud-operator-system -l app.kubernetes.io/name=nextcloud-operator --tail=500`
Operator (specific handler errors)	Same, then pipe through `grep -iE 'error|PermanentError|TemporaryError'`
Flux HelmController (who actually installs the chart)	`kubectl logs -n flux-system -l app=helm-controller --tail=200`
Flux SourceController (resolves `HelmRepository`)	`kubectl logs -n flux-system -l app=source-controller --tail=200`
Nextcloud PHP	`kubectl logs -n $NS deploy/<release>-nextcloud -c nextcloud --tail=200`
Nextcloud `occ` output (on-demand)	`NextcloudCommand` CRD; result in `.status.results`
Percona PG operator	`kubectl logs -n pgo -l app.kubernetes.io/name=percona-postgresql-operator --tail=200`
PostgreSQL itself	`kubectl logs -n $NS <pgcluster>-instance1-xxxx-0 -c database --tail=200`
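
A quick tally of the two kopf error classes in the operator log shows whether the operator has given up (PermanentError) or is still retrying (TemporaryError). The sketch below reads log lines on stdin; the function name `summarize_handler_errors` is illustrative, and it assumes the error class name appears literally in the log line.

```shell
# Counts lines mentioning each kopf error class in log text read from stdin.
summarize_handler_errors() {
  awk '
    /PermanentError/ { perm++ }   # spec problems: retrying will not help
    /TemporaryError/ { temp++ }   # transient problems: the operator retries
    END { printf "permanent=%d temporary=%d\n", perm, temp }
  '
}
```

Usage: `kubectl logs -n nextcloud-operator-system -l app.kubernetes.io/name=nextcloud-operator --tail=500 | summarize_handler_errors`.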

Make the operator verbose

Our chart starts the operator with kopf run --verbose by default. If you launch it another way, pass --verbose or --debug to see handler-level reconciliation traces.

Force reconciliation

When state looks wrong but no error is visible, nudge the operator to re-run its handlers:

# Force reconcile (operator re-runs all field handlers)
kubectl annotate nci $NAME -n $NS k8s.bnerd.com/reconcile=$(date +%s) --overwrite

# Force maintenance tasks on demand
kubectl annotate nci $NAME -n $NS k8s.bnerd.com/run-maintenance=$(date +%s) --overwrite

See Operations & Annotations for all supported annotations, including k8s.bnerd.com/force-delete for bypassing deletion protection.
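
The two annotate commands above differ only in the annotation name, so they can share a small wrapper. This is a sketch under assumptions: the function name `nci_trigger` is made up, and it restricts itself to the two annotations shown in this section so that a typo cannot set force-delete by accident. It also assumes, as the commands above suggest, that the annotation value only needs to change between invocations.

```shell
# Hypothetical wrapper around the annotate commands shown above.
# Usage: nci_trigger <reconcile|run-maintenance> <instance-name> <namespace>
nci_trigger() {
  case "$1" in
    reconcile|run-maintenance) ;;
    *) echo "unsupported trigger: $1" >&2; return 2 ;;
  esac
  # Epoch seconds gives a new annotation value on every call.
  kubectl annotate nci "$2" -n "$3" "k8s.bnerd.com/$1=$(date +%s)" --overwrite
}
```

Usage: `nci_trigger reconcile "$NAME" "$NS"`.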

Common errors

Instance stuck in `Pending` or `Creating`

Symptom: kubectl get nci shows Pending or Creating for more than 5 minutes.

Check in order:

Operator logs for validation errors: a PermanentError means the spec is wrong and retrying will not help. Examples: missing spec.admin, invalid spec.database.type, unknown spec.version.
HelmRepository and HelmRelease exist: run kubectl get helmrelease,helmrepository -n $NS. If either is missing, Flux isn't installed or the operator couldn't reach the Kubernetes API.
Pods Pending: usually means there is no default StorageClass or the cluster is out of resources. kubectl describe pod surfaces the scheduler's reason.
Pods ImagePullBackOff: the registry isn't reachable or the image/tag doesn't exist. Check spec.image and the resolved chart version (status.versionResolution).
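
To see at a glance which of the pod states above applies, the STATUS column of the pod listing can be tallied. A minimal sketch; the function name `pod_status_tally` is illustrative, and it relies on STATUS being the third column of `kubectl get pods --no-headers` output.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and counts pods per STATUS.
pod_status_tally() {
  awk '{ count[$3]++ } END { for (s in count) printf "%s=%d\n", s, count[s] }' | sort
}
```

Usage: `kubectl get pods -n $NS --no-headers | pod_status_tally`. Any Pending or ImagePullBackOff entries point to the corresponding check in the list above.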

`Database not ready yet`: managed PG never becomes ready

Symptom: Operator logs repeat TemporaryError: Database not ready yet until the 20-minute timeout, then the instance goes Failed.

Causes:

Percona PG Operator not installed — kubectl get crd perconapgclusters.pgv2.percona.com. Install via helm install pgo percona/pg-operator -n pgo --create-namespace.
Percona operator is running but has no RBAC in the target namespace — see its logs: kubectl logs -n pgo -l app.kubernetes.io/name=percona-postgresql-operator.
No StorageClass — the PG cluster can't provision PVCs.
Resources too tight — PG instance pods pending because no node has enough CPU/memory.

To recover after fixing the underlying issue, annotate the instance to force a reconcile:

kubectl annotate nci $NAME -n $NS k8s.bnerd.com/reconcile=$(date +%s) --overwrite
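
Two of the causes above can be checked up front, before waiting out the timeout. The sketch below is an assumption-laden pre-flight check: `pg_prereq_check` is a made-up name, and the StorageClass test only looks for a default class, which matters only if the PG cluster does not name a class explicitly.

```shell
# Hypothetical pre-flight check for a managed database.
# Prints one line per prerequisite and exits non-zero if any is missing.
pg_prereq_check() (
  rc=0
  if kubectl get crd perconapgclusters.pgv2.percona.com >/dev/null 2>&1; then
    echo "ok: Percona PG CRD installed"
  else
    echo "missing: Percona PG CRD"; rc=1
  fi
  # kubectl marks the default class by appending "(default)" to its name.
  if kubectl get storageclass 2>/dev/null | grep -q '(default)'; then
    echo "ok: default StorageClass present"
  else
    echo "missing: default StorageClass"; rc=1
  fi
  exit "$rc"
)
```

Run `pg_prereq_check` first; once it passes, the reconcile annotation above is worth triggering.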

`HelmRelease` stuck or in `Failed` state

Symptom: kubectl get helmrelease -n $NS shows Ready=False.

Diagnose:

kubectl describe helmrelease -n $NS
kubectl logs -n flux-system -l app=helm-controller --tail=200 | grep $NAME

Common HelmRelease failures:

chart "nextcloud" version "x.y.z" not found — the resolved chart version doesn't exist in the repository. Check status.versionResolution on the instance; if you pinned spec.helm.version, verify that tag exists in the upstream Helm repo.
values don't validate against schema — usually from custom spec.helm.values. Test your values locally with helm template.
timed out waiting for the condition — chart installed but pods never went ready. Inspect the pods directly.

To force Flux to retry: flux reconcile helmrelease <release-name> -n $NS --with-source.
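
The three failure messages above can be matched mechanically against the HelmRelease's Ready condition. A sketch, assuming the messages contain the phrases quoted above; the function name `helm_failure_hint` and its output labels are illustrative.

```shell
# Maps a HelmRelease Ready-condition message to the matching hint from this section.
# Usage: helm_failure_hint "<message text>"
helm_failure_hint() {
  case "$1" in
    *"not found"*)
      echo "chart-version: check status.versionResolution and any pinned spec.helm.version" ;;
    *"schema"*)
      echo "values: test custom spec.helm.values locally with helm template" ;;
    *"timed out waiting for the condition"*)
      echo "pods: chart installed but pods never became ready; inspect the pods" ;;
    *)
      echo "unknown: read the full message in kubectl describe helmrelease" ;;
  esac
}
```

The message itself can be read with `kubectl get helmrelease <release-name> -n $NS -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'` and passed in as the single argument.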

Pool instance never gets assigned

Symptom: Nextcloud (logical) stays in Assigning phase; NextcloudPool.status.unassigned is 0.

Causes:

Pool is drained — all instances already assigned. Increase spec.replicas on the pool or wait for the pool reconciler to replenish.
Labels don't match — spec.poolSelector.matchLabels on the Nextcloud doesn't match template.metadata.labels on the pool. Compare with kubectl get nci -A --show-labels | grep pool.
Pool instances stuck Pending — the pool is creating replacements but they can't become ready (see "Instance stuck in Pending" above).
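
For the label mismatch case, the comparison can be done mechanically: every key=value pair in the selector must appear on the pool instance. The sketch below works on the comma-separated format that `--show-labels` prints; the function name `labels_match` and the label keys in the usage example are hypothetical.

```shell
# Succeeds when every key=value pair in the selector is present in the label set.
# Usage: labels_match "<k=v,k=v selector>" "<k=v,k=v instance labels>"
labels_match() (
  selector=$1; labels=$2
  set -f        # no filename globbing while splitting
  IFS=,         # split the selector on commas (scoped to this subshell)
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;   # pair present, keep checking
      *) echo "selector pair not on instance: $pair" >&2; exit 1 ;;
    esac
  done
  exit 0
)
```

Usage: `labels_match "pool=small" "app=nextcloud,pool=small" && echo match`, with the second argument copied from the LABELS column of `kubectl get nci -A --show-labels`.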

`Cannot load API key for SignalingServer / RecordingServer`

Symptom: The operator retries every 60s with this TemporaryError.

The referenced credentialsSecret is missing or lacks the expected key. Check:

kubectl get secret <credentialsSecret-name> -n nextcloud-operator-system -o yaml

The secret must contain the API key under the key specified in the SignalingServer/RecordingServer spec.
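
Dumping the whole secret as YAML also prints its base64-encoded contents. If only the presence of the key matters, a narrower check avoids that. A sketch, assuming the secret lives in nextcloud-operator-system as in the command above; the function name `secret_has_key` and the key name in the usage example are hypothetical.

```shell
# Succeeds when the named secret exists and has a non-empty value under <key>.
# Usage: secret_has_key <secret-name> <key> [namespace]
secret_has_key() (
  ns=${3:-nextcloud-operator-system}
  # go-template with index also handles keys that contain dots or dashes.
  value=$(kubectl get secret "$1" -n "$ns" \
    -o go-template="{{index .data \"$2\"}}" 2>/dev/null) || exit 1
  # A key that is absent from .data renders as "<no value>" in a Go template.
  [ -n "$value" ] && [ "$value" != "<no value>" ]
)
```

Usage: `secret_has_key <credentialsSecret-name> api-key || echo "key missing"`.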

`Authentication failed for <api-endpoint>: 401`

Symptom: PermanentError in operator logs when registering a backend.

The API key in the credentialsSecret is wrong, or the backend API is rejecting it. Verify the key on the backend (signaling or recording server), update the secret, then delete and re-apply the SignalingServer / RecordingServer CR so it re-registers cleanly.

`NextcloudCommand` times out

Symptom: A NextcloudCommand finishes with phase: Failed and status.results[].stderr shows a timeout.

Causes:

A single command exceeded spec.perCommandTimeoutSeconds (default 300s). For expensive migrations like occ db:convert-filecache-bigint, raise the per-command timeout to e.g. 3600.
The overall job exceeded spec.timeoutSeconds. Raise it or split the commands across multiple NextcloudCommand resources.
No running Nextcloud pod was available — the instance is not Ready. Check spec.targetRef points to a healthy NextcloudInstance/Nextcloud.

`S3 data backup enabled but no repository configured`

Symptom: PermanentError at instance creation time.

Setting spec.backups.data.enabled: true requires either the S3Backup CRD (from the bnerd backup operator) to be installed or a backup repository to be configured. Install the backup operator, configure a repository, or disable the feature.

CrashLoopBackOff on Nextcloud pod after an upgrade

Symptom: After bumping spec.version, pods crash-loop with migration errors.

Post-upgrade migrations should run automatically via utils/maintenance.py, but you can trigger them manually:

kubectl annotate nci $NAME -n $NS k8s.bnerd.com/run-maintenance=$(date +%s) --overwrite

Watch the operator logs to confirm the maintenance tasks ran. If migrations like add-missing-indices or convert-filecache-bigint fail, inspect the output and run them as a dedicated NextcloudCommand with a longer timeout.

Finalizer blocks namespace deletion

Symptom: kubectl delete namespace hangs; the namespace stays in Terminating.

A NextcloudInstance still has a finalizer because cleanup is incomplete (typically a managed DB that won't delete). To unblock:

# Check what's blocking
kubectl get nci -n $NS -o jsonpath='{.items[*].metadata.finalizers}'

# Force-delete (skips operator cleanup — only use when you accept losing state)
kubectl annotate nci $NAME -n $NS k8s.bnerd.com/force-delete=true --overwrite
kubectl delete nci $NAME -n $NS
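
Because force-delete skips operator cleanup, it is worth putting a guard in front of it in any script. The sketch below is one possible guard; the function name `nci_force_delete` and the `CONFIRM_FORCE_DELETE` variable are hypothetical and not something the operator knows about.

```shell
# Hypothetical guard: refuses to force-delete unless CONFIRM_FORCE_DELETE=yes is set.
# Usage: CONFIRM_FORCE_DELETE=yes nci_force_delete <instance-name> <namespace>
nci_force_delete() {
  if [ "${CONFIRM_FORCE_DELETE:-}" != "yes" ]; then
    echo "refusing: set CONFIRM_FORCE_DELETE=yes to skip operator cleanup for $1" >&2
    return 1
  fi
  # Delete only if the annotation was applied, so protection is never half-bypassed.
  kubectl annotate nci "$1" -n "$2" k8s.bnerd.com/force-delete=true --overwrite &&
    kubectl delete nci "$1" -n "$2"
}
```

Without the variable set, the function prints a refusal and runs no kubectl command at all.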

For the full teardown order, audit log, and recreate-safety behaviour, see Deletion & Cleanup.

When to file a bug vs. keep debugging

File a bug if:

The operator panics or the pod crash-loops (kubectl logs shows a Python traceback without a clear PermanentError).
A TemporaryError repeats indefinitely even after the underlying cause is fixed.
Status fields contradict reality (e.g. phase: Ready but pods are CrashLoopBackOff). After the layered-readiness fix, Ready only gets set when the HelmRelease, Deployment, Endpoints, and Ingress all report ready — so this combination should no longer be reachable. If you see it, file a bug with kubectl get nci $NAME -n $NS -o yaml attached.

Before filing, work through these known stuck states:

Instance is stuck in Deploying: inspect status.workload and the Ready condition's reason. WaitingForHelmRelease → check the Flux HelmRelease (kubectl describe helmrelease ...). WaitingForPods → describe the Nextcloud Deployment; usually a values misconfiguration or PVC problem. WaitingForEndpoints → the Service has no ready backends (pod readiness probe failing). WaitingForIngress → the cluster's ingress controller hasn't assigned a load-balancer address yet.
Instance is stuck in Creating with database.managed: true: inspect status.database and the DatabaseReady condition. reason=Initializing is normal: Percona PG cluster startup typically takes 1–5 min. reason=ProvisioningFailed → check kubectl describe perconapgcluster $NAME-pg -n $NS and look at the events. reason=Timeout → 20 min have passed and the cluster still isn't ready; the phase transitions to Failed, but the operator keeps retrying. Common causes: missing StorageClass, pg-operator pod not running, image pull failure. Fix the underlying infrastructure issue and the next 60 s retry should self-heal back to Creating → Deploying → Ready without further manual intervention.
Instance is at phase: Failed with status.database.reason=PgOperatorNotFound: the Percona PG operator CRD is not installed in the cluster. This is a PermanentError, so the operator will not retry on its own. Install the pg-operator (make install-pg-operator or your usual Flux/Helm flow), then re-trigger via kubectl annotate nci $NAME -n $NS k8s.bnerd.com/reconcile=$(date +%s) --overwrite.
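
For an instance stuck in Deploying, the Ready condition's reason can be read and translated into the next check in one step. A sketch, assuming the condition type is literally `Ready` as described above; the function name `nci_next_step` is illustrative.

```shell
# Reads the Ready condition's reason and prints the next thing to check.
# Usage: nci_next_step <instance-name> <namespace>
nci_next_step() (
  reason=$(kubectl get nci "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}' 2>/dev/null)
  case "$reason" in
    WaitingForHelmRelease) echo "check the Flux HelmRelease" ;;
    WaitingForPods)        echo "describe the Nextcloud Deployment (values or PVC problem)" ;;
    WaitingForEndpoints)   echo "Service has no ready backends; check the readiness probe" ;;
    WaitingForIngress)     echo "ingress controller has not assigned an address yet" ;;
    *)                     echo "no known hint for reason: ${reason:-<empty>}" ;;
  esac
)
```

Usage: `nci_next_step "$NAME" "$NS"`.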

Keep debugging yourself if:

The error message explicitly says what's wrong (missing secret, invalid field, wrong version). The operator is telling you — believe it.
The HelmRelease is failing — that's a chart/values issue, not an operator bug.
Kubernetes primitives are broken (no StorageClass, no ingress, no DNS). Those aren't the operator's job to fix.