Troubleshooting¶
Day-2 reference for diagnosing a NextcloudInstance, Nextcloud, or NextcloudPool that isn't behaving as expected. Start with State inspection, then check the logs, then match your symptom under Common errors.
State inspection — always start here¶
NS=nextcloud-demo
NAME=demo
# 1. CRD status + events (events often contain the real reason)
kubectl describe nci $NAME -n $NS
# 2. Conditions (ready / failed / why)
kubectl get nci $NAME -n $NS -o jsonpath='{.status.conditions}' | jq
# 3. Resolved Helm chart version
kubectl get nci $NAME -n $NS -o jsonpath='{.status.versionResolution}' | jq
# 4. Downstream HelmRelease
kubectl describe helmrelease -n $NS
# 5. Pods (anything CrashLoopBackOff / ImagePullBackOff / Pending?)
kubectl get pods -n $NS
kubectl describe pod -n $NS <pod-name>
# 6. Managed database (only if spec.database.managed=true)
kubectl get perconapgcluster -n $NS
kubectl describe perconapgcluster -n $NS
# 7. Secrets created by the operator
kubectl get secret -n $NS -l app.kubernetes.io/managed-by=nextcloud-operator
For the logical Nextcloud tenant resource:
kubectl describe nc $NAME -n $NS
kubectl get nc $NAME -n $NS -o jsonpath='{.status.instanceRef}' | jq
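When the raw JSON from step 2 is hard to scan, a compact table of conditions can help. The following is a minimal sketch, not part of the operator tooling: the function names print_conditions and not_true_conditions are made up here, and the sketch assumes each condition carries the conventional Kubernetes type, status, reason, and message fields.

```shell
# print_conditions KIND NAME NAMESPACE
# Prints one tab-separated line per condition: type, status, reason, message.
# Assumption: .status.conditions entries use the conventional
# type/status/reason/message fields.
print_conditions() (
  kind=$1 name=$2 ns=$3
  kubectl get "$kind" "$name" -n "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
)

# not_true_conditions KIND NAME NAMESPACE
# Same output, restricted to conditions whose status is not "True".
not_true_conditions() (
  print_conditions "$1" "$2" "$3" | awk -F '\t' '$2 != "True"'
)
```

For example, not_true_conditions nci "$NAME" "$NS" would list only the conditions that are still blocking the instance.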
Where are the logs?¶
| Component | Command |
|---|---|
| Operator (kopf handlers) | kubectl logs -n nextcloud-operator-system -l app.kubernetes.io/name=nextcloud-operator --tail=500 |
| Operator — specific handler errors | Same, then grep -i "error\|PermanentError\|TemporaryError" |
| Flux HelmController (who actually installs the chart) | kubectl logs -n flux-system -l app=helm-controller --tail=200 |
| Flux SourceController (resolves HelmRepository) | kubectl logs -n flux-system -l app=source-controller --tail=200 |
| Nextcloud PHP | kubectl logs -n $NS deploy/<release>-nextcloud -c nextcloud --tail=200 |
| Nextcloud occ output (on-demand) | NextcloudCommand CRD; result in .status.results |
| Percona PG operator | kubectl logs -n pgo -l app.kubernetes.io/name=percona-postgresql-operator --tail=200 |
| PostgreSQL itself | kubectl logs -n $NS <pgcluster>-instance1-xxxx-0 -c database --tail=200 |
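The first two rows of the table can be combined into one helper. This is a sketch only: the function name operator_errors is hypothetical, while the namespace, label selector, and grep pattern are taken from the table above.

```shell
# operator_errors [TAIL_LINES]
# Fetches the most recent operator log lines (default 500) and keeps only
# the ones that mention an error. Exits non-zero when nothing matches,
# because that is how grep reports "no lines found".
operator_errors() (
  tail_lines=${1:-500}
  kubectl logs -n nextcloud-operator-system \
    -l app.kubernetes.io/name=nextcloud-operator --tail="$tail_lines" \
    | grep -iE 'error|PermanentError|TemporaryError'
)
```

Piping the result through grep "$NAME" narrows it further to a single instance, provided the handler log lines include the resource name.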
Make the operator verbose
The operator is started with kopf run --verbose by default in our chart. If you run it differently, increase verbosity with --verbose or --debug to see handler-level reconciliation traces.
Force reconciliation¶
When state looks wrong but no error is visible, nudge the operator to re-run its handlers:
# Force reconcile (operator re-runs all field handlers)
kubectl annotate nci $NAME -n $NS k8s.bnerd.com/reconcile=$(date +%s) --overwrite
# Force maintenance tasks on demand
kubectl annotate nci $NAME -n $NS k8s.bnerd.com/run-maintenance=$(date +%s) --overwrite
See Operations & Annotations for all supported annotations, including k8s.bnerd.com/force-delete for bypassing deletion protection.
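After forcing a reconcile it is often useful to wait and see where the instance ends up. The sketch below wraps the reconcile annotation from above in a polling loop. The function name reconcile_and_wait is made up for this example, and it assumes the phase is exposed at .status.phase; adjust the path if your CRD reports it elsewhere.

```shell
# reconcile_and_wait NAME NAMESPACE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Sets the reconcile annotation, then polls the instance phase.
# Exit status: 0 = Ready, 1 = Failed, 2 = timed out, 3 = annotate failed.
# Assumption: the phase lives at .status.phase. INTERVAL must be > 0.
reconcile_and_wait() (
  name=$1 ns=$2 timeout=${3:-300} interval=${4:-10}
  kubectl annotate nci "$name" -n "$ns" \
    "k8s.bnerd.com/reconcile=$(date +%s)" --overwrite >/dev/null || exit 3
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Sleep before the first check so the operator has a chance to react.
    sleep "$interval"
    elapsed=$((elapsed + interval))
    phase=$(kubectl get nci "$name" -n "$ns" -o jsonpath='{.status.phase}')
    echo "phase=${phase:-unknown} after ${elapsed}s"
    case $phase in
      Ready) exit 0 ;;
      Failed) exit 1 ;;
    esac
  done
  exit 2
)
```

A call such as reconcile_and_wait "$NAME" "$NS" 600 15 would poll for up to ten minutes. Note that the first poll may still report a stale phase if the operator has not yet picked up the annotation, so treat an immediate Failed result with caution.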
Common errors¶
Instance stuck in Pending or Creating¶
Symptom: kubectl get nci shows Pending or Creating for more than 5 minutes.
Check in order:
- Operator logs for validation errors: a PermanentError means the spec is wrong and no amount of retrying will help. Examples: missing spec.admin, invalid spec.database.type, unknown spec.version.
- HelmRepository + HelmRelease exist: kubectl get helmrelease,helmrepository -n $NS. If they are missing, Flux isn't installed or the operator couldn't reach the K8s API.
- Pods Pending: usually means no default StorageClass or the cluster is out of resources. kubectl describe pod surfaces the scheduler's reason.
- Pods ImagePullBackOff: the registry isn't reachable or the image/tag doesn't exist. Check spec.image / resolved chart version.
Database not ready yet — managed PG never becomes ready¶
Symptom: Operator logs repeat TemporaryError: Database not ready yet until the 20-minute timeout, then the instance goes Failed.
Causes:
- Percona PG Operator not installed: check with kubectl get crd perconapgclusters.pgv2.percona.com. Install via helm install pgo percona/pg-operator -n pgo --create-namespace.
- Percona operator is running but has no RBAC in the target namespace: see its logs with kubectl logs -n pgo -l app.kubernetes.io/name=percona-postgresql-operator.
- No StorageClass: the PG cluster can't provision PVCs.
- Resources too tight: PG instance pods are Pending because no node has enough CPU/memory.
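Two of the causes above (missing CRD, missing StorageClass) can be checked in one pass. This is a rough sketch under stated assumptions: the function name check_managed_db_prereqs is hypothetical, and it looks specifically for a default StorageClass via the standard storageclass.kubernetes.io/is-default-class annotation, which may be stricter than your setup needs if the cluster spec names a class explicitly.

```shell
# check_managed_db_prereqs
# Reports whether the Percona PG CRD is installed and whether any
# StorageClass is marked as default.
# Exit status: 0 if both checks pass, 1 otherwise.
check_managed_db_prereqs() (
  ok=0
  if kubectl get crd perconapgclusters.pgv2.percona.com >/dev/null 2>&1; then
    echo "OK: Percona PG CRD is installed"
  else
    echo "MISSING: perconapgclusters.pgv2.percona.com CRD"
    ok=1
  fi
  # Print "name<TAB>annotation value" per class, keep the ones set to "true".
  defaults=$(kubectl get storageclass \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}' \
    | awk -F '\t' '$2 == "true" { print $1 }')
  if [ -n "$defaults" ]; then
    echo "OK: default StorageClass: $defaults"
  else
    echo "MISSING: no default StorageClass"
    ok=1
  fi
  exit "$ok"
)
```

The RBAC and resource-pressure causes are not covered here; those still need the operator logs and kubectl describe pod.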
To recover after fixing the underlying issue, annotate the instance to force a reconcile:
kubectl annotate nci $NAME -n $NS k8s.bnerd.com/reconcile=$(date +%s) --overwrite
HelmRelease stuck or in Failed state¶
Symptom: kubectl get helmrelease -n $NS shows Ready=False.
Diagnose:
kubectl describe helmrelease -n $NS
kubectl logs -n flux-system -l app=helm-controller --tail=200 | grep $NAME
Common HelmRelease failures:
- chart "nextcloud" version "x.y.z" not found: the resolved chart version doesn't exist in the repository. Check status.versionResolution on the instance; if you pinned spec.helm.version, verify that tag exists in the upstream Helm repo.
- values don't validate against schema: usually from custom spec.helm.values. Test your values locally with helm template.
- timed out waiting for the condition: chart installed but pods never went ready. Inspect the pods directly.
To force Flux to retry: flux reconcile helmrelease <release-name> -n $NS --with-source.
Pool instance never gets assigned¶
Symptom: Nextcloud (logical) stays in Assigning phase; NextcloudPool.status.unassigned is 0.
Causes:
- Pool is drained: all instances are already assigned. Increase spec.replicas on the pool or wait for the pool reconciler to replenish.
- Labels don't match: spec.poolSelector.matchLabels on the Nextcloud doesn't match template.metadata.labels on the pool. Compare with kubectl get nci -A --show-labels | grep pool.
- Pool instances stuck Pending: the pool is creating replacements but they can't become ready (see "Instance stuck in Pending or Creating" above).
Cannot load API key for SignalingServer / RecordingServer¶
Symptom: The operator retries every 60s with this TemporaryError.
The referenced credentialsSecret is missing or lacks the expected key. Check that the secret exists in the expected namespace and that it contains the API key under the key specified in the SignalingServer/RecordingServer spec.
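One way to verify this without printing the secret value is to list only the key names. The helper below is a sketch: secret_has_key is a made-up name, and the secret and key names used in the example call are placeholders, not values defined by the operator.

```shell
# secret_has_key SECRET NAMESPACE KEY
# Exit status: 0 if the Secret exists and contains KEY,
#              1 if the Secret exists but KEY is absent,
#              2 if the Secret cannot be read at all.
# Only key names are listed, so the secret value never reaches the terminal.
secret_has_key() (
  secret=$1 ns=$2 key=$3
  keys=$(kubectl get secret "$secret" -n "$ns" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}') || exit 2
  # -F: literal match, -x: whole line, so "api" does not match "api-key".
  printf '%s\n' "$keys" | grep -Fxq -- "$key" || exit 1
)
```

For instance, secret_has_key signaling-credentials "$NS" api-key (both names hypothetical) returns 1 when the secret is present but was created with a different key name.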
Authentication failed for <api-endpoint>: 401¶
Symptom: PermanentError in operator logs when registering a backend.
The API key in the credentialsSecret is wrong or the backend API is rejecting it. Verify the key on the backend (signaling or recording server), update the secret, then delete the SignalingServer / RecordingServer CR to re-register cleanly.
NextcloudCommand times out¶
Symptom: A NextcloudCommand finishes with phase: Failed and status.results[].stderr shows a timeout.
- A single command exceeded spec.perCommandTimeoutSeconds (default 300s). For expensive migrations like occ db:convert-filecache-bigint, raise the per-command timeout to e.g. 3600.
- The overall job exceeded spec.timeoutSeconds. Raise it or split the commands across multiple NextcloudCommand resources.
- No running Nextcloud pod was available: the instance is not Ready. Check that spec.targetRef points to a healthy NextcloudInstance/Nextcloud.
S3 data backup enabled but no repository configured¶
Symptom: PermanentError at instance creation time.
spec.backups.data.enabled: true requires either the S3Backup CRD (from the bnerd backup operator) to be installed or a backup repository to be configured. Install the backup operator, configure a repository, or disable the feature.
CrashLoopBackOff on Nextcloud pod after an upgrade¶
Symptom: After bumping spec.version, pods crash-loop with migration errors.
Post-upgrade migrations should run automatically via utils/maintenance.py, but you can trigger them manually:
kubectl annotate nci $NAME -n $NS k8s.bnerd.com/run-maintenance=$(date +%s) --overwrite
Watch the operator logs to confirm the maintenance tasks ran. If migrations like add-missing-indices or convert-filecache-bigint fail, inspect the output and run them as a dedicated NextcloudCommand with a longer timeout.
Finalizer blocks namespace deletion¶
Symptom: kubectl delete namespace hangs; the namespace stays in Terminating.
A NextcloudInstance still has a finalizer because cleanup is incomplete (typically a managed DB that won't delete). To unblock:
# Check what's blocking
kubectl get nci -n $NS -o jsonpath='{.items[*].metadata.finalizers}'
# Force-delete (skips operator cleanup — only use when you accept losing state)
kubectl annotate nci $NAME -n $NS k8s.bnerd.com/force-delete=true --overwrite
kubectl delete nci $NAME -n $NS
For the full teardown order, audit log, and recreate-safety behaviour, see Deletion & Cleanup.
When to file a bug vs. keep debugging¶
File a bug if:
- The operator panics or the pod crash-loops (kubectl logs shows a Python traceback without a clear PermanentError).
- A TemporaryError repeats indefinitely even after the underlying cause is fixed.
- Status fields contradict reality (e.g. phase: Ready but pods are CrashLoopBackOff). After the layered-readiness fix, Ready only gets set when the HelmRelease, Deployment, Endpoints, and Ingress all report ready, so this combination should no longer be reachable. If you see it, file a bug with kubectl get nci $NAME -n $NS -o yaml attached.
- Instance is stuck in Deploying: inspect status.workload and the Ready condition's reason.
  - WaitingForHelmRelease → check the Flux HelmRelease (kubectl describe helmrelease ...).
  - WaitingForPods → describe the Nextcloud Deployment; usually a values misconfiguration or PVC problem.
  - WaitingForEndpoints → the Service has no ready backends (pod readiness probe failing).
  - WaitingForIngress → the cluster's ingress controller hasn't assigned a load-balancer address yet.
- Instance is stuck in Creating with database.managed: true: inspect status.database and the DatabaseReady condition.
  - reason=Initializing is normal; Percona PG cluster startup typically takes 1–5 min.
  - reason=ProvisioningFailed → check kubectl describe perconapgcluster $NAME-pg -n $NS and look at events.
  - reason=Timeout → 20 min have passed and the cluster still isn't ready; the phase transitions to Failed but the operator keeps retrying. Common causes: missing StorageClass, pg-operator pod not running, image pull failure. Fix the underlying infra issue and the next 60 s retry should self-heal back to Creating → Deploying → Ready without manual intervention.
- Instance is at phase: Failed with status.database.reason=PgOperatorNotFound: the Percona PG operator CRD is not installed in the cluster. This is a PermanentError, so install the pg-operator (make install-pg-operator or your usual Flux/Helm flow) and then re-trigger via kubectl annotate nci $NAME -n $NS k8s.bnerd.com/reconcile=$(date +%s) --overwrite.
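The reason values listed above lend themselves to a small lookup. The sketch below reads the reason of a named condition and maps it to the next check suggested in this section. Both function names are hypothetical, and the hints are paraphrased from the list rather than emitted by the operator.

```shell
# condition_reason NAME NAMESPACE CONDITION_TYPE
# Prints the reason of one condition (e.g. Ready or DatabaseReady).
condition_reason() (
  name=$1 ns=$2 type=$3
  kubectl get nci "$name" -n "$ns" \
    -o jsonpath="{.status.conditions[?(@.type==\"$type\")].reason}"
)

# next_step_for_reason REASON
# Maps a documented reason to the suggested next check.
# Returns 1 for reasons this page does not cover.
next_step_for_reason() {
  case $1 in
    WaitingForHelmRelease) echo "Check the Flux HelmRelease with kubectl describe helmrelease" ;;
    WaitingForPods)        echo "Describe the Nextcloud Deployment; look for values or PVC problems" ;;
    WaitingForEndpoints)   echo "Service has no ready backends; check the pod readiness probe" ;;
    WaitingForIngress)     echo "Ingress controller has not assigned a load-balancer address yet" ;;
    Initializing)          echo "Normal; Percona PG startup typically takes 1-5 min" ;;
    ProvisioningFailed)    echo "Describe the perconapgcluster and read its events" ;;
    Timeout)               echo "Check StorageClass, pg-operator pod, and image pulls" ;;
    PgOperatorNotFound)    echo "Install the Percona PG operator, then force a reconcile" ;;
    *) echo "No documented hint for reason: $1" >&2; return 1 ;;
  esac
}
```

A typical call would be next_step_for_reason "$(condition_reason "$NAME" "$NS" Ready)". An unknown reason, or one that persists after the hinted cause is fixed, is a reasonable signal to file a bug.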
Keep debugging yourself if:
- The error message explicitly says what's wrong (missing secret, invalid field, wrong version). The operator is telling you — believe it.
- The HelmRelease is failing — that's a chart/values issue, not an operator bug.
- Kubernetes primitives are broken (no StorageClass, no ingress, no DNS). Those aren't the operator's job to fix.
See also:
- Operations & Annotations — reconcile, run-maintenance, force-delete
- Monitoring — metrics and dashboards for proactive detection
- Managed PostgreSQL — deeper DB-specific troubleshooting