Monitoring
The operator exports Prometheus metrics about its own reconciliation loops and the state of every managed Nextcloud, NextcloudInstance, NextcloudPool and HelmRelease. A ServiceMonitor and two Grafana dashboards (Overview + Detail) ship with the chart and as optional flat-YAML manifests.
Metrics are opt-in on both install paths.
Quick start
Helm install
helm upgrade --install nextcloud-operator ./chart \
--set metrics.enabled=true \
--set metrics.serviceMonitor.enabled=true \
--set metrics.grafana.enabled=true
This creates:
- Container port metrics/9090 and environment variables on the Deployment
- A Service exposing port 9090
- A ServiceMonitor (requires the Prometheus Operator CRDs)
- Two ConfigMaps, one per dashboard, labelled grafana_dashboard: "1" so the Grafana sidecar picks them up automatically
Flat YAML install
# 1. Enable metrics in the operator deployment (edit METRICS_ENABLED to "true")
kubectl -n nextcloud-operator-system set env deploy/nextcloud-operator METRICS_ENABLED=true
# 2. Apply the monitoring stack
kubectl apply -f deploy/monitoring.yaml \
-f deploy/dashboard-overview.yaml \
-f deploy/dashboard-detail.yaml
deploy/monitoring.yaml contains the Service and ServiceMonitor. The two
dashboard ConfigMaps are regenerated from the chart via make sync-dashboards.
Prerequisites
| Component | Required for |
|---|---|
| Prometheus + prometheus-operator CRDs | ServiceMonitor scraping |
| Grafana with the sidecar dashboard loader | Auto-loading the shipped dashboards |
If you run a stand-alone Grafana, import the JSON from chart/dashboards/
directly.
Exposed metrics
All metrics are prefixed with nextcloud_operator_.
State gauges (updated by the background collector every 60s)
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| nextclouds_total | Gauge | phase | Nextcloud CR count by phase |
| nextcloudinstances_total | Gauge | phase | NextcloudInstance CR count by phase |
| nextcloudpools_total | Gauge | phase | NextcloudPool count by phase |
| nextcloudprofiles_total | Gauge | (none) | NextcloudProfile count |
| pool_replicas | Gauge | pool, type (desired/ready/unassigned/assigned) | Per-pool replica breakdown |
| nextcloud_info | Gauge | name, namespace, phase, instance_name, instance_namespace, profile, url | Per-CR identity (always 1) |
| nextcloudinstance_info | Gauge | name, namespace, phase, assigned_to (owning Nextcloud name, empty when spare), pool, profile, url | Per-NCI identity (always 1) |
| nextcloudpool_info | Gauge | name, phase, desired, ready, unassigned, assigned | Per-pool identity (always 1) |
| assignment_info | Gauge | nextcloud_namespace, nextcloud_name, nextcloud_url, instance_namespace, instance_name, profile, pool, state (assigned/spare) | One series per NextcloudInstance. Lets dashboards JOIN tenant identifiers (URL, name) onto K8s workload metrics by instance_namespace. Spare pool instances appear with empty nextcloud_* labels. |
| nextcloud_condition | Gauge | name, namespace, type | Condition state (1/0/-1 = True/False/Unknown) |
| nextcloudinstance_condition | Gauge | name, namespace, type | Condition state (1/0/-1 = True/False/Unknown) |
| helmrelease_ready | Gauge | namespace, name | HelmRelease Ready condition (1/0/-1) |
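As a rough illustration of how the state gauges above could be derived on each collector tick, the sketch below maps condition statuses to the 1/0/-1 encoding and counts custom resources by phase. The function names, the plain-dict input shape, and the choice to treat unrecognised or missing values as Unknown are assumptions made for this example; it is not the operator's actual collector code.

```python
from collections import Counter

# Encoding documented for the *_condition and helmrelease_ready gauges.
CONDITION_VALUES = {"True": 1, "False": 0, "Unknown": -1}


def condition_value(status: str) -> int:
    """Map a condition status string to 1/0/-1.

    Assumption: any unrecognised status is reported as Unknown (-1).
    """
    return CONDITION_VALUES.get(status, -1)


def count_by_phase(resources: list[dict]) -> Counter:
    """Count resources by status.phase.

    Hypothetical input: custom resources as plain dicts. A resource without
    a phase is counted under "Unknown" (an assumption for this sketch).
    """
    return Counter(r.get("status", {}).get("phase", "Unknown") for r in resources)
```

Each key of the returned Counter would correspond to one phase label value on a *_total gauge, so the number of series stays bounded by the number of phases rather than by the number of resources.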
Event counters
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| reconcile_total | Counter | resource, result (success/error/temporary_error/permanent_error) | One increment per kopf handler invocation |
| errors_total | Counter | resource, stage (validation/db_provision/helmrelease/occ/maintenance/…) | Categorised error accounting |
| pool_scale_total | Counter | pool, direction (up/down) | Pool instance create/delete events |
| pool_assignment_total | Counter | pool, result (success/conflict/no_match) | Outcome of pool match attempts |
| maintenance_task_total | Counter | task, result (success/error) | Periodic and post-upgrade OCC task runs |
Latency histograms
| Metric | Labels | Covers |
|---|---|---|
| operation_duration_seconds | operation, result | Generic operation timer; used for db_provision, helmrelease_create_or_update, occ_command, maintenance_task |
| instance_ready_duration_seconds | profile | Seconds from NCI creation to phase=Ready |
| nextcloud_assignment_duration_seconds | pool | Seconds from Nextcloud creation to first pool assignment |
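Percentiles such as p50/p95/p99 are not stored in a histogram; they are estimated from cumulative bucket counts at query time. The sketch below shows the linear-interpolation idea that PromQL's histogram_quantile() is based on, in simplified form. The bucket boundaries in the usage example are made up for illustration and are not the operator's actual bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The bucket list must include a final (+Inf, total_count) entry, as in the
    Prometheus exposition of a histogram. Simplified sketch, not a full
    re-implementation of the PromQL function.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Beyond the highest finite bucket (or an empty bucket):
                # the best available answer is the previous upper bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical buckets: 10 observations <= 1s, 90 <= 5s, 100 <= 10s.
    example = [(1.0, 10), (5.0, 90), (10.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.5, example))
    print(histogram_quantile(0.95, example))
```

The estimate is only as precise as the bucket boundaries allow, which is worth keeping in mind when reading the p99 panels.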
Dashboards
Overview (uid: nextcloud-operator-overview)
Fleet-wide view. Panels:
- Totals (Nextclouds, NCIs, Ready, Failed)
- Phase distribution (pie)
- Reconciliation rate per resource/result
- Error rate per resource/stage
- instance_ready_duration_seconds p50/p95/p99
- Operation p95 by operation
- Pool replicas + assignment rate
- HelmRelease Ready/Not-Ready/Unknown counts
- Failed NextcloudInstances table
- Tenant ↔ Instance Assignment table (joins tenant URL/name to instance via
assignment_info)
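The join behind the Tenant ↔ Instance Assignment table can be pictured as a lookup keyed by instance_namespace. The following sketch performs the same join over plain Python dicts. The sample shapes, the function name, and the assumption that each instance namespace hosts exactly one NextcloudInstance are illustrative only; in Grafana the join is expressed in the panel query, not in Python.

```python
def join_tenant_labels(workload_samples: list[dict], assignment_series: list[dict]) -> list[dict]:
    """Attach tenant labels from assignment_info to workload samples.

    Assumption: one NextcloudInstance per namespace, so instance_namespace
    is a unique join key.
    """
    by_namespace = {s["instance_namespace"]: s for s in assignment_series}
    joined = []
    for sample in workload_samples:
        info = by_namespace.get(sample["namespace"])
        if info is None:
            continue  # namespace is not managed by the operator
        joined.append({
            **sample,
            "nextcloud_name": info["nextcloud_name"],
            "nextcloud_url": info["nextcloud_url"],
            "state": info["state"],
        })
    return joined


if __name__ == "__main__":
    # Hypothetical data: one assigned instance, one spare, one unrelated namespace.
    workload = [
        {"namespace": "nc-a", "value": 1.5},
        {"namespace": "kube-system", "value": 9.0},
        {"namespace": "nc-spare", "value": 0.2},
    ]
    assignment = [
        {"instance_namespace": "nc-a", "nextcloud_name": "acme",
         "nextcloud_url": "https://acme.example.com", "state": "assigned"},
        {"instance_namespace": "nc-spare", "nextcloud_name": "",
         "nextcloud_url": "", "state": "spare"},
    ]
    for row in join_tenant_labels(workload, assignment):
        print(row)
```

Spare instances survive the join with empty tenant labels, which matches how they appear in assignment_info.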
Detail (uid: nextcloud-operator-detail)
Per-instance drill-down. Template variables: $namespace, $instance.
Panels are filtered by these variables where per-instance labels exist (info
and condition gauges, HelmRelease gauge). Reconcile and error series are shown
at the operator level for context — they don't carry name labels by design
to keep cardinality bounded.
Tuning
- metrics.collectorInterval (Helm) / METRICS_COLLECTOR_INTERVAL (env): Background collector frequency in seconds. Default 60. Drop to 15–30 if you want faster dashboard refresh; raise to 300 for large fleets.
- metrics.serviceMonitor.interval: Prometheus scrape interval. Default 60s.
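The two intervals add up: a state gauge can be up to one collector interval old when it is scraped, and the newest sample in Prometheus can itself be up to one scrape interval old. The sketch below reads the collector interval from an environment mapping and computes that rough upper bound. The validation behaviour and function names are assumptions for illustration, not the operator's actual parsing logic.

```python
DEFAULT_COLLECTOR_INTERVAL = 60  # seconds, the documented default


def collector_interval(env: dict) -> int:
    """Read METRICS_COLLECTOR_INTERVAL from an env mapping, defaulting to 60.

    Assumption: an unset or blank value falls back to the default, and
    non-positive values are rejected.
    """
    raw = env.get("METRICS_COLLECTOR_INTERVAL", "")
    if not raw.strip():
        return DEFAULT_COLLECTOR_INTERVAL
    value = int(raw)
    if value <= 0:
        raise ValueError("METRICS_COLLECTOR_INTERVAL must be a positive number of seconds")
    return value


def worst_case_staleness(collector_seconds: int, scrape_seconds: int) -> int:
    """Rough upper bound, in seconds, on how stale a state gauge can look."""
    return collector_seconds + scrape_seconds


if __name__ == "__main__":
    import os

    interval = collector_interval(dict(os.environ))
    print(worst_case_staleness(interval, 60))
```

With both defaults at 60 seconds, a phase change can take roughly two minutes to show up on a dashboard, so lowering only one of the two settings gives a limited improvement.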
Cardinality notes
Per-instance labels exist only on state gauges (*_info, *_condition,
helmrelease_ready, assignment_info). Counters and histograms carry
resource / operation / stage / pool / profile labels, so their series count is
bounded by the number of pools, profiles and operation kinds, not by the number
of tenants. This keeps counter and histogram scrape volume flat as fleet size
grows.
assignment_info emits exactly one series per NextcloudInstance, so its
cardinality equals fleet size. The nextcloud_url label is high-cardinality but
bounded by tenant count and only changes when a customer renames their URL —
suitable for table panels and ad-hoc joins, not for use in PromQL by ()
groupings.
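For capacity planning, the per-instance state gauges can be sized with simple arithmetic based on the label sets listed earlier. The sketch below estimates the number of active series at one point in time. The function name and the conditions_per_cr parameter are assumptions, since the number of condition types per resource is not stated here, and label churn (for example a phase change replacing one *_info series with another) is ignored.

```python
def estimate_state_series(
    nextclouds: int,
    instances: int,
    pools: int,
    helmreleases: int,
    conditions_per_cr: int,
) -> dict[str, int]:
    """Estimate active per-instance gauge series for a given fleet size."""
    return {
        "nextcloud_info": nextclouds,
        "nextcloudinstance_info": instances,
        "nextcloudpool_info": pools,
        "assignment_info": instances,  # exactly one series per NextcloudInstance
        "pool_replicas": pools * 4,  # desired / ready / unassigned / assigned
        "nextcloud_condition": nextclouds * conditions_per_cr,
        "nextcloudinstance_condition": instances * conditions_per_cr,
        "helmrelease_ready": helmreleases,
    }


if __name__ == "__main__":
    # Hypothetical fleet: 100 tenants, 120 instances, 3 pools, 4 condition types.
    estimate = estimate_state_series(100, 120, 3, 120, 4)
    print(sum(estimate.values()))
```

The condition gauges dominate the total, so the number of condition types matters more for scrape size than any single info metric.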
Troubleshooting
- Metrics endpoint returns 404: METRICS_ENABLED is not true on the Deployment, or the collector server failed to start (check operator logs for "Prometheus metrics server started").
- ServiceMonitor not picked up: check that the label selector on the Service (app.kubernetes.io/name: nextcloud-operator) matches what the ServiceMonitor expects, and that the Prometheus instance's serviceMonitorSelector allows the chart's labels.
- Dashboards don't appear in Grafana: the sidecar only scans ConfigMaps labelled grafana_dashboard: "1". Override with metrics.grafana.dashboardLabels if your Grafana uses a different label.
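When it is unclear whether the problem lies with the operator or with the scrape path, it can help to fetch the endpoint directly and check that operator metrics are present. The sketch below assumes the endpoint has been made reachable at http://localhost:9090/metrics (for example through a port-forward); the URL, the /metrics path and the function names are assumptions, not confirmed operator behaviour.

```python
import urllib.request

PREFIX = "nextcloud_operator_"


def operator_metric_names(exposition_text: str) -> set[str]:
    """Return the distinct operator metric names found in a scrape payload."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is "<name>{labels} <value>" or "<name> <value>".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(PREFIX):
            names.add(name)
    return names


def fetch_metrics(url: str = "http://localhost:9090/metrics", timeout: float = 5.0) -> str:
    """Fetch the raw payload; urlopen raises HTTPError on a 404 response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    found = operator_metric_names(fetch_metrics())
    print(f"{len(found)} operator metrics exposed")
    for name in sorted(found):
        print(name)
```

If the script lists operator metrics but Prometheus shows none, the issue is more likely in the ServiceMonitor or selector configuration than in the operator itself.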