Monitoring

The operator exports Prometheus metrics about its own reconciliation loops and the state of every managed Nextcloud, NextcloudInstance, NextcloudPool and HelmRelease. A ServiceMonitor and two Grafana dashboards (Overview + Detail) ship with the chart and are also available as optional flat-YAML manifests.

Metrics are opt-in on both install paths.

Quick start

Helm install

helm upgrade --install nextcloud-operator ./chart \
  --set metrics.enabled=true \
  --set metrics.serviceMonitor.enabled=true \
  --set metrics.grafana.enabled=true

This creates:

Container port metrics/9090 and environment variables on the Deployment
A Service exposing port 9090
A ServiceMonitor (requires the Prometheus Operator CRDs)
Two ConfigMaps, one per dashboard, labelled grafana_dashboard: "1" so the Grafana sidecar picks them up automatically

Flat YAML install

# 1. Enable metrics on the operator Deployment (sets METRICS_ENABLED to "true")
kubectl -n nextcloud-operator-system set env deploy/nextcloud-operator METRICS_ENABLED=true

# 2. Apply the monitoring stack
kubectl apply -f deploy/monitoring.yaml \
              -f deploy/dashboard-overview.yaml \
              -f deploy/dashboard-detail.yaml

deploy/monitoring.yaml contains the Service and ServiceMonitor. The two dashboard ConfigMaps are regenerated from the chart via make sync-dashboards.

Prerequisites

Component	Required for
Prometheus + prometheus-operator CRDs	ServiceMonitor scraping
Grafana with the sidecar dashboard loader	Auto-loading the shipped dashboards

If you run a stand-alone Grafana, import the JSON from chart/dashboards/ directly.

Exposed metrics

All metrics are prefixed with nextcloud_operator_.

State gauges (updated by the background collector, every 60s by default)

Metric	Type	Labels	Meaning
`nextclouds_total`	Gauge	`phase`	Nextcloud CR count by phase
`nextcloudinstances_total`	Gauge	`phase`	NextcloudInstance CR count by phase
`nextcloudpools_total`	Gauge	`phase`	NextcloudPool count by phase
`nextcloudprofiles_total`	Gauge	—	NextcloudProfile count
`pool_replicas`	Gauge	`pool`, `type` (desired/ready/unassigned/assigned)	Per-pool replica breakdown
`nextcloud_info`	Gauge	`name`, `namespace`, `phase`, `instance_name`, `instance_namespace`, `profile`, `url`	Per-CR identity (always 1)
`nextcloudinstance_info`	Gauge	`name`, `namespace`, `phase`, `assigned_to` (owning Nextcloud name, empty when spare), `pool`, `profile`, `url`	Per-NCI identity (always 1)
`nextcloudpool_info`	Gauge	`name`, `phase`, `desired`, `ready`, `unassigned`, `assigned`	Per-pool identity (always 1)
`assignment_info`	Gauge	`nextcloud_namespace`, `nextcloud_name`, `nextcloud_url`, `instance_namespace`, `instance_name`, `profile`, `pool`, `state` (`assigned`/`spare`)	One series per NextcloudInstance. Lets dashboards JOIN tenant identifiers (URL, name) onto K8s workload metrics by `instance_namespace`. Spare pool instances appear with empty `nextcloud_*` labels.
`nextcloud_condition`	Gauge	`name`, `namespace`, `type`	Condition state (1/0/-1 = True/False/Unknown)
`nextcloudinstance_condition`	Gauge	`name`, `namespace`, `type`	Condition state (1/0/-1 = True/False/Unknown)
`helmrelease_ready`	Gauge	`namespace`, `name`	HelmRelease Ready condition (1/0/-1)
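To sanity-check these gauges outside Grafana, the scraped text can be tallied with a few lines of Python. The sketch below is illustrative only: the sample payload, the tenant names (`acme`, `beta`) and the `tenants` namespace are made up, and the parser handles just the simple subset of the Prometheus text format shown here.

```python
import re

# Illustrative sample only; real output depends on your fleet.
SAMPLE = """\
# TYPE nextcloud_operator_nextclouds_total gauge
nextcloud_operator_nextclouds_total{phase="Ready"} 12
nextcloud_operator_nextclouds_total{phase="Failed"} 1
# TYPE nextcloud_operator_nextcloud_condition gauge
nextcloud_operator_nextcloud_condition{name="acme",namespace="tenants",type="Ready"} 1
nextcloud_operator_nextcloud_condition{name="beta",namespace="tenants",type="Ready"} -1
"""

# name{labels} value [timestamp]
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Encoding documented in the table above: 1/0/-1 = True/False/Unknown.
CONDITION_STATES = {1: "True", 0: "False", -1: "Unknown"}


def parse(text):
    """Yield (metric_name, labels, value) for every sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # ignore anything this simple parser cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def phase_counts(text, metric="nextcloud_operator_nextclouds_total"):
    """Map phase -> CR count for one of the *_total gauges."""
    return {
        labels["phase"]: int(value)
        for name, labels, value in parse(text)
        if name == metric
    }


def condition_states(text, metric="nextcloud_operator_nextcloud_condition"):
    """Map (namespace, name, type) -> "True" / "False" / "Unknown"."""
    return {
        (labels["namespace"], labels["name"], labels["type"]):
            CONDITION_STATES[int(value)]
        for name, labels, value in parse(text)
        if name == metric
    }


print(phase_counts(SAMPLE))
print(condition_states(SAMPLE))
```

Passing `metric="nextcloud_operator_nextcloudinstance_condition"` to `condition_states` decodes the NextcloudInstance conditions the same way.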

Event counters

Metric	Type	Labels	Meaning
`reconcile_total`	Counter	`resource`, `result` (success/error/temporary_error/permanent_error)	One increment per kopf handler invocation
`errors_total`	Counter	`resource`, `stage` (validation/db_provision/helmrelease/occ/maintenance/…)	Categorised error accounting
`pool_scale_total`	Counter	`pool`, `direction` (up/down)	Pool instance create/delete events
`pool_assignment_total`	Counter	`pool`, `result` (success/conflict/no_match)	Outcome of pool match attempts
`maintenance_task_total`	Counter	`task`, `result` (success/error)	Periodic and post-upgrade OCC task runs
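Counters like these are normally read as rates rather than raw values. As a hedged sketch, the PromQL strings below show one way to query them, wrapped in a small helper that builds instant-query URLs for the Prometheus HTTP API. The base URL, the 5m window and the constant names are assumptions for illustration, not values shipped with the operator.

```python
from urllib.parse import urlencode

# Assumption: a Prometheus server reachable here (e.g. via port-forward).
PROMETHEUS_URL = "http://localhost:9090"

# Reconciliations per second, split by resource and outcome.
RECONCILE_RATE = (
    "sum by (resource, result) "
    "(rate(nextcloud_operator_reconcile_total[5m]))"
)

# Errors per second, split by resource and stage.
ERROR_RATE = (
    "sum by (resource, stage) "
    "(rate(nextcloud_operator_errors_total[5m]))"
)

# Share of pool match attempts that ended in a conflict, per pool.
ASSIGNMENT_CONFLICT_RATIO = (
    'sum by (pool) (rate(nextcloud_operator_pool_assignment_total'
    '{result="conflict"}[5m]))'
    " / "
    "sum by (pool) (rate(nextcloud_operator_pool_assignment_total[5m]))"
)


def instant_query_url(expr, base=PROMETHEUS_URL):
    """Build a GET URL for Prometheus' /api/v1/query endpoint."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


for expr in (RECONCILE_RATE, ERROR_RATE, ASSIGNMENT_CONFLICT_RATIO):
    print(instant_query_url(expr))
```

The query strings can also be pasted directly into a Grafana panel or the Prometheus expression browser.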

Latency histograms

Metric	Labels	Covers
`operation_duration_seconds`	`operation`, `result`	Generic operation timer; used for `db_provision`, `helmrelease_create_or_update`, `occ_command`, `maintenance_task`
`instance_ready_duration_seconds`	`profile`	Seconds from NCI creation to `phase=Ready`
`nextcloud_assignment_duration_seconds`	`pool`	Seconds from Nextcloud creation to first pool assignment
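Percentiles are derived from these histograms at query time. The helper below is a minimal sketch that assumes classic Prometheus histograms (a `_bucket` series with an `le` label per metric); the function name and the 5m window are illustrative choices.

```python
def quantile_query(metric, q, by=(), window="5m"):
    """Return a histogram_quantile() expression for an operator histogram.

    metric: name without the nextcloud_operator_ prefix
    q:      quantile between 0 and 1, e.g. 0.95
    by:     extra labels to keep besides the mandatory `le`
    """
    group = ", ".join(("le",) + tuple(by))
    return (
        f"histogram_quantile({q}, sum by ({group}) "
        f"(rate(nextcloud_operator_{metric}_bucket[{window}])))"
    )


# p50 / p95 / p99 of time-to-Ready, per profile.
for q in (0.5, 0.95, 0.99):
    print(quantile_query("instance_ready_duration_seconds", q, by=("profile",)))

# p95 per operation kind.
print(quantile_query("operation_duration_seconds", 0.95, by=("operation",)))
```

Keeping `le` in the `by` clause matters: `histogram_quantile` needs the bucket boundaries to interpolate, so aggregating them away yields no result.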

Dashboards

Overview (`uid: nextcloud-operator-overview`)

Fleet-wide view. Panels:

Totals (Nextclouds, NCIs, Ready, Failed)
Phase distribution (pie)
Reconciliation rate per resource/result
Error rate per resource/stage
instance_ready_duration_seconds p50/p95/p99
Operation p95 by operation
Pool replicas + assignment rate
HelmRelease Ready/Not-Ready/Unknown counts
Failed NextcloudInstances table
Tenant ↔ Instance Assignment table (joins tenant URL/name to instance via assignment_info)
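The tenant join behind the last panel can be reproduced in custom panels. The sketch below builds such a PromQL expression under several assumptions that are not confirmed by the shipped dashboards: the workload expression is aggregated to a `namespace` label, `assignment_info` has the value 1 like the other `*_info` gauges, and each instance namespace holds at most one assigned instance (otherwise the many-to-one match fails). The helper name and the cAdvisor metric are illustrative.

```python
def tenant_join_query(workload_expr,
                      tenant_labels=("nextcloud_name", "nextcloud_url")):
    """Attach tenant labels to a per-namespace workload expression.

    label_replace() copies instance_namespace into a `namespace` label so
    that both sides of the join share a matching label.
    """
    labels = ", ".join(tenant_labels)
    info = (
        'label_replace(nextcloud_operator_assignment_info{state="assigned"}, '
        '"namespace", "$1", "instance_namespace", "(.+)")'
    )
    return f"({workload_expr}) * on (namespace) group_left ({labels}) {info}"


# Assumption: cAdvisor container metrics are scraped in this cluster.
memory_by_namespace = (
    'sum by (namespace) (container_memory_working_set_bytes{container!=""})'
)
print(tenant_join_query(memory_by_namespace))
```

Filtering on `state="assigned"` drops spare pool instances, whose `nextcloud_*` labels are empty.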

Detail (`uid: nextcloud-operator-detail`)

Per-instance drill-down. Template variables: $namespace, $instance. Panels are filtered by these variables where per-instance labels exist (info and condition gauges, HelmRelease gauge). Reconcile and error series are shown at the operator level for context: by design they carry no name labels, which keeps cardinality bounded.

Tuning

metrics.collectorInterval (Helm) / METRICS_COLLECTOR_INTERVAL (env): Background collector frequency in seconds. Default 60. Drop to 15–30 if you want faster dashboard refresh; raise to 300 for large fleets.
metrics.serviceMonitor.interval: Prometheus scrape interval. Default 60s.
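The two intervals add up when estimating how stale a state gauge can be. As a rough model (an assumption for illustration, not a documented guarantee), a change waits up to one collector interval before the gauge is updated and up to one scrape interval before Prometheus stores it; the Grafana refresh interval comes on top.

```python
def worst_case_staleness(collector_interval=60, scrape_interval=60):
    """Rough upper bound, in seconds, on state-gauge age at scrape time.

    Assumes the collector and the scraper run unsynchronised, so both
    waits can stack. Event counters are not affected by the collector.
    """
    return collector_interval + scrape_interval


for collector in (15, 60, 300):
    print(f"collector={collector}s -> up to {worst_case_staleness(collector)}s")
```

Under this model, lowering only the collector interval to 15 brings the bound from 120 to 75 seconds; getting below 60 also requires a shorter scrape interval.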

Cardinality notes

Per-instance labels exist only on state gauges (*_info, *_condition, helmrelease_ready, assignment_info). Counters and histograms carry resource / operation / stage / pool / profile, so their series count is bounded by the number of pools, profiles and operation kinds, not by the number of tenants. Counter and histogram volume therefore stays flat as the fleet grows; only the state gauges scale with fleet size.

assignment_info emits exactly one series per NextcloudInstance, so its cardinality equals fleet size. The nextcloud_url label is high-cardinality but bounded by tenant count, and it only changes when a customer renames their URL. That makes it suitable for table panels and ad-hoc joins, but not for PromQL by () groupings.
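To see which metric families actually grow with the fleet, count series per metric name in a scrape. The sketch below works on any text-format payload; the sample data and the tenant names in it are invented for illustration.

```python
from collections import Counter

# Illustrative sample only; a real scrape contains many more families.
SCRAPE = """\
# TYPE nextcloud_operator_assignment_info gauge
nextcloud_operator_assignment_info{instance_name="nci-1",nextcloud_name="acme",state="assigned"} 1
nextcloud_operator_assignment_info{instance_name="nci-2",nextcloud_name="beta",state="assigned"} 1
nextcloud_operator_assignment_info{instance_name="nci-3",nextcloud_name="",state="spare"} 1
# TYPE nextcloud_operator_reconcile_total counter
nextcloud_operator_reconcile_total{resource="nextcloud",result="success"} 420
nextcloud_operator_nextcloudprofiles_total 2
"""


def series_per_metric(text):
    """Count sample lines per metric name in Prometheus text format."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at the first "{" or at the first whitespace.
        name = line.split("{", 1)[0].split(None, 1)[0]
        counts[name] += 1
    return counts


for name, count in series_per_metric(SCRAPE).most_common():
    print(f"{count:5d}  {name}")
```

Families whose count tracks the number of instances are the per-instance state gauges; the counter families should stay roughly constant as tenants are added.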

Troubleshooting

Metrics endpoint returns 404: METRICS_ENABLED is not true on the Deployment, or the collector server failed to start (check that the operator logs contain "Prometheus metrics server started").
ServiceMonitor not picked up: check that the labels on the Service (app.kubernetes.io/name: nextcloud-operator) match the ServiceMonitor's selector, and that the Prometheus instance's serviceMonitorSelector allows the chart's labels.
Dashboards don't appear in Grafana: the sidecar only scans ConfigMaps labelled grafana_dashboard: "1". Override with metrics.grafana.dashboardLabels if your Grafana uses a different label.