Monitoring

The operator exports Prometheus metrics about its own reconciliation loops and the state of every managed Nextcloud, NextcloudInstance, NextcloudPool and HelmRelease. A ServiceMonitor and two Grafana dashboards (Overview + Detail) ship with the chart and as optional flat-YAML manifests.

Metrics are opt-in on both install paths.

Quick start

Helm install

helm upgrade --install nextcloud-operator ./chart \
  --set metrics.enabled=true \
  --set metrics.serviceMonitor.enabled=true \
  --set metrics.grafana.enabled=true

This creates:

  • Container port metrics/9090 and environment variables on the Deployment
  • A Service exposing port 9090
  • A ServiceMonitor (requires the Prometheus Operator CRDs)
  • Two ConfigMaps, one per dashboard, labelled grafana_dashboard: "1" so the Grafana sidecar picks them up automatically

Flat YAML install

# 1. Enable metrics on the operator Deployment (sets METRICS_ENABLED to "true")
kubectl -n nextcloud-operator-system set env deploy/nextcloud-operator METRICS_ENABLED=true

# 2. Apply the monitoring stack
kubectl apply -f deploy/monitoring.yaml \
              -f deploy/dashboard-overview.yaml \
              -f deploy/dashboard-detail.yaml

deploy/monitoring.yaml contains the Service and ServiceMonitor. The two dashboard ConfigMaps are regenerated from the chart via make sync-dashboards.

Prerequisites

| Component | Required for |
| --- | --- |
| Prometheus + prometheus-operator CRDs | ServiceMonitor scraping |
| Grafana with the sidecar dashboard loader | Auto-loading the shipped dashboards |

If you run a stand-alone Grafana, import the JSON from chart/dashboards/ directly.

Exposed metrics

All metrics are prefixed with nextcloud_operator_; the tables below omit the prefix.

State gauges (updated by the background collector every 60s by default)

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| nextclouds_total | Gauge | phase | Nextcloud CR count by phase |
| nextcloudinstances_total | Gauge | phase | NextcloudInstance CR count by phase |
| nextcloudpools_total | Gauge | phase | NextcloudPool count by phase |
| nextcloudprofiles_total | Gauge | (none) | NextcloudProfile count |
| pool_replicas | Gauge | pool, type (desired/ready/unassigned/assigned) | Per-pool replica breakdown |
| nextcloud_info | Gauge | name, namespace, phase, instance_name, instance_namespace, profile, url | Per-CR identity (always 1) |
| nextcloudinstance_info | Gauge | name, namespace, phase, assigned_to (owning Nextcloud name, empty when spare), pool, profile, url | Per-NCI identity (always 1) |
| nextcloudpool_info | Gauge | name, phase, desired, ready, unassigned, assigned | Per-pool identity (always 1) |
| assignment_info | Gauge | nextcloud_namespace, nextcloud_name, nextcloud_url, instance_namespace, instance_name, profile, pool, state (assigned/spare) | One series per NextcloudInstance. Lets dashboards JOIN tenant identifiers (URL, name) onto K8s workload metrics by instance_namespace. Spare pool instances appear with empty nextcloud_* labels. |
| nextcloud_condition | Gauge | name, namespace, type | Condition state (1/0/-1 = True/False/Unknown) |
| nextcloudinstance_condition | Gauge | name, namespace, type | Condition state (1/0/-1 = True/False/Unknown) |
| helmrelease_ready | Gauge | namespace, name | HelmRelease Ready condition (1/0/-1) |
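
The 1/0/-1 encoding used by the condition gauges and helmrelease_ready can be read as the mapping below. This is a minimal sketch of the convention stated in the table, not the operator's actual code; the function name condition_value is hypothetical, and it assumes the usual Kubernetes condition status strings "True", "False" and "Unknown".

```python
def condition_value(status: str) -> int:
    """Map a Kubernetes condition status string to the gauge value.

    "True" -> 1, "False" -> 0, anything else (including "Unknown") -> -1.
    """
    return {"True": 1, "False": 0}.get(status, -1)


# Example: a Ready condition that has not been evaluated yet.
ready_gauge = condition_value("Unknown")
```

Treating every unrecognised status as -1 means a missing or malformed condition is counted as Unknown rather than as False.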

Event counters

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| reconcile_total | Counter | resource, result (success/error/temporary_error/permanent_error) | One increment per kopf handler invocation |
| errors_total | Counter | resource, stage (validation/db_provision/helmrelease/occ/maintenance/…) | Categorised error accounting |
| pool_scale_total | Counter | pool, direction (up/down) | Pool instance create/delete events |
| pool_assignment_total | Counter | pool, result (success/conflict/no_match) | Outcome of pool match attempts |
| maintenance_task_total | Counter | task, result (success/error) | Periodic and post-upgrade OCC task runs |

Latency histograms

| Metric | Labels | Covers |
| --- | --- | --- |
| operation_duration_seconds | operation, result | Generic operation timer; used for db_provision, helmrelease_create_or_update, occ_command, maintenance_task |
| instance_ready_duration_seconds | profile | Seconds from NCI creation to phase=Ready |
| nextcloud_assignment_duration_seconds | pool | Seconds from Nextcloud creation to first pool assignment |
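
Histograms like these are usually read as quantiles (p50, p95, p99) rather than raw bucket counts. The sketch below shows roughly how a quantile is estimated from cumulative buckets by linear interpolation, which is the general idea behind PromQL's histogram_quantile(). It is an illustration only: the function name estimate_quantile and the bucket boundaries are made up and do not reflect the operator's real bucket layout. It assumes at least one finite bucket followed by a +Inf bucket.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls into the +Inf bucket: the best available
                # answer is the highest finite upper bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical instance_ready_duration_seconds buckets: (le, cumulative count).
sample = [(1.0, 10), (5.0, 50), (10.0, 90), (math.inf, 100)]
p50 = estimate_quantile(0.50, sample)
p95 = estimate_quantile(0.95, sample)
```

With the sample above, p95 lands in the +Inf bucket, so the estimate is capped at the highest finite bound. A quantile that sits at the top bucket boundary on a dashboard therefore means "at least this long", not an exact value.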

Dashboards

Overview (uid: nextcloud-operator-overview)

Fleet-wide view. Panels:

  • Totals (Nextclouds, NCIs, Ready, Failed)
  • Phase distribution (pie)
  • Reconciliation rate per resource/result
  • Error rate per resource/stage
  • instance_ready_duration_seconds p50/p95/p99
  • Operation p95 by operation
  • Pool replicas + assignment rate
  • HelmRelease Ready/Not-Ready/Unknown counts
  • Failed NextcloudInstances table
  • Tenant ↔ Instance Assignment table (joins tenant URL/name to instance via assignment_info)
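
The join behind the Tenant ↔ Instance Assignment table can be pictured as a lookup keyed on instance_namespace. The following is a sketch in plain Python rather than the dashboard's actual query. The function name, the sample label values and the workload numbers are hypothetical, and it assumes one NextcloudInstance per namespace.

```python
from typing import Dict, List


def join_by_instance_namespace(
    assignments: List[Dict[str, str]],
    workload: Dict[str, float],
) -> List[Dict[str, object]]:
    """Attach tenant identifiers from assignment_info to per-namespace values."""
    by_namespace = {a["instance_namespace"]: a for a in assignments}
    rows = []
    for namespace, value in workload.items():
        info = by_namespace.get(namespace)
        if info is None:
            continue  # namespace is not managed by the operator
        rows.append(
            {
                "instance_namespace": namespace,
                "nextcloud_name": info["nextcloud_name"],
                "nextcloud_url": info["nextcloud_url"],
                "state": info["state"],
                "value": value,
            }
        )
    return rows


# Hypothetical assignment_info label sets: one assigned, one spare instance.
assignments = [
    {"instance_namespace": "nci-a", "nextcloud_name": "acme",
     "nextcloud_url": "https://acme.example.com", "state": "assigned"},
    {"instance_namespace": "nci-b", "nextcloud_name": "",
     "nextcloud_url": "", "state": "spare"},
]
# Hypothetical per-namespace workload metric, e.g. CPU cores in use.
workload = {"nci-a": 0.42, "nci-b": 0.05, "kube-system": 1.3}

rows = join_by_instance_namespace(assignments, workload)
```

Spare instances stay in the result with empty tenant fields, which matches how they appear in assignment_info.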

Detail (uid: nextcloud-operator-detail)

Per-instance drill-down. Template variables: $namespace, $instance. Panels are filtered by these variables wherever per-instance labels exist (info and condition gauges, HelmRelease gauge). Reconcile and error series are shown at the operator level for context; by design they carry no name labels, which keeps cardinality bounded.

Tuning

  • metrics.collectorInterval (Helm) / METRICS_COLLECTOR_INTERVAL (env): Background collector frequency in seconds. Default 60. Drop to 15–30 if you want faster dashboard refresh; raise to 300 for large fleets.
  • metrics.serviceMonitor.interval: Prometheus scrape interval. Default 60s.
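
The two intervals add up. A state change that happens just after a collector run waits up to one collector interval before the gauge is updated, and then up to one scrape interval before Prometheus sees it. A rough upper bound, ignoring scrape duration and Grafana's own refresh interval, can be sketched as follows (the function name is hypothetical):

```python
def worst_case_delay_seconds(collector_interval: int = 60,
                             scrape_interval: int = 60) -> int:
    """Upper bound on how stale a state gauge can be in Prometheus."""
    return collector_interval + scrape_interval


default_delay = worst_case_delay_seconds()          # both defaults
faster_collector = worst_case_delay_seconds(15, 60)  # collector lowered only
```

Lowering only the collector interval therefore has limited effect while the scrape interval stays at its default.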

Cardinality notes

Per-instance labels exist only on state gauges (*_info, *_condition, helmrelease_ready, assignment_info). Counters and histograms carry only labels such as resource, operation, stage, pool and profile, whose value sets are bounded by the number of pools, profiles and operation kinds, not by the number of tenants. This keeps the counter and histogram series count flat as the fleet grows; only the state gauges scale with fleet size.

assignment_info emits exactly one series per NextcloudInstance, so its cardinality equals fleet size. The nextcloud_url label is high-cardinality, but it is bounded by tenant count and changes only when a customer renames their URL. That makes it suitable for table panels and ad-hoc joins, but not for PromQL by () groupings.
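
For capacity planning, the per-instance part of the series count can be estimated from fleet size. The sketch below is a back-of-the-envelope estimate, not an exact accounting. The function name is hypothetical, the number of condition types per resource is an assumption to replace with your own value, and pool and profile gauges are left out because they scale with the number of pools and profiles rather than tenants.

```python
from typing import Dict


def per_instance_series(nextclouds: int, instances: int,
                        helmreleases: int, condition_types: int) -> Dict[str, int]:
    """Estimate the series count of the gauges that scale with fleet size."""
    return {
        "nextcloud_info": nextclouds,
        "nextcloud_condition": nextclouds * condition_types,
        "nextcloudinstance_info": instances,
        "nextcloudinstance_condition": instances * condition_types,
        "assignment_info": instances,  # exactly one series per NextcloudInstance
        "helmrelease_ready": helmreleases,
    }


# Hypothetical fleet: 100 tenants, 120 instances (20 spare), 3 condition types.
estimate = per_instance_series(nextclouds=100, instances=120,
                               helmreleases=120, condition_types=3)
total_series = sum(estimate.values())
```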

Troubleshooting

  • Metrics endpoint returns 404: METRICS_ENABLED is not true on the Deployment, or the collector server failed to start (check operator logs for "Prometheus metrics server started").
  • ServiceMonitor not picked up: check that the labels on the Service (app.kubernetes.io/name: nextcloud-operator) match the ServiceMonitor's selector, and that the Prometheus instance's serviceMonitorSelector allows the chart's labels.
  • Dashboards don't appear in Grafana: the sidecar only scans ConfigMaps labelled grafana_dashboard: "1". Override with metrics.grafana.dashboardLabels if your Grafana uses a different label.
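
When checking whether the endpoint serves operator metrics at all, it can help to filter the scraped text for the nextcloud_operator_ prefix. The helper below is a small sketch that parses the Prometheus text exposition format. The function name and the sample payload are hypothetical, and fetching the text from port 9090 (for example through a port-forward) is left to you.

```python
from typing import Set

PREFIX = "nextcloud_operator_"


def operator_metric_names(exposition_text: str) -> Set[str]:
    """Return the distinct operator metric names found in a scrape payload."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(PREFIX):
            names.add(name)
    return names


# Hypothetical scrape payload for illustration.
sample_payload = """\
# TYPE nextcloud_operator_reconcile_total counter
nextcloud_operator_reconcile_total{resource="nextcloud",result="success"} 12
nextcloud_operator_nextcloudprofiles_total 3
python_gc_objects_collected_total{generation="0"} 100
"""

found = operator_metric_names(sample_payload)
```

An empty result from a payload that otherwise contains metrics suggests the process is serving metrics but the operator's collector has not registered anything yet.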