Monitoring and Alerting

The operator exposes Prometheus metrics on port 8080 at /metrics, covering operator lifecycle, backup/restore job outcomes, and error tracking. Pre-built alerting rules and a Grafana dashboard are included.

Scraping Setup

ServiceMonitor (Prometheus Operator)

Enable via Helm:

helm install backup-operator ./charts/backup-operator \
  --set metrics.serviceMonitor.enabled=true

If your Prometheus instance selects ServiceMonitors by label (`serviceMonitorSelector`), add matching labels:

helm install backup-operator ./charts/backup-operator \
  --set metrics.serviceMonitor.enabled=true \
  --set metrics.serviceMonitor.labels.prometheus=kube-prometheus
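
The same settings can be kept in a values file instead of `--set` flags. A minimal sketch that mirrors the keys used above (the file name values.yaml is an assumption):

```yaml
# values.yaml: equivalent to the --set flags shown above
metrics:
  serviceMonitor:
    enabled: true
    labels:
      # Must match the label your Prometheus instance selects on
      prometheus: kube-prometheus
```

Pass the file to Helm with `-f values.yaml` in place of the individual `--set` flags.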

Manual Scrape Config

scrape_configs:
  - job_name: 'bnerd-backup-operator'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - bnerd-backup-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
        action: keep
        regex: backup-operator
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics

Quick Test

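# port-forward takes a resource name, not a label selector, so look up the pod first
kubectl port-forward -n bnerd-backup-system \
  "$(kubectl get pods -n bnerd-backup-system -l app.kubernetes.io/name=backup-operator -o name | head -n 1)" 8080:8080
# In a second terminal (port-forward blocks while it runs):
curl -s localhost:8080/metrics | grep bnerd_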

Key Metrics

The most important metrics for day-to-day monitoring are the job outcome gauges, updated every 5 minutes by the reconciliation timer:

Metric	Type	Description
`bnerd_volumebackup_last_successful_backup_timestamp`	Gauge	Unix timestamp of most recent successful Job
`bnerd_volumebackup_last_failed_backup_timestamp`	Gauge	Unix timestamp of most recent failed Job
`bnerd_volumebackup_last_job_success`	Gauge	`1` if last Job succeeded, `0` if failed
`bnerd_volumebackup_last_job_duration_seconds`	Gauge	Duration of the most recent completed Job

All four metrics carry these labels:

namespace -- the VolumeBackup's namespace
volumebackup -- the VolumeBackup's name
job_type -- one of backup, check, or restore-test

PromQL Examples

Stale Backup Detection

# No successful backup in the last 24 hours
(time() - bnerd_volumebackup_last_successful_backup_timestamp{job_type="backup"}) > 86400

# No successful backup in the last 48 hours (critical)
(time() - bnerd_volumebackup_last_successful_backup_timestamp{job_type="backup"}) > 172800

Broken Backups

# All backups where the last run failed
bnerd_volumebackup_last_job_success{job_type="backup"} == 0

# Check jobs that are failing
bnerd_volumebackup_last_job_success{job_type="check"} == 0

# Restore tests that are failing
bnerd_volumebackup_last_job_success{job_type="restore-test"} == 0

Backup Duration

# Current backup durations by volume
bnerd_volumebackup_last_job_duration_seconds{job_type="backup"}

# Backups taking longer than 1 hour
bnerd_volumebackup_last_job_duration_seconds{job_type="backup"} > 3600

Operator Health

# Operator uptime
time() - bnerd_backup_operator_start_time_seconds

# Recent restarts
increase(bnerd_backup_operator_restarts_total[1h])

# Error rate by type
sum(rate(bnerd_backup_operator_errors_total[5m])) by (resource_type, operation, error_type)

Restore Monitoring

# Active restores
bnerd_volumerestore_active

# Recent restore failures
rate(bnerd_volumerestore_jobs_failed_total[5m]) > 0

Alerting Rules

Enable via Helm

helm install backup-operator ./charts/backup-operator \
  --set metrics.prometheusRule.enabled=true

Configure the staleness thresholds (values in seconds):

helm install backup-operator ./charts/backup-operator \
  --set metrics.prometheusRule.enabled=true \
  --set metrics.prometheusRule.staleThresholdWarning=93600 \
  --set metrics.prometheusRule.staleThresholdCritical=180000
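
For reference, the same thresholds expressed as Helm values. This is a sketch that assumes the values are in seconds, which is consistent with the 26-hour and 50-hour conditions listed in the alert reference; the file name values.yaml is an assumption.

```yaml
# values.yaml: equivalent to the --set flags shown above
metrics:
  prometheusRule:
    enabled: true
    staleThresholdWarning: 93600    # 26 h * 3600 s
    staleThresholdCritical: 180000  # 50 h * 3600 s
```

Choosing thresholds slightly above a multiple of the backup interval (here 24 h plus a 2 h margin) leaves room for a run that starts or finishes late.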

Standalone Apply

kubectl apply -f manifests/prometheus-rules.yaml

Alert Reference

Alert	Severity	Condition
`BackupJobFailed`	warning	Last backup job failed
`BackupStale`	warning	No successful backup in 26 hours
`BackupStaleCritical`	critical	No successful backup in 50 hours
`CheckJobFailed`	warning	Repository integrity check failed
`RestoreTestFailed`	warning	Automated restore test failed
`BackupSlowJob`	info	Backup job took longer than 2 hours
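
As a rough illustration of how a condition like `BackupStale` could be expressed, the sketch below combines the timestamp gauge from Key Metrics with the 26-hour threshold (93600 s). The group name, alert name, `for` duration, and annotation text are assumptions; the rules shipped in manifests/prometheus-rules.yaml may be written differently.

```yaml
# Illustrative Prometheus rule file, not the shipped rule definition
groups:
  - name: backup-staleness-example        # hypothetical group name
    rules:
      - alert: ExampleBackupStale         # hypothetical alert name
        # Fires when the newest successful backup is older than 26 hours
        expr: |
          (time() - bnerd_volumebackup_last_successful_backup_timestamp{job_type="backup"}) > 93600
        for: 10m                          # assumed hold time before firing
        labels:
          severity: warning
        annotations:
          summary: "No successful backup for {{ $labels.namespace }}/{{ $labels.volumebackup }} in 26 hours"
```

The `namespace` and `volumebackup` labels carried by the metric make it possible to name the affected resource in the annotation.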

Custom Rules

Add custom alerting rules alongside the defaults in your Helm values:

metrics:
  prometheusRule:
    enabled: true
    additionalRules:
      - alert: MyCustomAlert
        expr: |
          my_custom_metric > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Custom alert fired"
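
Custom rules can also target the operator's own metrics. The sketch below assumes `additionalRules` accepts standard Prometheus alerting rule entries, as in the example above. It raises a critical alert when the last backup failed in one namespace. The alert name, the `production` namespace, and the `for` duration are hypothetical.

```yaml
metrics:
  prometheusRule:
    enabled: true
    additionalRules:
      # Hypothetical rule: escalate failed backups in a single namespace
      - alert: ProductionBackupFailed
        expr: |
          bnerd_volumebackup_last_job_success{job_type="backup", namespace="production"} == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Last backup job failed in the production namespace"
```

Because the outcome gauges are refreshed every 5 minutes, a `for` duration shorter than that interval adds little; 15m here spans a few refresh cycles.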

Grafana Dashboard

A ready-to-import Grafana dashboard is available at examples/grafana-dashboard.json.

Import

In Grafana, go to Dashboards > Import
Upload or paste the JSON file
Select your Prometheus data source
Click Import

The dashboard includes panels for:

Backup success/failure overview (stat panel)
Time since last successful backup per volume
Backup duration trends
Check and restore-test status
Operator health and error rates
Reconciliation performance

Full Metrics Reference

For the complete list of all metrics, see the API Reference overview.