Monitoring and Alerting

The operator exposes Prometheus metrics on port 8080 at /metrics, covering operator lifecycle, backup/restore job outcomes, and error tracking. Pre-built alerting rules and a Grafana dashboard are included.

Scraping Setup

ServiceMonitor (Prometheus Operator)

Enable via Helm:

helm install backup-operator ./charts/backup-operator \
  --set metrics.serviceMonitor.enabled=true

If your Prometheus instance selects ServiceMonitors by label, add matching labels:

helm install backup-operator ./charts/backup-operator \
  --set metrics.serviceMonitor.enabled=true \
  --set metrics.serviceMonitor.labels.prometheus=kube-prometheus
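
With the flag enabled, the chart renders a ServiceMonitor roughly like the following sketch. The exact names and label set depend on your release name and chart version; the values here are assumptions for illustration:

```yaml
# Illustrative sketch of the rendered ServiceMonitor -- actual names and
# labels come from the Helm release (assumed here to be "backup-operator").
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backup-operator
  namespace: bnerd-backup-system
  labels:
    prometheus: kube-prometheus   # only present if set via metrics.serviceMonitor.labels
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: backup-operator
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s               # assumed scrape interval
```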

Manual Scrape Config

scrape_configs:
  - job_name: 'bnerd-backup-operator'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - bnerd-backup-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
        action: keep
        regex: backup-operator
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics

Quick Test

kubectl port-forward -n bnerd-backup-system \
  deploy/backup-operator 8080:8080
curl -s localhost:8080/metrics | grep bnerd_
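
If scraping works, the grep returns the operator's metric families. Illustrative output only -- the label values and sample values below are placeholders, not real data:

```
bnerd_volumebackup_last_successful_backup_timestamp{namespace="apps",volumebackup="pg-data",job_type="backup"} 1.7e+09
bnerd_volumebackup_last_job_success{namespace="apps",volumebackup="pg-data",job_type="backup"} 1
bnerd_volumebackup_last_job_duration_seconds{namespace="apps",volumebackup="pg-data",job_type="backup"} 312
```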

Key Metrics

The most important metrics for day-to-day monitoring are the job outcome gauges, updated every 5 minutes by the reconciliation timer:

Metric                                               Type   Description
bnerd_volumebackup_last_successful_backup_timestamp  Gauge  Unix timestamp of the most recent successful Job
bnerd_volumebackup_last_failed_backup_timestamp      Gauge  Unix timestamp of the most recent failed Job
bnerd_volumebackup_last_job_success                  Gauge  1 if the last Job succeeded, 0 if it failed
bnerd_volumebackup_last_job_duration_seconds         Gauge  Duration of the most recent completed Job

All four metrics carry these labels:

  • namespace -- the VolumeBackup's namespace
  • volumebackup -- the VolumeBackup's name
  • job_type -- one of backup, check, or restore-test

PromQL Examples

Stale Backup Detection

# No successful backup in the last 24 hours
(time() - bnerd_volumebackup_last_successful_backup_timestamp{job_type="backup"}) > 86400

# No successful backup in the last 48 hours (critical)
(time() - bnerd_volumebackup_last_successful_backup_timestamp{job_type="backup"}) > 172800

Broken Backups

# All backups where the last run failed
bnerd_volumebackup_last_job_success{job_type="backup"} == 0

# Check jobs that are failing
bnerd_volumebackup_last_job_success{job_type="check"} == 0

# Restore tests that are failing
bnerd_volumebackup_last_job_success{job_type="restore-test"} == 0
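
The per-volume queries above can also be rolled up fleet-wide. As a sketch using the same gauge:

```
# Number of volumes per namespace whose last backup run failed
count by (namespace) (bnerd_volumebackup_last_job_success{job_type="backup"} == 0)

# Fleet-wide success ratio (1.0 = every volume's last backup succeeded)
avg(bnerd_volumebackup_last_job_success{job_type="backup"})
```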

Backup Duration

# Current backup durations by volume
bnerd_volumebackup_last_job_duration_seconds{job_type="backup"}

# Backups taking longer than 1 hour
bnerd_volumebackup_last_job_duration_seconds{job_type="backup"} > 3600

Operator Health

# Operator uptime
time() - bnerd_backup_operator_start_time_seconds

# Recent restarts
increase(bnerd_backup_operator_restarts_total[1h])

# Error rate by type
sum(rate(bnerd_backup_operator_errors_total[5m])) by (resource_type, operation, error_type)

Restore Monitoring

# Active restores
bnerd_volumerestore_active

# Recent restore failures
rate(bnerd_volumerestore_jobs_failed_total[5m]) > 0

Alerting Rules

Enable via Helm

helm install backup-operator ./charts/backup-operator \
  --set metrics.prometheusRule.enabled=true

Configure thresholds:

helm install backup-operator ./charts/backup-operator \
  --set metrics.prometheusRule.enabled=true \
  --set metrics.prometheusRule.staleThresholdWarning=93600 \
  --set metrics.prometheusRule.staleThresholdCritical=180000
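
The same thresholds can be kept in a values file instead of --set flags. Note the units: 93600 seconds is 26 hours and 180000 seconds is 50 hours, matching the alert reference below.

```yaml
# values.yaml -- equivalent to the --set flags above
metrics:
  prometheusRule:
    enabled: true
    staleThresholdWarning: 93600    # 26 hours
    staleThresholdCritical: 180000  # 50 hours
```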

Standalone Apply

kubectl apply -f manifests/prometheus-rules.yaml

Alert Reference

Alert               Severity  Condition
BackupJobFailed     warning   Last backup job failed
BackupStale         warning   No successful backup in 26 hours
BackupStaleCritical critical  No successful backup in 50 hours
CheckJobFailed      warning   Repository integrity check failed
RestoreTestFailed   warning   Automated restore test failed
BackupSlowJob       info      Backup job took longer than 2 hours
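
For clusters that apply rules outside Helm, a standalone PrometheusRule covering the staleness alert might look like this sketch. The group name, resource name, and annotation wording are assumptions, not the chart's exact output:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-operator-staleness   # assumed name
  namespace: bnerd-backup-system
spec:
  groups:
    - name: bnerd-backup.staleness
      rules:
        - alert: BackupStale
          expr: (time() - bnerd_volumebackup_last_successful_backup_timestamp{job_type="backup"}) > 93600
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "No successful backup of {{ $labels.volumebackup }} in 26 hours"
```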

Custom Rules

Add custom alerting rules alongside the defaults in your Helm values:

metrics:
  prometheusRule:
    enabled: true
    additionalRules:
      - alert: MyCustomAlert
        expr: |
          my_custom_metric > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Custom alert fired"

Grafana Dashboard

A ready-to-import Grafana dashboard is available at examples/grafana-dashboard.json.

Import

  1. In Grafana, go to Dashboards > Import
  2. Upload or paste the JSON file
  3. Select your Prometheus data source
  4. Click Import
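
If your Grafana runs with a dashboard-provisioning sidecar (as in kube-prometheus-stack), the JSON can also be shipped as a labeled ConfigMap instead of imported by hand. A sketch, assuming the sidecar's common default label key `grafana_dashboard` -- confirm the key and watch namespace for your setup:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-operator-dashboard
  namespace: monitoring           # wherever your Grafana sidecar watches
  labels:
    grafana_dashboard: "1"        # default sidecar label; confirm your setup
data:
  backup-operator.json: |
    # paste the contents of examples/grafana-dashboard.json here
```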

The dashboard includes panels for:

  • Backup success/failure overview (stat panel)
  • Time since last successful backup per volume
  • Backup duration trends
  • Check and restore-test status
  • Operator health and error rates
  • Reconciliation performance

Full Metrics Reference

For the complete list of all metrics, see the API Reference overview.