Monitoring and Alerting¶
The operator exposes Prometheus metrics on port 8080 at /metrics, covering operator lifecycle, backup/restore job outcomes, and error tracking. Pre-built alerting rules and a Grafana dashboard are included.
Scraping Setup¶
ServiceMonitor (Prometheus Operator)¶
Enable the ServiceMonitor via Helm. If your Prometheus selects ServiceMonitors by label, add matching labels as well:
helm install backup-operator ./charts/backup-operator \
  --set metrics.serviceMonitor.enabled=true \
  --set metrics.serviceMonitor.labels.prometheus=kube-prometheus
Manual Scrape Config¶
scrape_configs:
  - job_name: 'bnerd-backup-operator'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - bnerd-backup-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
        action: keep
        regex: backup-operator
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
Quick Test¶
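# kubectl port-forward takes a resource name, not a label selector, so look up the pod first
POD=$(kubectl get pod -n bnerd-backup-system \
  -l app.kubernetes.io/name=backup-operator -o name | head -n 1)
kubectl port-forward -n bnerd-backup-system "$POD" 8080:8080
# In a second terminal:
curl -s localhost:8080/metrics | grep bnerd_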
Key Metrics¶
The most important metrics for day-to-day monitoring are the job outcome gauges, updated every 5 minutes by the reconciliation timer:
| Metric | Type | Description |
|---|---|---|
| bnerd_volumebackup_last_successful_backup_timestamp | Gauge | Unix timestamp of most recent successful Job |
| bnerd_volumebackup_last_failed_backup_timestamp | Gauge | Unix timestamp of most recent failed Job |
| bnerd_volumebackup_last_job_success | Gauge | 1 if last Job succeeded, 0 if failed |
| bnerd_volumebackup_last_job_duration_seconds | Gauge | Duration of the most recent completed Job |
All four metrics carry these labels:
- namespace: the VolumeBackup's namespace
- volumebackup: the VolumeBackup's name
- job_type: one of backup, check, or restore-test
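To make the label scheme concrete, here is a minimal sketch that reads metrics text in the Prometheus exposition format from standard input and prints how long ago each VolumeBackup last succeeded. The function name `backup_ages` is a hypothetical helper, not part of the operator, and the sketch assumes each sample line ends with its value (no trailing timestamp).

```shell
# backup_ages NOW: read Prometheus text-format metrics on stdin and print
# "<label set> <age in seconds>" for every job_type="backup" sample of
# bnerd_volumebackup_last_successful_backup_timestamp.
# NOW is a Unix timestamp. Hypothetical helper, not shipped with the operator.
backup_ages() {
  awk -v now="$1" '
    /^bnerd_volumebackup_last_successful_backup_timestamp[{]/ && /job_type="backup"/ {
      # Keep only the text between the braces: the label set.
      labels = $0
      sub(/^[^{]*[{]/, "", labels)
      sub(/[}].*$/, "", labels)
      # Assumes the last whitespace-separated field is the sample value.
      printf "%s %d\n", labels, now - $NF
    }'
}
```

With the port-forward from the quick test running, it could be fed directly from the endpoint: `curl -s localhost:8080/metrics | backup_ages "$(date +%s)"`.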
PromQL Examples¶
Stale Backup Detection¶
# No successful backup in the last 24 hours
(time() - bnerd_volumebackup_last_successful_backup_timestamp{job_type="backup"}) > 86400
# No successful backup in the last 48 hours (critical)
(time() - bnerd_volumebackup_last_successful_backup_timestamp{job_type="backup"}) > 172800
Broken Backups¶
# All backups where the last run failed
bnerd_volumebackup_last_job_success{job_type="backup"} == 0
# Check jobs that are failing
bnerd_volumebackup_last_job_success{job_type="check"} == 0
# Restore tests that are failing
bnerd_volumebackup_last_job_success{job_type="restore-test"} == 0
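When Prometheus itself is unavailable, the same check can be approximated against the raw /metrics output. The following is a rough sketch under that assumption; `failed_jobs` is a hypothetical helper name, and it treats the last field of each sample line as the value.

```shell
# failed_jobs: read Prometheus text-format metrics on stdin and print the
# label set of every bnerd_volumebackup_last_job_success sample whose value
# is 0, across all job types. Hypothetical helper, not shipped with the operator.
failed_jobs() {
  awk '
    /^bnerd_volumebackup_last_job_success[{]/ && $NF == 0 {
      # Keep only the text between the braces: the label set.
      labels = $0
      sub(/^[^{]*[{]/, "", labels)
      sub(/[}].*$/, "", labels)
      print labels
    }'
}
```

A possible invocation is `curl -s localhost:8080/metrics | failed_jobs`; empty output means no sample currently reports a failed last run.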
Backup Duration¶
# Current backup durations by volume
bnerd_volumebackup_last_job_duration_seconds{job_type="backup"}
# Backups taking longer than 1 hour
bnerd_volumebackup_last_job_duration_seconds{job_type="backup"} > 3600
Operator Health¶
# Operator uptime
time() - bnerd_backup_operator_start_time_seconds
# Recent restarts
increase(bnerd_backup_operator_restarts_total[1h])
# Error rate by type
sum(rate(bnerd_backup_operator_errors_total[5m])) by (resource_type, operation, error_type)
Restore Monitoring¶
# Active restores
bnerd_volumerestore_active
# Recent restore failures
rate(bnerd_volumerestore_jobs_failed_total[5m]) > 0
Alerting Rules¶
Enable via Helm¶
Enable the bundled rules and configure the staleness thresholds (values are in seconds):
helm install backup-operator ./charts/backup-operator \
  --set metrics.prometheusRule.enabled=true \
  --set metrics.prometheusRule.staleThresholdWarning=93600 \
  --set metrics.prometheusRule.staleThresholdCritical=180000
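Read as seconds, the threshold values above work out to 26 hours (93600) and 50 hours (180000), which lines up with the windows listed for BackupStale and BackupStaleCritical in the alert reference. A small sketch for deriving other values; `hours_to_seconds` is an illustrative helper, not something the chart provides:

```shell
# hours_to_seconds HOURS: print HOURS converted to seconds.
# Hypothetical helper for computing threshold values.
hours_to_seconds() {
  echo $(( $1 * 3600 ))
}

hours_to_seconds 26   # prints 93600
hours_to_seconds 50   # prints 180000
```

The result can be substituted into the Helm flags, for example `--set metrics.prometheusRule.staleThresholdWarning="$(hours_to_seconds 26)"`.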
Standalone Apply¶
Alert Reference¶
| Alert | Severity | Condition |
|---|---|---|
| BackupJobFailed | warning | Last backup job failed |
| BackupStale | warning | No successful backup in 26 hours |
| BackupStaleCritical | critical | No successful backup in 50 hours |
| CheckJobFailed | warning | Repository integrity check failed |
| RestoreTestFailed | warning | Automated restore test failed |
| BackupSlowJob | info | Backup job took longer than 2 hours |
Custom Rules¶
Add custom alerting rules alongside the defaults in your Helm values:
metrics:
  prometheusRule:
    enabled: true
    additionalRules:
      - alert: MyCustomAlert
        expr: |
          my_custom_metric > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Custom alert fired"
Grafana Dashboard¶
A ready-to-import Grafana dashboard is available at examples/grafana-dashboard.json.
Import¶
- In Grafana, go to Dashboards > Import
- Upload or paste the JSON file
- Select your Prometheus data source
- Click Import
The dashboard includes panels for:
- Backup success/failure overview (stat panel)
- Time since last successful backup per volume
- Backup duration trends
- Check and restore-test status
- Operator health and error rates
- Reconciliation performance
Full Metrics Reference¶
For the complete list of all metrics, see the API Reference overview.