Production Kubernetes Observability: Prometheus, Grafana, and SLO Engineering
Alert fatigue kills on-call engineers. After building observability stacks at Revantage Asia for 200+ cloud resources, I've refined a pattern that gives you signal, not noise.
The Four Golden Signals (Start Here)
Every service gets these four dashboards, no exceptions:
- Latency — p50, p95, p99 request duration
- Traffic — requests per second (RPS)
- Errors — 4xx and 5xx error rates
- Saturation — CPU, memory, connection pool utilization
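Assuming a typical HTTP instrumentation (an `http_requests_total` counter and an `http_request_duration_seconds` histogram; names vary by client library), the four signals map to PromQL roughly like this:

```promql
# Latency — p99 request duration over 5m
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Traffic — requests per second
sum(rate(http_requests_total[5m]))

# Errors — 5xx ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Saturation — e.g. container CPU usage vs its limit (needs cAdvisor
# and kube-state-metrics, both shipped with kube-prometheus-stack)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="production"}[5m]))
/
sum by (pod) (kube_pod_container_resource_limits{resource="cpu", namespace="production"})
```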
Deploying Prometheus Operator via Helm
# Add the Prometheus community Helm repo
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes Grafana and Alertmanager)
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values values-monitoring.yaml \
  --version 55.5.0
# values-monitoring.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-premium
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    # Scrape all ServiceMonitors/PodMonitors in the cluster,
    # not just those created by this Helm release
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 8Gi
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
  persistence:
    enabled: true
    size: 10Gi
  grafana.ini:
    server:
      domain: grafana.internal.company.com
    auth.azuread:  # Grafana's Azure AD section is [auth.azuread]
      enabled: true
      allow_sign_up: true
      client_id: "${AZURE_CLIENT_ID}"
      client_secret: "${AZURE_CLIENT_SECRET}"
      scopes: openid email profile
      auth_url: "https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/authorize"
      token_url: "https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token"
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-premium
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
ServiceMonitor: Scraping Your Apps
# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
  namespace: production
  labels:
    release: monitoring  # Must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api-service
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics   # Named port on the Service
      interval: 30s
      path: /metrics
      scheme: http    # tlsConfig only applies with scheme: https
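For reference, the ServiceMonitor selects a Service by label and scrapes a *named* port, so the application's Service needs both. A hypothetical manifest for the api-service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
  labels:
    app: api-service     # Matched by the ServiceMonitor's selector
spec:
  selector:
    app: api-service
  ports:
    - name: metrics      # Port *name* referenced by the ServiceMonitor
      port: 9090
      targetPort: 9090
```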
SLO Engineering: Multi-Window Burn Rate
Multi-window burn-rate alerting is the real game-changer versus naive threshold alerts.
SLO: 99.9% availability over 30 days = 43.2 minutes of allowed downtime per month
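A sketch of the budget arithmetic behind those numbers: a 14.4x burn rate is chosen so that a sustained fast burn exhausts the entire monthly budget in about two days, which justifies paging.

```python
# Error-budget arithmetic for a 99.9% availability SLO over 30 days.
SLO = 0.999                       # 99.9% availability target
WINDOW_HOURS = 30 * 24            # 30-day rolling window

# Total allowed downtime ("error budget") in minutes
budget_minutes = (1 - SLO) * WINDOW_HOURS * 60
print(round(budget_minutes, 1))   # → 43.2

def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole budget is gone at a sustained burn rate."""
    return WINDOW_HOURS / burn_rate

print(round(hours_to_exhaustion(14.4), 1))  # → 50.0 (fast burn: ~2 days)
print(round(hours_to_exhaustion(6.0), 1))   # → 120.0 (slow burn: 5 days)
```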
# slo-alerts.yaml — PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-service-slo
  namespace: monitoring
  labels:
    release: monitoring  # Must match the Prometheus ruleSelector
spec:
  groups:
    - name: api-service.slo.rules
      rules:
        # Error-ratio recording rules. Aggregate with sum by (job) so the
        # division matches on identical label sets — a raw rate keeps the
        # status label and the ratio would never pair up correctly.
        - record: job:http_requests_total:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
        - record: job:http_errors_total:rate5m
          expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
        - record: job:http_error_ratio:rate5m
          expr: |
            job:http_errors_total:rate5m
            /
            job:http_requests_total:rate5m

        # SLO: 99.9% (error budget = 0.1%)
        # Fast burn: 1h long window + 5m short window, 14.4x budget burn rate
        - alert: APIServiceHighErrorBurnRate
          expr: |
            (
              job:http_error_ratio:rate5m > (14.4 * 0.001)
            )
            and
            (
              sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
              /
              sum by (job) (rate(http_requests_total[1h]))
              > (14.4 * 0.001)
            )
          for: 2m
          labels:
            severity: critical
            team: platform
            slo: api-availability
          annotations:
            summary: "High error burn rate: consuming error budget 14.4x faster than allowed"
            description: |
              Current error rate: {{ $value | humanizePercentage }}
              At this rate, the monthly error budget is exhausted in ~2 days.
              Runbook: https://wiki.internal/runbooks/api-high-error-rate

        # Slow burn: 6h long window + 5m short window, 6x budget burn rate
        - alert: APIServiceMediumErrorBurnRate
          expr: |
            (
              job:http_error_ratio:rate5m > (6 * 0.001)
            )
            and
            (
              sum by (job) (rate(http_requests_total{status=~"5.."}[6h]))
              /
              sum by (job) (rate(http_requests_total[6h]))
              > (6 * 0.001)
            )
          for: 15m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Elevated error rate: consuming budget 6x faster than expected"
Alertmanager: Routing Without Fatigue
# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: platform-alerts
  namespace: monitoring
spec:
  route:
    receiver: 'null'
    groupBy: ['alertname', 'namespace']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    routes:
      # Critical: page immediately
      - receiver: pagerduty-critical
        matchers:                # The CRD uses matchers, not match
          - name: severity
            value: critical
        repeatInterval: 30m
        continue: true
      # Warning: Slack only
      - receiver: slack-warnings
        matchers:
          - name: severity
            value: warning
        repeatInterval: 2h
      # Informational: no notification (record only)
      - receiver: 'null'
        matchers:
          - name: severity
            value: info
  receivers:
    - name: 'null'
    - name: pagerduty-critical
      pagerdutyConfigs:
        # routingKey is a secret reference in the CRD, not a plain string
        - routingKey:
            name: pagerduty-credentials  # Secret in the same namespace
            key: routing-key
          description: '{{ template "pagerduty.description" . }}'
          severity: '{{ if eq .Labels.severity "critical" }}critical{{ else }}warning{{ end }}'
    - name: slack-warnings
      slackConfigs:
        # apiURL is likewise a secret reference
        - apiURL:
            name: slack-credentials      # Secret in the same namespace
            key: webhook-url
          channel: '#platform-alerts'
          title: '{{ template "slack.title" . }}'
          text: '{{ template "slack.text" . }}'
          iconEmoji: ':prometheus:'
          sendResolved: true
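Inhibition pairs well with this routing: while a critical alert is firing for a service, related warnings stay quiet instead of piling on. A sketch using the CRD's inhibitRules field, added under the same spec:

```yaml
# Appended under the AlertmanagerConfig spec above
inhibitRules:
  - sourceMatch:
      - name: severity
        value: critical
    targetMatch:
      - name: severity
        value: warning
    equal: ['alertname', 'namespace']  # Only inhibit for the same alert/namespace
```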
Key Grafana Dashboards to Build
1. Service Health Overview
- RPS, error rate, p99 latency — single pane
- SLO burn rate gauge (remaining budget)
- Deployment markers (correlated with error spikes)
2. Kubernetes Resource Saturation
- Node CPU/memory heatmap by node
- Pod restart frequency (canary for memory leaks)
- PVC usage trending with 7-day forecast
3. Cost per Namespace
- Resource requests vs actual usage (spot waste)
- Idle pods (requests ≠ usage for 24h+)
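The cost dashboard is driven by kube-state-metrics (requests) and cAdvisor (usage), both shipped with kube-prometheus-stack; the request-vs-usage gap per namespace is roughly:

```promql
# CPU cores requested but not actually used, per namespace (5m average)
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
-
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```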
Lessons from Production
- Alert on burn rate, never raw thresholds — "CPU > 80%" is meaningless, "SLO burn rate 10x" is actionable
- Group alerts aggressively — 50 alerts from one outage = 1 PagerDuty notification, not 50
- Set repeatInterval: 4h as a minimum — alerts repeating every 30 seconds cause engineers to mute everything
- Document runbooks in alert annotations — the engineer woken at 3am needs the link immediately
- Test your alerting quarterly — chaos engineer your alert pipeline, not just your app
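For the quarterly test, `promtool test rules` can replay synthetic series against the SLO rules above. A sketch, assuming the `groups:` section has been extracted from the PrometheusRule CRD into a plain rule file (file names here are hypothetical):

```yaml
# slo-alerts-test.yaml — run with: promtool test rules slo-alerts-test.yaml
# Note: promtool tests plain Prometheus rule files, so copy the groups
# out of the PrometheusRule into slo-rules.yaml first.
rule_files:
  - slo-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # ~9% error rate — well above the 14.4x fast-burn threshold (1.44%)
      - series: 'http_requests_total{job="api-service", status="200"}'
        values: '0+100x120'   # +100 requests per minute
      - series: 'http_requests_total{job="api-service", status="500"}'
        values: '0+10x120'    # +10 errors per minute
    alert_rule_test:
      - eval_time: 1h
        alertname: APIServiceHighErrorBurnRate
        exp_alerts:
          - exp_labels:
              job: api-service
              severity: critical
              team: platform
              slo: api-availability
```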
Full Helm values and dashboards: github.com/suhail39ahmed/kubernetes-observability-stack