Site Reliability Engineers face a fundamental challenge: monitoring complex distributed systems without drowning in metrics noise. Google’s Four Golden Signals provide a battle-tested framework for focusing on what matters most for service reliability. In this guide, we’ll walk through practical implementations using Prometheus, Grafana, and Datadog, complete with production-ready configurations and real-world examples.
Prerequisites
Before diving into implementations, ensure you have:
- Basic understanding of Kubernetes and container orchestration
- Familiarity with Prometheus metrics and PromQL queries
- Access to a Kubernetes cluster (kind, minikube, or cloud-managed)
- Basic knowledge of HTTP status codes and API design principles
Estimated implementation time: 2-4 hours depending on your existing monitoring setup.
Understanding the Four Golden Signals
The Four Golden Signals, introduced in Google’s SRE book, represent the minimum viable monitoring for any user-facing system:
1. Latency
Definition: The time it takes to service a request, with important distinction between successful and failed requests.
Why it matters: High latency directly impacts user experience. Amazon famously reported that every 100ms of added latency cost roughly 1% in sales.
Key considerations:
- Measure both successful and error response latencies separately
- Focus on percentiles (P50, P95, P99) rather than averages
- Different request types may have different latency expectations
2. Traffic
Definition: A measure of how much demand is being placed on your system, typically measured in HTTP requests per second.
Why it matters: Understanding traffic patterns helps with capacity planning and identifying unusual behavior patterns.
Key considerations:
- Measure by request type, endpoint, or service
- Consider both inbound and outbound traffic for microservices
- Account for different traffic patterns (batch vs. real-time)
3. Errors
Definition: The rate of requests that fail, either explicitly (HTTP 500s) or implicitly (HTTP 200 with wrong content).
Why it matters: Error rates directly correlate with user satisfaction and can indicate underlying system issues.
Key considerations:
- Distinguish between client errors (4xx) and server errors (5xx)
- Include application-level errors, not just HTTP errors
- Consider partial failures in distributed systems
4. Saturation
Definition: How “full” your service is, measuring the utilization of your most constrained resource.
Why it matters: Saturation often predicts performance problems before they manifest as latency or errors.
Key considerations:
- Identify your bottleneck resource (CPU, memory, disk I/O, network)
- Set thresholds before performance degrades (typically 80-85%)
- Consider both current utilization and growth trends
Architecture Overview
Our implementation uses a modern observability stack:
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Application   │───▶│    Prometheus    │───▶│     Grafana     │
│    (Metrics)    │    │   (Collection)   │    │ (Visualization) │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         │                      ▼                       │
         │             ┌──────────────────┐             │
         │             │   AlertManager   │             │
         │             │    (Alerting)    │             │
         │             └──────────────────┘             │
         ▼                                              ▼
┌─────────────────┐                           ┌─────────────────┐
│   Application   │                           │   Runbooks &    │
│      Logs       │                           │   Dashboards    │
└─────────────────┘                           └─────────────────┘
Prometheus Implementation
Basic Metrics Collection
First, let’s set up Prometheus to collect the Four Golden Signals. Here’s our complete Prometheus configuration:
# prometheus-config.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerting-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  - job_name: 'application-metrics'
    static_configs:
      - targets: ['your-app:8080']
    metrics_path: '/metrics'
    scrape_interval: 5s
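The rule_files entry above points at alerting-rules.yml, which this guide doesn’t show in full. Here is one possible minimal version — the expressions rely on the recording rules defined in the next section, and the thresholds and `for` durations are illustrative; tune them to your SLOs.

```yaml
# alerting-rules.yml (sketch — thresholds are illustrative)
groups:
  - name: golden_signals_alerts
    rules:
      - alert: HighErrorRate
        expr: http_requests:error_rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% on {{ $labels.instance }}"
      - alert: HighLatencyP99
        expr: http_request_duration_seconds:p99 > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 500ms on {{ $labels.instance }}"
      - alert: HighCPUSaturation
        expr: node_cpu:utilization > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU utilization above 85% on {{ $labels.instance }}"
```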
Recording Rules for Golden Signals
Create recording rules to pre-compute common Golden Signal queries:
# recording-rules.yml
groups:
  - name: golden_signals
    interval: 30s
    rules:
      # Latency Rules
      - record: http_request_duration_seconds:p50
        expr: histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, instance, method, le))
      - record: http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, instance, method, le))
      - record: http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, instance, method, le))

      # Traffic Rules
      - record: http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, instance, method, status)
      - record: http_requests:rate1h
        expr: sum(rate(http_requests_total[1h])) by (job, instance, method, status)

      # Error Rate Rules
      - record: http_requests:error_rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, instance) / sum(rate(http_requests_total[5m])) by (job, instance)
      - record: http_requests:error_rate1h
        expr: sum(rate(http_requests_total{status=~"5.."}[1h])) by (job, instance) / sum(rate(http_requests_total[1h])) by (job, instance)

      # Saturation Rules
      - record: node_cpu:utilization
        expr: 1 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
      - record: node_memory:utilization
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      - record: node_filesystem:utilization
        expr: 1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})
Application Metrics Integration
For your applications to expose the Four Golden Signals, instrument them in code. Here’s a Go example using the Prometheus client library:
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Latency - HTTP request duration histogram
	httpDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint", "status_code"},
	)

	// Traffic - HTTP request counter
	httpRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status_code"},
	)

	// Saturation - In-flight requests gauge
	httpInFlight = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "http_requests_in_flight",
			Help: "Current number of HTTP requests being processed",
		},
	)
)

func init() {
	prometheus.MustRegister(httpDuration, httpRequests, httpInFlight)
}

func instrumentHandler(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		httpInFlight.Inc()
		defer httpInFlight.Dec()

		// Capture response status
		rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
		next(rw, r)

		duration := time.Since(start).Seconds()
		status := strconv.Itoa(rw.statusCode)
		httpDuration.WithLabelValues(r.Method, r.URL.Path, status).Observe(duration)
		httpRequests.WithLabelValues(r.Method, r.URL.Path, status).Inc()
	}
}

type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func main() {
	// Expose the registered metrics for the Prometheus scrape job.
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/", instrumentHandler(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.ListenAndServe(":8080", nil)
}
Grafana Dashboard Implementation
Create a comprehensive Grafana dashboard that visualizes all four Golden Signals. The dashboard JSON is available in the accompanying grafana-dashboard.json file, but here are the key panels:
Latency Panel Configuration
{
  "title": "Request Latency Percentiles",
  "type": "graph",
  "targets": [
    {
      "expr": "http_request_duration_seconds:p50",
      "legendFormat": "p50 - {{instance}}",
      "refId": "A"
    },
    {
      "expr": "http_request_duration_seconds:p95",
      "legendFormat": "p95 - {{instance}}",
      "refId": "B"
    },
    {
      "expr": "http_request_duration_seconds:p99",
      "legendFormat": "p99 - {{instance}}",
      "refId": "C"
    }
  ],
  "yAxes": [{
    "label": "Seconds",
    "logBase": 1,
    "min": 0
  }],
  "alert": {
    "conditions": [{
      "query": {"queryType": "", "refId": "B"},
      "reducer": {"type": "last", "params": []},
      "evaluator": {"params": [0.5], "type": "gt"}
    }],
    "executionErrorState": "alerting",
    "frequency": "10s",
    "handler": 1,
    "name": "High Latency Alert",
    "noDataState": "no_data"
  }
}
Traffic Panel Configuration
{
  "title": "Request Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(http_requests:rate5m) by (instance)",
      "legendFormat": "RPS - {{instance}}"
    }
  ],
  "yAxes": [{
    "label": "Requests/sec",
    "min": 0
  }]
}
Error Rate Panel Configuration
{
  "title": "Error Rate",
  "type": "singlestat",
  "targets": [
    {
      "expr": "sum(http_requests:error_rate5m)",
      "legendFormat": "Error Rate"
    }
  ],
  "valueMaps": [
    {"value": "null", "text": "N/A"}
  ],
  "thresholds": "0.01,0.05",
  "colorBackground": true,
  "format": "percentunit"
}
Saturation Panel Configuration
{
  "title": "Resource Utilization",
  "type": "graph",
  "targets": [
    {
      "expr": "node_cpu:utilization",
      "legendFormat": "CPU - {{instance}}"
    },
    {
      "expr": "node_memory:utilization",
      "legendFormat": "Memory - {{instance}}"
    },
    {
      "expr": "node_filesystem:utilization",
      "legendFormat": "Disk - {{instance}}"
    }
  ],
  "yAxes": [{
    "label": "Utilization %",
    "max": 1,
    "min": 0
  }],
  "thresholds": [{
    "value": 0.8,
    "colorMode": "critical",
    "line": true,
    "fill": true
  }]
}
Datadog Implementation
For teams using Datadog, here’s how to implement the Four Golden Signals using the Datadog Agent and API:
Datadog Agent Configuration
# datadog.yaml
api_key: YOUR_API_KEY
site: datadoghq.com
logs_enabled: true

process_config:
  enabled: true

apm_config:
  enabled: true
  env: production

# Enable Kubernetes integration
kubernetes_kubelet_host: ${DD_KUBERNETES_KUBELET_HOST}
kubernetes_http_kubelet_port: 10255
kubernetes_https_kubelet_port: 10250

# Custom metrics collection
dogstatsd_config:
  enabled: true
  bind_host: 0.0.0.0
  port: 8125
  non_local_traffic: true