Site Reliability Engineers face a fundamental challenge: monitoring complex distributed systems without drowning in metrics noise. Google’s Four Golden Signals provide a battle-tested framework for focusing on what matters most for service reliability. In this guide, we’ll walk through practical implementations using Prometheus, Grafana, and Datadog, complete with production-ready configurations and real-world examples.
Prerequisites
Before diving into implementations, ensure you have:
- Basic understanding of Kubernetes and container orchestration
- Familiarity with Prometheus metrics and PromQL queries
- Access to a Kubernetes cluster (kind, minikube, or cloud-managed)
- Basic knowledge of HTTP status codes and API design principles
Estimated implementation time: 2-4 hours depending on your existing monitoring setup.
Understanding the Four Golden Signals
The Four Golden Signals, introduced in Google’s SRE book, represent the minimum viable monitoring for any user-facing system:
1. Latency
Definition: The time it takes to service a request, with important distinction between successful and failed requests.
Why it matters: High latency directly impacts user experience. Amazon famously reported that every 100ms of added latency cost roughly 1% in sales.
Key considerations:
- Measure both successful and error response latencies separately
- Focus on percentiles (P50, P95, P99) rather than averages
- Different request types may have different latency expectations
2. Traffic
Definition: A measure of how much demand is being placed on your system, typically measured in HTTP requests per second.
Why it matters: Understanding traffic patterns helps with capacity planning and identifying unusual behavior patterns.
Key considerations:
- Measure by request type, endpoint, or service
- Consider both inbound and outbound traffic for microservices
- Account for different traffic patterns (batch vs. real-time)
3. Errors
Definition: The rate of requests that fail, either explicitly (HTTP 500s) or implicitly (HTTP 200 with wrong content).
Why it matters: Error rates directly correlate with user satisfaction and can indicate underlying system issues.
Key considerations:
- Distinguish between client errors (4xx) and server errors (5xx)
- Include application-level errors, not just HTTP errors
- Consider partial failures in distributed systems
4. Saturation
Definition: How “full” your service is, measuring the utilization of your most constrained resource.
Why it matters: Saturation often predicts performance problems before they manifest as latency or errors.
Key considerations:
- Identify your bottleneck resource (CPU, memory, disk I/O, network)
- Set thresholds before performance degrades (typically 80-85%)
- Consider both current utilization and growth trends
Architecture Overview
Our implementation uses a modern observability stack:
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Application   │───▶│    Prometheus    │───▶│     Grafana     │
│    (Metrics)    │    │   (Collection)   │    │ (Visualization) │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         │                      ▼                       │
         │             ┌──────────────────┐             │
         │             │   AlertManager   │             │
         │             │    (Alerting)    │             │
         │             └──────────────────┘             │
         ▼                                              ▼
┌─────────────────┐                           ┌─────────────────┐
│   Application   │                           │   Runbooks &    │
│      Logs       │                           │   Dashboards    │
└─────────────────┘                           └─────────────────┘
Prometheus Implementation
Basic Metrics Collection
First, let’s set up Prometheus to collect the Four Golden Signals. Here’s our complete Prometheus configuration:
# prometheus-config.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerting-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  - job_name: 'application-metrics'
    static_configs:
      - targets: ['your-app:8080']
    metrics_path: '/metrics'
    scrape_interval: 5s
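The rule_files entry above points at alerting-rules.yml, which this guide doesn’t show in full. Here is one possible minimal version — the expressions rely on the recording rules defined in the next section, and the thresholds and `for` durations are illustrative; tune them to your SLOs.

```yaml
# alerting-rules.yml (sketch — thresholds are illustrative)
groups:
  - name: golden_signals_alerts
    rules:
      - alert: HighErrorRate
        expr: http_requests:error_rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% on {{ $labels.instance }}"
      - alert: HighLatencyP99
        expr: http_request_duration_seconds:p99 > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 500ms on {{ $labels.instance }}"
      - alert: HighCPUSaturation
        expr: node_cpu:utilization > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU utilization above 85% on {{ $labels.instance }}"
```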
Recording Rules for Golden Signals
Create recording rules to pre-compute common Golden Signal queries:
# recording-rules.yml
groups:
  - name: golden_signals
    interval: 30s
    rules:
      # Latency Rules
      - record: http_request_duration_seconds:p50
        expr: histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, instance, method, le))
      - record: http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, instance, method, le))
      - record: http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, instance, method, le))

      # Traffic Rules
      - record: http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, instance, method, status)
      - record: http_requests:rate1h
        expr: sum(rate(http_requests_total[1h])) by (job, instance, method, status)

      # Error Rate Rules
      - record: http_requests:error_rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, instance) / sum(rate(http_requests_total[5m])) by (job, instance)
      - record: http_requests:error_rate1h
        expr: sum(rate(http_requests_total{status=~"5.."}[1h])) by (job, instance) / sum(rate(http_requests_total[1h])) by (job, instance)

      # Saturation Rules
      - record: node_cpu:utilization
        expr: 1 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
      - record: node_memory:utilization
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      - record: node_filesystem:utilization
        expr: 1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})
Application Metrics Integration
For your applications to expose the Four Golden Signals, instrument them in code. Here’s a Go example using the Prometheus client library:
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Latency - HTTP request duration histogram
	httpDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint", "status_code"},
	)

	// Traffic - HTTP request counter
	httpRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status_code"},
	)

	// Saturation - In-flight requests gauge
	httpInFlight = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "http_requests_in_flight",
			Help: "Current number of HTTP requests being processed",
		},
	)
)

func init() {
	prometheus.MustRegister(httpDuration, httpRequests, httpInFlight)
}

func instrumentHandler(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		httpInFlight.Inc()
		defer httpInFlight.Dec()

		// Capture response status
		rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
		next(rw, r)

		duration := time.Since(start).Seconds()
		status := strconv.Itoa(rw.statusCode)
		httpDuration.WithLabelValues(r.Method, r.URL.Path, status).Observe(duration)
		httpRequests.WithLabelValues(r.Method, r.URL.Path, status).Inc()
	}
}

type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func main() {
	// Expose the registered metrics for the Prometheus scrape job.
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/", instrumentHandler(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.ListenAndServe(":8080", nil)
}
Grafana Dashboard Implementation
Create a comprehensive Grafana dashboard that visualizes all four Golden Signals. The dashboard JSON is available in the accompanying grafana-dashboard.json file, but here are the key panels:
Latency Panel Configuration
{
  "title": "Request Latency Percentiles",
  "type": "graph",
  "targets": [
    {
      "expr": "http_request_duration_seconds:p50",
      "legendFormat": "p50 - {{instance}}",
      "refId": "A"
    },
    {
      "expr": "http_request_duration_seconds:p95",
      "legendFormat": "p95 - {{instance}}",
      "refId": "B"
    },
    {
      "expr": "http_request_duration_seconds:p99",
      "legendFormat": "p99 - {{instance}}",
      "refId": "C"
    }
  ],
  "yAxes": [{
    "label": "Seconds",
    "logBase": 1,
    "min": 0
  }],
  "alert": {
    "conditions": [{
      "query": {"queryType": "", "refId": "B"},
      "reducer": {"type": "last", "params": []},
      "evaluator": {"params": [0.5], "type": "gt"}
    }],
    "executionErrorState": "alerting",
    "frequency": "10s",
    "handler": 1,
    "name": "High Latency Alert",
    "noDataState": "no_data"
  }
}
Traffic Panel Configuration
{
  "title": "Request Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(http_requests:rate5m) by (instance)",
      "legendFormat": "RPS - {{instance}}"
    }
  ],
  "yAxes": [{
    "label": "Requests/sec",
    "min": 0
  }]
}
Error Rate Panel Configuration
{
  "title": "Error Rate",
  "type": "singlestat",
  "targets": [
    {
      "expr": "sum(http_requests:error_rate5m)",
      "legendFormat": "Error Rate"
    }
  ],
  "valueMaps": [
    {"value": "null", "text": "N/A"}
  ],
  "thresholds": "0.01,0.05",
  "colorBackground": true,
  "format": "percentunit"
}
Saturation Panel Configuration
{
  "title": "Resource Utilization",
  "type": "graph",
  "targets": [
    {
      "expr": "node_cpu:utilization",
      "legendFormat": "CPU - {{instance}}"
    },
    {
      "expr": "node_memory:utilization",
      "legendFormat": "Memory - {{instance}}"
    },
    {
      "expr": "node_filesystem:utilization",
      "legendFormat": "Disk - {{instance}}"
    }
  ],
  "yAxes": [{
    "label": "Utilization %",
    "max": 1,
    "min": 0
  }],
  "thresholds": [{
    "value": 0.8,
    "colorMode": "critical",
    "line": true,
    "fill": true
  }]
}
Datadog Implementation
For teams using Datadog, here’s how to implement the Four Golden Signals using the Datadog Agent and API:
Datadog Agent Configuration
# datadog.yaml
api_key: YOUR_API_KEY
site: datadoghq.com
logs_enabled: true

process_config:
  enabled: true

apm_config:
  enabled: true
  env: production

# Enable Kubernetes integration
kubernetes_kubelet_host: ${DD_KUBERNETES_KUBELET_HOST}
kubernetes_http_kubelet_port: 10255
kubernetes_https_kubelet_port: 10250

# Custom metrics collection
dogstatsd_config:
  enabled: true
  bind_host: 0.0.0.0
  port: 8125
  non_local_traffic: true