Site Reliability Engineers face a fundamental challenge: monitoring complex distributed systems without drowning in metrics noise. Google’s Four Golden Signals provide a battle-tested framework for focusing on what matters most for service reliability. In this guide, we’ll walk through practical implementations using Prometheus, Grafana, and Datadog, complete with production-ready configurations and real-world examples.

Prerequisites

Before diving into implementations, ensure you have:

  • Basic understanding of Kubernetes and container orchestration
  • Familiarity with Prometheus metrics and PromQL queries
  • Access to a Kubernetes cluster (kind, minikube, or cloud-managed)
  • Basic knowledge of HTTP status codes and API design principles

Estimated implementation time: 2-4 hours depending on your existing monitoring setup.

Understanding the Four Golden Signals

The Four Golden Signals, introduced in Google’s SRE book, represent the minimum viable monitoring for any user-facing system:

1. Latency

Definition: The time it takes to service a request, with an important distinction between the latency of successful and failed requests.

Why it matters: High latency directly degrades user experience. A frequently cited e-commerce finding put the cost of an extra 100ms of latency at roughly a 1% drop in sales.

Key considerations:

  • Measure both successful and error response latencies separately
  • Focus on percentiles (P50, P95, P99) rather than averages
  • Different request types may have different latency expectations
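
The successful-versus-failed split can be expressed directly in PromQL. A sketch, assuming the request-duration histogram carries a status label (as the recording rules later in this guide assume):

```promql
# P95 latency of successful requests only
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status=~"2.."}[5m])) by (le))

# P95 latency of failed requests, which is often wildly different
# (instant fail-fast errors versus slow timeouts)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le))
```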

2. Traffic

Definition: A measure of how much demand is being placed on your system, typically measured in HTTP requests per second.

Why it matters: Understanding traffic patterns helps with capacity planning and identifying unusual behavior patterns.

Key considerations:

  • Measure by request type, endpoint, or service
  • Consider both inbound and outbound traffic for microservices
  • Account for different traffic patterns (batch vs. real-time)
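
Traffic anomalies often stand out most clearly against the same time window in a previous week. A PromQL sketch, assuming the `http_requests_total` counter used throughout this guide:

```promql
# Ratio of the current request rate to the same 5-minute window one week
# ago; values far from 1.0 flag unusual traffic worth investigating
sum(rate(http_requests_total[5m]))
  /
sum(rate(http_requests_total[5m] offset 1w))
```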

3. Errors

Definition: The rate of requests that fail, either explicitly (e.g., an HTTP 500) or implicitly (e.g., an HTTP 200 response with the wrong content).

Why it matters: Error rates directly correlate with user satisfaction and can indicate underlying system issues.

Key considerations:

  • Distinguish between client errors (4xx) and server errors (5xx)
  • Include application-level errors, not just HTTP errors
  • Consider partial failures in distributed systems
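
The 4xx/5xx distinction maps directly onto PromQL label selectors. A sketch, assuming the `status` label used by the recording rules later in this guide:

```promql
# Server-error ratio: counts against your reliability targets
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Client-error ratio: tracked separately, since a 4xx spike usually
# points at a misbehaving client rather than at your service
sum(rate(http_requests_total{status=~"4.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```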

4. Saturation

Definition: How “full” your service is, measuring the utilization of your most constrained resource.

Why it matters: Saturation often predicts performance problems before they manifest as latency or errors.

Key considerations:

  • Identify your bottleneck resource (CPU, memory, disk I/O, network)
  • Set thresholds before performance degrades (typically 80-85%)
  • Consider both current utilization and growth trends
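
Growth trends can be monitored directly with `predict_linear`. A sketch using node_exporter’s filesystem metrics (the same family the recording rules below draw on):

```promql
# Fires if, extrapolating the last 6 hours of growth, the filesystem
# would run out of space within the next 4 hours
predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
```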

Architecture Overview

Our implementation uses a modern observability stack:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Application   │───▶│   Prometheus     │───▶│    Grafana      │
│   (Metrics)     │    │   (Collection)   │    │ (Visualization) │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         │                       ▼                       │
         │              ┌──────────────────┐             │
         │              │  AlertManager    │             │
         │              │  (Alerting)      │             │
         │              └──────────────────┘             │
         │                                               │
         ▼                                               ▼
┌─────────────────┐                            ┌─────────────────┐
│   Application   │                            │   Runbooks &    │
│     Logs        │                            │   Dashboards    │
└─────────────────┘                            └─────────────────┘

Prometheus Implementation

Basic Metrics Collection

First, let’s set up Prometheus to collect the Four Golden Signals. Here’s our complete Prometheus configuration:

# prometheus-config.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "recording-rules.yml"
  - "alerting-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
    - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
    - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name

  - job_name: 'application-metrics'
    static_configs:
    - targets: ['your-app:8080']
    metrics_path: '/metrics'
    scrape_interval: 5s
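
Note that the `kubernetes-pods` job only scrapes pods that opt in through annotations. A sketch of the matching pod spec fragment (the image name is a placeholder; the annotation keys follow the `prometheus.io/*` convention the relabel rules above key off):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: your-app
  annotations:
    prometheus.io/scrape: "true"    # matched by the 'keep' relabel rule
    prometheus.io/path: "/metrics"  # rewrites __metrics_path__
    prometheus.io/port: "8080"      # rewrites the scrape address
spec:
  containers:
    - name: your-app
      image: your-app:latest  # placeholder
      ports:
        - containerPort: 8080
```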

Recording Rules for Golden Signals

Create recording rules to pre-compute common Golden Signal queries:

# recording-rules.yml
groups:
  - name: golden_signals
    interval: 30s
    rules:
      # Latency Rules
      - record: http_request_duration_seconds:p50
        expr: histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, instance, method, le))
      
      - record: http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, instance, method, le))
      
      - record: http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, instance, method, le))

      # Traffic Rules
      - record: http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, instance, method, status)
      
      - record: http_requests:rate1h
        expr: sum(rate(http_requests_total[1h])) by (job, instance, method, status)

      # Error Rate Rules
      - record: http_requests:error_rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, instance) / sum(rate(http_requests_total[5m])) by (job, instance)
      
      - record: http_requests:error_rate1h
        expr: sum(rate(http_requests_total{status=~"5.."}[1h])) by (job, instance) / sum(rate(http_requests_total[1h])) by (job, instance)

      # Saturation Rules
      - record: node_cpu:utilization
        expr: 1 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
      
      - record: node_memory:utilization
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      
      - record: node_filesystem:utilization
        expr: 1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})
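
The Prometheus configuration above loads `alerting-rules.yml`, which we have not defined yet. A minimal sketch built on the recording rules just defined; the thresholds are illustrative and should be tuned to your own SLOs:

```yaml
# alerting-rules.yml
groups:
  - name: golden_signal_alerts
    rules:
      - alert: HighP99Latency
        expr: http_request_duration_seconds:p99 > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms on {{ $labels.instance }}"

      - alert: HighErrorRate
        expr: http_requests:error_rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% on {{ $labels.instance }}"

      - alert: HighCPUSaturation
        expr: node_cpu:utilization > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU utilization above 85% on {{ $labels.instance }}"
```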

Application Metrics Integration

For your applications to expose the Four Golden Signals, implement these metrics in your code. Here’s a Go example using the official Prometheus client library:

package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Latency - HTTP request duration histogram
    httpDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "Duration of HTTP requests in seconds",
            Buckets: prometheus.DefBuckets,
        },
        // "status" matches the label the recording rules query on
        []string{"method", "endpoint", "status"},
    )

    // Traffic - HTTP request counter
    httpRequests = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Saturation - in-flight requests gauge
    httpInFlight = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "http_requests_in_flight",
            Help: "Current number of HTTP requests being processed",
        },
    )
)

func init() {
    prometheus.MustRegister(httpDuration, httpRequests, httpInFlight)
}

func instrumentHandler(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        httpInFlight.Inc()
        defer httpInFlight.Dec()

        // Wrap the ResponseWriter so the status code can be captured
        rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

        next(rw, r)

        duration := time.Since(start).Seconds()
        // strconv.Itoa, not string(): string(200) yields the rune "È", not "200"
        status := strconv.Itoa(rw.statusCode)

        // NOTE: raw URL paths can explode label cardinality; prefer route
        // templates (e.g. "/users/:id") for the endpoint label in production
        httpDuration.WithLabelValues(r.Method, r.URL.Path, status).Observe(duration)
        httpRequests.WithLabelValues(r.Method, r.URL.Path, status).Inc()
    }
}

func main() {
    http.HandleFunc("/", instrumentHandler(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))
    // Expose the metrics endpoint that Prometheus scrapes
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}
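
With this instrumentation in place, all four signals can be read straight off the exposed metrics. Example queries (the `endpoint` label comes from the handler above):

```promql
# Latency: per-endpoint P95
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (endpoint, le))

# Traffic: per-endpoint request rate
sum(rate(http_requests_total[5m])) by (endpoint)

# Saturation (application-level): requests currently being processed
sum(http_requests_in_flight)
```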

Grafana Dashboard Implementation

Create a comprehensive Grafana dashboard that visualizes all four Golden Signals. The dashboard JSON is available in the accompanying grafana-dashboard.json file, but here are the key panels:

Latency Panel Configuration

{
  "title": "Request Latency Percentiles",
  "type": "graph",
  "targets": [
    {
      "expr": "http_request_duration_seconds:p50",
      "legendFormat": "p50 - {{instance}}"
    },
    {
      "expr": "http_request_duration_seconds:p95", 
      "legendFormat": "p95 - {{instance}}"
    },
    {
      "expr": "http_request_duration_seconds:p99",
      "legendFormat": "p99 - {{instance}}"
    }
  ],
  "yAxes": [{
    "label": "Seconds",
    "logBase": 1,
    "min": 0
  }],
  "alert": {
    "conditions": [{
      "query": {"queryType": "", "refId": "B"},
      "reducer": {"type": "last", "params": []},
      "evaluator": {"params": [0.5], "type": "gt"}
    }],
    "executionErrorState": "alerting",
    "frequency": "10s",
    "handler": 1,
    "name": "High Latency Alert",
    "noDataState": "no_data"
  }
}

Traffic Panel Configuration

{
  "title": "Request Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(http_requests:rate5m) by (instance)",
      "legendFormat": "RPS - {{instance}}"
    }
  ],
  "yAxes": [{
    "label": "Requests/sec",
    "min": 0
  }]
}

Error Rate Panel Configuration

{
  "title": "Error Rate",
  "type": "singlestat",
  "targets": [
    {
      "expr": "sum(http_requests:error_rate5m)",
      "legendFormat": "Error Rate"
    }
  ],
  "valueMaps": [
    {"value": "null", "text": "N/A"}
  ],
  "thresholds": "0.01,0.05",
  "colorBackground": true,
  "format": "percentunit"
}

Saturation Panel Configuration

{
  "title": "Resource Utilization",
  "type": "graph", 
  "targets": [
    {
      "expr": "node_cpu:utilization",
      "legendFormat": "CPU - {{instance}}"
    },
    {
      "expr": "node_memory:utilization",
      "legendFormat": "Memory - {{instance}}"
    },
    {
      "expr": "node_filesystem:utilization",
      "legendFormat": "Disk - {{instance}}"
    }
  ],
  "yAxes": [{
    "label": "Utilization %",
    "max": 1,
    "min": 0
  }],
  "thresholds": [{
    "value": 0.8,
    "colorMode": "critical",
    "line": true,
    "fill": true
  }]
}

Datadog Implementation

For teams using Datadog, here’s how to implement the Four Golden Signals using their agent and API:

Datadog Agent Configuration

# datadog.yaml
api_key: YOUR_API_KEY
site: datadoghq.com

logs_enabled: true
process_config:
  enabled: true

apm_config:
  enabled: true
  env: production

# Enable Kubernetes integration
kubernetes_kubelet_host: ${DD_KUBERNETES_KUBELET_HOST}
kubernetes_http_kubelet_port: 10255
kubernetes_https_kubelet_port: 10250

# Custom metrics collection
dogstatsd_config:
  enabled: true
  bind_host: 0.0.0.0
  port: 8125
  non_local_traffic: true
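
With DogStatsD listening on UDP port 8125, applications can emit the same four signals without a Prometheus client. The wire format, for reference (the metric names and tags here are illustrative; most teams use an official Datadog client library rather than raw UDP):

```
# Traffic: counter, one increment per request
http.requests:1|c|#method:GET,endpoint:/api/orders,status:200

# Latency: histogram, value in milliseconds
http.request.duration:42|h|#method:GET

# Saturation: gauge of in-flight requests
http.requests.in_flight:7|g
```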