João Pereira
Tags: sre, reliability, operations

Notes on Site Reliability Engineering

Site Reliability Engineering: How Google Runs Production Systems, by Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Richard Murphy

Part I: Introduction

Chapter 1 — Introduction

  • SRE is what happens when you ask a software engineer to design an operations team
  • Development teams want to launch features; ops teams want stability — because most outages come from changes, these goals are fundamentally in tension; SRE is the resolution
  • Google places a 50% cap on aggregate "ops" work (tickets, on-call, manual tasks) for all SREs; it's an upper bound, not a target
    • Excess ops work is redirected back to product teams until load drops below 50%
    • Time spent on ops is tracked; remaining time goes to project/engineering work
  • SREs should receive at most two events per 8–12 hour on-call shift, on average; consistently fewer than one per shift wastes the on-caller's time
  • DevOps vs SRE: DevOps is a generalization of SRE principles to a wider range of organizations; SRE is a specific implementation of DevOps with some idiosyncratic extensions
  • SRE team is responsible for: availability, latency, performance, efficiency, change management, monitoring, emergency response, capacity planning
  • Postmortems should be written for all significant incidents regardless of whether a page was triggered
    • Postmortem goal: document what happened, find all contributing causes, assign corrective actions
    • Google operates under a blame-free postmortem culture
  • 100% is the wrong reliability target for basically everything; no user can tell the difference between 100% and 99.999%
    • Once an availability target is set, the remaining tolerance is the error budget
    • Error budget can be spent on anything — features, experiments, rollouts
    • Error budget removes the structural conflict of incentives between dev and SRE; outages become expected parts of the innovation process, not catastrophes
  • Reliability is a function of MTTF and MTTR; Google observed 3x improvement in MTTR when playbooks were introduced vs improvisation
  • ~70% of outages are caused by changes to a live system; best practices: progressive rollouts, fast detection, safe rollbacks
  • Three valid outputs of a monitoring system: alerts (immediate human action needed), tickets (eventual action needed), logs (no action needed)
  • Capacity planning requires accurate organic demand forecasts, incorporation of inorganic demand, and regular load testing
  • A slowdown in a service equates to a loss of capacity; provision to meet a capacity target at a specific response speed
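The MTTF/MTTR relationship can be made concrete with the steady-state availability formula (a textbook model; the numbers below are illustrative, not Google's):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# The playbook effect: cutting MTTR ~3x at a fixed MTTF raises availability.
before = availability(mttf_hours=1000, mttr_hours=3)
after = availability(mttf_hours=1000, mttr_hours=1)
```

This is why playbooks pay off: MTTF is hard to move, but MTTR is largely a function of how prepared the responder is.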

Chapter 2 — The Production Environment at Google, from the Viewpoint of an SRE

  • DC topology: machines → racks → rows → clusters → datacenter → campus
  • Homogeneous environments, common deployment patterns, shared storage, and shared scheduling are what make reliability possible at scale
  • Every service depends on other services; those dependencies define the failure modes
  • Platform standardization is reliability work — it is one of the main ways SRE scales impact without scaling headcount

Part II: Principles

Chapter 3 — Embracing Risk

  • Cost of reliability has two dimensions: cost of redundant resources, and opportunity cost of engineering time spent on risk-reduction instead of features
  • SREs see risk as a continuum; a target is both a minimum and a maximum — you want to exceed it, but not by much
  • Time-based availability: uptime / (uptime + downtime); e.g., 99.99% target = max ~52 minutes downtime per year
  • Aggregate availability: successful requests / total requests; e.g., 2.5M reqs/day at 99.99% target = max 250 errors/day
  • Factors to consider when assessing a service's risk tolerance: required availability level, impact of different failure types, cost of the service, other relevant metrics (latency, etc.)
  • Things to consider when setting an availability target: what will users expect, is the service tied to revenue, is it paid or free, what do competitors offer, is it consumer or enterprise?
  • Cost calculation example: improving 99.9% → 99.99% on a $1M/year service has a value of ~$900; only worthwhile if the improvement costs less
  • Error budget formation: product management defines an SLO; actual uptime is measured by monitoring; the gap is the error budget
    • As long as budget remains, new releases can be made
    • Error budget can block deployments temporarily to pressure reliability focus
    • SREs must have authority to stop launches if budget runs out
    • Sometimes an SLO has to be loosened to allow more innovation
  • Typical factors causing dev/SRE tension: software fault tolerance, testing depth, push frequency, canary duration and size — error budgets make this balance data-driven instead of political
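The availability arithmetic above is easy to sketch; the `release_allowed` gate is an illustrative stand-in for an error-budget policy, not Google's actual tooling:

```python
def downtime_budget_minutes(target: float, window_days: int = 365) -> float:
    """Maximum tolerable downtime for a time-based availability target."""
    return window_days * 24 * 60 * (1 - target)

def error_budget_requests(target: float, total_requests: int) -> int:
    """Maximum tolerable failed requests for a request-based target."""
    return round(total_requests * (1 - target))

def release_allowed(measured_availability: float, slo: float) -> bool:
    """Illustrative error-budget gate: ship only while budget remains."""
    return measured_availability >= slo

# The chapter's examples: ~52 min/year at 99.99%, 250 errors/day at 2.5M req/day.
yearly_minutes = downtime_budget_minutes(0.9999)
daily_errors = error_budget_requests(0.9999, 2_500_000)
```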

Chapter 4 — Service Level Objectives

  • SLI (Service Level Indicator): quantitative measure of some aspect of service level (latency, error rate, throughput, availability, durability)
  • SLO (Service Level Objective): target value or range for an SLI; structure is SLI ≤ target or lower bound ≤ SLI ≤ upper bound
    • Without explicit SLOs, users form their own beliefs about performance — leading to over-reliance or under-reliance
  • SLA (Service Level Agreement): explicit or implicit contract with consequences for missing SLOs; easy test — "what happens if the SLO isn't met?" — if nothing, it's an SLO not an SLA
  • SLIs by system type:
    • User-facing: availability, latency, throughput
    • Storage: latency, availability, durability
    • Big data: throughput, end-to-end latency
    • All systems: correctness
  • Most metrics are better thought of as distributions, not averages
    • 99th percentile shows plausible worst-case; 50th percentile shows typical case
    • High variance in response times affects user experience disproportionately; some teams focus only on high percentile values
    • Metrics averaged per minute can hide bursts
  • Collect client-side metrics when possible; not measuring at the client misses problems that don't show up server-side
  • Chubby example: Chubby was so reliable that teams stopped designing for its absence; solution was to take it down deliberately when it was too far above its SLO for the quarter
  • Tips for choosing SLO targets:
    • Don't pick a target based on current performance — you might be supporting a system that requires heroic effort
    • Keep it simple; complicated aggregations obscure changes
    • Avoid absolutes (YAGNI)
    • Have as few SLOs as possible; defend each one you pick
    • Perfection can wait — start loose and tighten, not the other way around
  • Keep a safety margin: use internal SLOs (stricter) and external SLOs (looser); don't advertise your internal target externally
  • Don't overachieve: users become dependent on over-performing services; deliberately throttle or take the system offline occasionally to avoid over-reliance
  • SLOs should specify how they're measured and conditions under which they're valid, e.g., "99% of GET RPC calls will complete in < 100ms averaged over 1 minute across all backend servers"
  • Error budget is effectively an SLO for meeting other SLOs; track it daily/weekly; upper management typically looks at monthly or quarterly
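The distribution-over-average point can be demonstrated with a nearest-rank percentile over a made-up latency sample (values are invented):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(p/100 * N) via negation
    return ordered[int(rank) - 1]

latencies_ms = [12, 14, 15, 15, 16, 18, 22, 25, 480, 950]
typical = percentile(latencies_ms, 50)        # what most users see: 16 ms
tail = percentile(latencies_ms, 99)           # plausible worst case: 950 ms
mean = sum(latencies_ms) / len(latencies_ms)  # 156.7 ms, skewed by two outliers
```

Note how the mean (156.7 ms) describes no actual user's experience, while p50 and p99 describe the typical and worst plausible cases respectively.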

Chapter 5 — Eliminating Toil

  • Toil is work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth
    • Overhead (admin work not tied directly to running a service) is different from toil
    • If human judgment is essential, it might not be toil
    • If the service remains in the same state after you finish the task, it was probably toil
  • Typical SRE activities: software engineering, systems engineering, toil, overhead
  • Top sources of toil: non-urgent service-related messages/email, on-call response, releasing
  • Every SRE needs to spend at least 50% of their time on engineering work (when averaged over a few quarters — toil tends to be spiky)
  • If individual SREs report excessive toil, it's a signal for managers to redistribute load more evenly and help those SREs find engineering projects
  • Toil in small amounts can be tolerable; some people gravitate toward it
  • Toil becomes toxic in large quantities — leads to career stagnation, burnout, boredom, sets a precedent for loading more toil onto SREs, promotes attrition, and causes breach of faith with new hires promised project work

Chapter 6 — Monitoring Distributed Systems

  • Monitoring: collecting, processing, aggregating, and displaying real-time quantitative data about a system
  • White-box monitoring: based on metrics exposed by internals (logs, JVM profiling, HTTP handlers); Black-box monitoring: testing externally visible behavior as a user would see it
  • Why monitor: analyze long-term trends, compare over time or experiment groups, alerting, build dashboards, conduct ad hoc retrospective analysis
  • Never trigger an alert simply because "something seems a bit weird" (security auditing on very narrow scopes is an exception)
  • When pages occur too frequently, engineers second-guess, skim, or ignore alerts — including real ones masked by noise
  • Avoid "magic" systems that try to learn thresholds or automatically detect causality (rules detecting unexpected changes in end-user request rates are a valid counter-example)
  • Complex dependency hierarchies ("if DB is slow, alert for DB; otherwise alert for website") only work for very stable system parts
  • Four golden signals:
    • Latency: time to service a request; distinguish latency of successful requests vs. failed ones; increases in latency are an early indicator of saturation
    • Traffic: how much demand is being placed on the system (requests/s, broken out by request type)
    • Errors: explicit (500s), implicit (200 with wrong content), or by policy (response over 1s = error if you've committed to 1s)
    • Saturation: how "full" the service is; emphasizes the most constrained resource; many systems degrade before 100% utilization so having a utilization target is essential; also covers predictions of impending saturation ("DB will fill in 4 hours")
  • If you measure all four golden signals and page when one is problematic, your service will be at least decently covered
  • For tail latency: collect request counts bucketed by latencies (histograms) rather than raw latencies
  • Alerting rules for humans should be simple, predictable, reliable, and represent a clear failure
  • Questions to ask before creating an alert:
    • Does this detect an otherwise-undetected condition that is urgent, actionable, and user-visible?
    • Will I ever be able to ignore this alert knowing it's benign?
    • Can I take action in response? Is that action urgent, or can it wait until morning?
    • Are other people already being paged for this?
  • On pages: every page should be actionable; every page response should require intelligence; pages with rote algorithmic responses are a red flag; pages should be about novel problems
  • Spend more effort catching symptoms than causes; only worry about definite, imminent causes
  • In Google's experience: simple collection + aggregation + alerting + dashboards works well; add complexity only when needed
  • Periodic reviews of page frequency are done with management via quarterly reports (target: a couple of pages per shift)
  • Often, sheer heroic effort can achieve high availability short-term — but a controlled short-term hit is usually a better long-run trade than sustained burnout
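The histogram advice from the tail-latency bullet can be sketched like this; bucket boundaries and function names are arbitrary choices, not Borgmon's:

```python
import bisect

# Exponential bucket upper bounds in milliseconds (an illustrative choice).
BOUNDS = [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000]

def bucket_counts(latencies_ms):
    """Count requests per latency bucket instead of storing raw samples."""
    counts = [0] * (len(BOUNDS) + 1)  # final bucket is the +Inf overflow
    for ms in latencies_ms:
        counts[bisect.bisect_left(BOUNDS, ms)] += 1
    return counts

def approx_percentile(counts, p):
    """Upper bound of the bucket containing the p-th percentile."""
    target = sum(counts) * p / 100
    seen = 0
    for i, c in enumerate(counts):
        seen += c
        if seen >= target:
            return BOUNDS[i] if i < len(BOUNDS) else float("inf")
    return float("inf")
```

Counts per bucket are cheap to collect, aggregate across servers, and still recover percentiles to bucket precision, which is all alerting needs.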

Chapter 7 — The Evolution of Automation at Google

  • "Automation is a force multiplier, not a panacea"

  • Automation is meta-software: software to act on software

  • Doing automation thoughtlessly can create as many problems as it solves

  • Value of automation: consistency (very few humans act with equal consistency every time), platform extensibility, reduced MTTR, faster non-repair actions (failovers, traffic switching), time savings — decoupling the operator from the operation is powerful

  • Hierarchy of automation maturity:

    1. No automation
    2. Externally maintained, system-specific automation (script in an SRE's home folder)
    3. Externally maintained, generic automation (documented for the team)
    4. Internally maintained, system-specific automation (versioned to the system's repo)
    5. Systems that don't need any automation (the goal)
  • Infrequently run automation is fragile

  • Relieving teams from ops responsibility can remove their incentive to reduce tech debt; product managers not affected by low-quality automation will always prioritize new features

  • Automation failure risk: when automation covers more and more daily activities, human operators lose direct contact with the system; when automation fails, humans may no longer be able to operate it — this is unavoidable in sufficiently autonomous systems, and must be accounted for

  • Case study (Ads Database): failovers automated, outage no longer paged a human; total operational maintenance cost dropped ~95%; up to 60% of hardware utilization freed

  • Case study (Cluster turnups): early automation was an initial win, but free-form scripts became technical debt; Prodtest (Python unit test framework extended for real-world services) created a chain of tests that could validate a service's configuration across all clusters

Chapter 8 — Release Engineering

  • Release engineering is a distinct discipline: release engineers work with SWEs and SREs to define how software is released, from version control through compilation, testing, packaging, and deployment
  • High velocity models: some teams do hourly builds with deploy based on test results; others use "push on green" (every build that passes tests goes to production)
  • Hermetic builds: building the same revision always produces identical results; self-contained including the compiler version; allows cherry-picking fixes against old revisions to fix production software
  • All code lives in the main branch; releases are branched off; fixes flow from main to the release branch via cherry-pick; branches never merge back
  • CI creates an audit trail: tests ran, tests passed
  • Config management: deceptively simple, a major source of instability; all schemes should involve source control and strict review
    • Option: use mainline for config (decouples binary releases from config changes)
    • Option: include config files in the same package as the binary (simple, but tightly coupled)
    • Option: package config separately using the same hermetic principle as code
  • Six gated operations requiring approval: source code change, release process action definitions, new release creation, integration proposal approval, release deployment, build configuration modifications
  • Packages are named, versioned with a unique hash, and signed
  • Budget for release engineering resources at the beginning of the product development cycle — it's cheaper to do it now than later
  • Common questions every team needs to answer: how to handle package versioning, CI or CD, release cadence, config management policies, release metrics

Chapter 9 — Simplicity

  • "At the end of the day, our job is to keep agility and stability in balance in the system"
  • "The price of reliability is the pursuit of the utmost simplicity"
  • "Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code"
  • Reliable systems can increase agility: reliable rollouts make it easier to link changes to bugs
  • Essential vs accidental complexity: SREs should push back when accidental complexity is introduced
  • Code is a liability, not an asset; remove dead code and other bloat
  • Commented-out code is an anti-pattern; forever-gated feature flags are an anti-pattern (flags should be actively rehearsed and removed)
  • Smaller APIs are easier to test and more reliable; avoid misc/util classes
  • Small releases are easier to debug and measure; you can't tell what happened if 100 changes were released together
  • No monitoring: you're blind; SREs don't go on-call for the sake of it, they do it to stay in touch with how systems work and fail
  • Managing incidents effectively reduces impact and limits outage-induced anxiety; blameless postmortems are the first step to understanding what went wrong

Part III: Practices

Chapter 10 — Practical Alerting

  • Monitoring a very large distributed system presents challenges: vast number of components, need for low maintenance burden
  • Borgmon (Google's internal monitoring system, conceptually similar to Prometheus): a programmable calculator with syntactic sugar for generating alerts using a common data exposition format
  • Time-series data: conceptually a 2D array with time on one axis and items on the other; each series named by a unique set of labels (name=value)
    • Data points are (timestamp, value) stored in chronological lists
    • Data stored in-memory, checkpointed to disk; fixed-size allocation; oldest entries GC'd when full
  • Counters: monotonically non-decreasing variables (km driven, request count); preferred over gauges because they don't lose meaning when events occur between sampling intervals
  • Gauges: any value, doesn't have to be monotonically shifting (fuel remaining, current speed)
  • Labels serve three purposes: define breakdowns of the data itself, define the source of the data (service name, container), indicate locality or aggregation (zone, shard)
  • Alertmanager: can inhibit certain alerts when others are active, deduplicate alerts from multiple instances with same labelsets, fan-in or fan-out alerts based on labelsets
  • White-box monitoring (Borgmon/Prometheus): sees system internals; Black-box monitoring (Prober): looks at system from outside, monitors what the user sees
  • Page-worthy alerts go to on-call rotation; non-page-worthy alerts go to a separate processing queue or as informational data — this distinction is highlighted repeatedly as an Important Detail
  • Rules generating alerts for humans should be simple, represent clear failures, and require intelligence to respond to
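The counters-over-gauges argument can be sketched as a simplified rate over sampled counter values; real Borgmon/Prometheus rate functions are more careful, and the reset handling here is a naive assumption:

```python
def rate_per_second(samples):
    """Per-second rate from (timestamp, counter_value) samples.

    Counters only increase, so the rate over a window stays meaningful
    even when events happen between sampling intervals. A decrease means
    the process restarted; naively assume it restarted from zero.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        increase += (v1 - v0) if v1 >= v0 else v1  # reset: restarted near 0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# A requests counter sampled every 10 s, with a process restart after t=20:
samples = [(0, 100), (10, 160), (20, 220), (30, 10)]
```

A gauge sampled at the same points would have silently lost everything that happened between samples; the counter preserves it.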

Chapter 11 — Being On-Call

  • On-call = being available to step in, reacting within a specific time bound (minutes or hours depending on SLA)
  • Typical response times: 5 minutes for user-facing or time-critical tasks; 30 minutes for less time-sensitive
  • When a page arrives: acknowledge, triage, escalate if necessary
  • Non-paging events are less urgent but on-call engineers should vet them during business hours
  • Primary and secondary roles: primary handles pages; secondary is the fall-through when the primary misses a page, and also handles non-paging events and escalation support
  • Balancing on-call quantity: SREs should spend ≥50% of time on engineering; of remaining time, no more than 25% on on-call
  • Balancing on-call quality: incident handling (root cause, remediation, postmortem, bug fix) takes ~6 hours; therefore max ~2 incidents per 12-hour shift; strive for a very flat distribution of pages with median 0
  • Night shifts degrade health; multi-site teams eliminate night shifts by following the sun; caveat: significant communication and handoff overhead
  • SRE managers must keep quantity and quality balanced
  • Most important on-call resources: clear escalation paths, well-defined incident management procedures, blameless postmortem culture
  • All paging events should be actionable; silencing noisy non-actionable alerts reduces fatigue
  • Strive for a 1:1 alert-to-incident ratio; multiple alerts firing for the same incident are noise
  • If too many pages occur, give the pager to the developers owning those services and work with them until alert quality returns to standard; feature development halts until this is resolved
  • Compensation: extra pay or time-off for on-call shifts; capped at a proportion of salary to incentivize involvement while limiting burnout
  • Operational underload is also a problem: teams should be sized so every engineer is on-call once or twice a quarter, to stay in touch with production
  • Wheel of Misfortune helps hone SRE capabilities
  • Two modes of thinking under pressure: intuitive/rapid action vs rational/deliberate cognitive function; the latter leads to better outcomes during incidents
  • SRE teams can be loaned to overloaded teams temporarily; measure overload symptoms (daily tickets, paging events per shift) explicitly

Chapter 12 — Effective Troubleshooting

  • Troubleshooting is the application of the hypothetico-deductive method: iterate hypotheses until one holds
  • Troubleshooting is learnable; expertise comes from investigating failures, not just understanding normal operation
  • Ideally a problem report gives the top-level symptom; start drilling down into telemetry and logs, narrow down culprits, exclude parts of the system (bisection is a useful tactic), identify contributing factors
  • Two ways to test hypotheses: compare observed state against theory to find confirming/disconfirming evidence, or treat the system (change something in a controlled way and observe)
  • Common troubleshooting pitfalls:
    • Looking at irrelevant symptoms (wild goose chases)
    • Misunderstanding system dynamics (inputs, behavior, outputs)
    • Coming up with wildly improbable theories
    • Hunting down spurious correlations and coincidences
    • Confusing correlation with causation
  • Always prefer simple explanations; the four golden signals are useful scaffolding for building simple explanations
  • An effective problem report contains: expected behavior, reproduction steps, consistent form, and exists somewhere searchable
  • Stop the bleeding first — make the software work before investigating root cause; preserve earlier evidence of the incident for later
  • Structured logs are important for retrospective analysis; pass trace IDs using a common standard through all layers
  • Design systems with well-understood and observable interfaces between components; observability-driven engineering makes troubleshooting sessions dramatically shorter
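The bisection tactic amounts to a binary search over the ordered change history, assuming health flips exactly once; `is_healthy` stands in for whatever probe you have:

```python
def first_bad_change(changes, is_healthy):
    """Binary-search an ordered change list for the first change after
    which the system is unhealthy (assumes health flips exactly once)."""
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(changes[mid]):
            lo = mid + 1  # failure was introduced later
        else:
            hi = mid      # this change or an earlier one broke it
    return changes[lo]

# Hypothetical: system healthy through change 6, broken from change 7 onward.
culprit = first_bad_change(list(range(1, 11)), lambda c: c < 7)
```

Each probe halves the suspect set, so even hundreds of candidate changes need only a handful of checks.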

Chapter 13 — Emergency Response

  • Don't panic: you're not alone, the sky is not falling, nobody is dying; if you feel overwhelmed, pull in more people (sometimes everyone has to be paged)
  • Follow an incident response process; not following it is itself a contributing cause of incidents
  • Three types of emergencies:
    • Test-induced: planned, proactive ways to break production; failures are controlled and aborted when things go wrong
    • Change-induced: incident stems from deployment or configuration changes
    • Process-induced: incident caused by a process (usually automated) that wreaks havoc (e.g., automation that wipes hard drives)
  • Lessons from test-induced emergency example:
    • Nobody really understood how two systems interacted — review hadn't been good enough despite many eyes
    • Incident response process was not followed, which prevented wider awareness
    • Rollback procedures had not been rehearsed in a test environment — they were broken
    • Now: rollback procedures are tested before any large-scale test
  • "All problems have solutions — a solution exists, even if it's not obvious, especially to the person whose pager is screaming"
  • Involve the person whose actions triggered the incident; they know the most and change-induced emergencies are typically fixed faster with them involved
  • Keep a history of outages: ask hard questions, look for strategic (not just tactical) preventive actions, publish postmortems somewhere everyone can read them, hold people accountable to follow-up actions
  • Until a system has failed, you don't know how it, its upstream systems, or its users will react — don't assume

Chapter 14 — Managing Incidents

  • Recursive separation of responsibilities: delegate distinct roles with clear boundaries
    • Incident Commander: structures the response task force, assigns responsibilities, holds all undelegated roles; most important task is maintaining a living incident document
    • Ops Lead: works with incident command; only person modifying the system during the incident
    • Communications Lead: public face of the task force; provides periodic updates to the team and stakeholders
    • Planning Lead: handles longer-term issues — bug filing, arranging dinner, tracking handoffs, recording how the system has diverged from normal so it can be reverted
  • A single war room (physical or virtual) is recommended; incident command handoffs must be done loudly and explicitly with explicit acknowledgment from all participants
  • When to declare an incident (declare early rather than late): do you need a second team? Is the outage customer-visible? Has the issue gone unsolved after an hour of concentrated effort?
  • Incident management proficiency degrades when not in regular use
  • Best practices:
    • Prioritize: stop the bleeding, restore service, preserve evidence for the postmortem
    • Prepare: develop and document procedures in advance with incident participants
    • Trust: give full autonomy within each assigned role
    • Introspect: if you feel panicky or overwhelmed, get more support
    • Consider alternatives: periodically re-evaluate whether the current approach still makes sense
    • Practice: use the process routinely so it becomes second nature
    • Rotate roles: encourage every team member to gain familiarity with each role

Chapter 15 — Postmortem Culture: Learning from Failure

  • "The cost of failure is education"
  • Postmortem definition: a written record of an incident, its impact, the actions taken to mitigate or resolve it, the contributing causes, and follow-up actions to prevent recurrence
  • Primary goals: document the incident, understand all contributing causes, take preventive actions to reduce likelihood and/or impact of recurrence
  • Writing a postmortem is not a punishment; it's a learning opportunity; any stakeholder may request one
  • Blamelessness:
    • Must not indict any individual or team for bad or inappropriate behavior
    • Assumes everyone had good intentions and did the right thing with the information they had at the time
    • When done well, leads to investigating why individuals had incomplete/incorrect information
    • When done badly, leads to finger-pointing and shaming — and, critically, to people hiding information in future incidents
  • Postmortem review criteria used at Google: was key incident data collected? Are impact assessments complete? Is the action plan appropriate? Are resulting bug fixes at appropriate priority? Did we share the outcome with relevant stakeholders?
  • An unreviewed postmortem might as well never have existed
  • Tools for introducing postmortem culture: postmortem of the month newsletter, postmortem reading clubs (regular sessions where impactful postmortems are read aloud), Wheel of Misfortune (re-enact a previous postmortem with the original incident commander present)
  • Make writing effective postmortems a rewarded and celebrated practice; even senior leadership should acknowledge and participate (book mentions Larry Page talking about the value of postmortems)
  • Ask for feedback on effectiveness: is the culture supporting your work? Does writing one entail too much toil? What tools would you like to see?

Chapter 16 — Tracking Outages

  • Postmortems provide useful insights for individual services but can miss opportunities with small per-service impact but large horizontal impact
  • The Escalator: Google's in-house PagerDuty equivalent; a centralized system that tracks whether alerts are acknowledged and escalates to the next responder when they aren't
  • The Outalator: time-interleaved view of notifications for multiple queues at once; allows annotating incidents, marking annotations as important, silently saving email replies, and combining multiple escalating notifications into a single incident entity
  • A single event often triggers multiple alerts; the ability to group multiple alerts into a single incident is critical
  • Track outages with consistent definitions, user impact, duration, and cause categories; this makes reliability visible enough to influence prioritization

Chapter 17 — Testing for Reliability

  • "If you haven't tried it, assume it's broken"
  • Confidence comes from both past reliability and future reliability; for future predictions to hold, either the system remains completely unchanged or you can confidently describe all changes
  • Passing tests doesn't prove reliability; failing tests generally prove its absence
  • Zero MTTR: when a system-level test detects exactly the same problem monitoring would detect, the bug can be repaired by blocking the push; that is quick and convenient, so the effective MTTR is zero
  • The more bugs caught pre-production (zero MTTR), the higher the MTBF
  • Test types:
    • Unit tests: smallest/simplest; assess a single unit of software (class, function) independent of the larger system
    • Integration tests: assembled component verification; use dependency injection and mocks to test components in isolation
    • System tests: largest scale for undeployed systems; end-to-end functionality
      • Smoke tests: simple but critical behavior; short-circuit additional expensive testing
      • Performance tests: check performance stays acceptable over the lifecycle
      • Regression tests: prevent known bugs from sneaking back; gallery of rogue bugs
    • Stress tests: find the limits of a web service
    • Canary tests: a subset of servers upgraded to a new version/config and left in incubation; not really a test, it's structured user acceptance
  • With an exponential rollout, canary stages don't need equal-sized fractions of user traffic to give a fair signal
  • CI/CD: works best when engineers are notified the moment the build pipeline fails; unblocking the pipeline should always be the first priority
  • Config files that change more often than once per application release are a major reliability risk if those changes aren't treated the same as application releases
  • Config file contents are potentially hostile to the interpreter reading them — a potential security threat vector
  • Disaster recovery tools should work "offline" using checkpoint states; they're expected to work with instant consistency, not eventual consistency
  • Statistical techniques like fuzzing and chaos testing aren't necessarily repeatable; improve repeatability using seeded random number generators
  • Key element of site reliability: find each anticipated form of misbehavior and make sure some test reports it
  • SRE tools need to be tested too (tools that retrieve/propagate metrics, predict usage, plan capacity)
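The seeded-RNG point about repeatable fuzzing, as a sketch: log only the seed, and any failing input can be regenerated exactly (the `parse` target and case shape are placeholders):

```python
import random

def fuzz_case(seed: int) -> bytes:
    """Deterministically derive a fuzz input from a seed, so a failing
    case can be replayed exactly by logging only the seed."""
    rng = random.Random(seed)
    size = rng.randint(0, 64)
    return bytes(rng.randrange(256) for _ in range(size))

def run_fuzz(parse, n_cases=1000, start_seed=0):
    """Return the seeds whose generated inputs made `parse` raise."""
    failures = []
    for seed in range(start_seed, start_seed + n_cases):
        try:
            parse(fuzz_case(seed))
        except Exception:
            failures.append(seed)
    return failures
```

The run stays statistical (different seed ranges explore different inputs) but any individual failure becomes a repeatable regression test.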

Chapter 18 — Software Engineering in SRE

  • Growth rate of SRE-supported services exceeds the growth rate of the SRE organization; one SRE guiding principle is that team size should not scale directly with service growth
  • Team diversity is critical: a mix of traditional software development and systems engineering backgrounds helps prevent blind spots
  • Intent-based capacity planning: specify requirements (intent), not implementation; encode them, autogenerate the allocation plan
    • Ladder of increasingly intent-based planning:

      1. "I want 50 cores in clusters X, Y, Z" — why those?
      2. "I want 50 cores in any 3 clusters in region" — why 50, why 3?
      3. "I want to meet demand with N+2 redundancy" — why N+2?
      4. "I want 5 nines of reliability" — could find N+2 insufficient
    • Greatest gains from going to level 3; some sophisticated services go to level 4

  • Auxon case study (Google's intent-based capacity planning tool):
    • Built by an SRE who was managing capacity in spreadsheets, then formalized into a full product with backlog, SLA, team ownership
    • Inputs: requirements/intent, performance data, demand forecasts, resource supply and pricing
    • Uses a mixed-integer or linear programming solver
    • Key learnings:
      • Don't focus on perfection; launch and iterate
      • A single email doesn't drive adoption; it takes a consistent approach, user advocacy, and senior sponsorship
      • Small releases build confidence; don't over-customize for a few big users; don't chase a 100% adoption rate
      • A "seed team" of generalists plus deep-expertise engineers works well
  • SRE software must be designed for scalability, graceful degradation on failure, and easy integration with other infrastructure
  • Good candidate SRE projects: reduce toil, improve existing infrastructure, streamline a complex process, and must fit org-wide objectives
  • SREs who build products should continue working as SREs rather than becoming embedded developers — they dogfood the tools and bring an invaluable operational perspective
  • Guidelines for building SRE software: create a clear message and communicate benefits (SREs are skeptical), evaluate org capabilities, launch and iterate to establish credibility, hold yourself to the same standards as a product team

Chapter 19 — Load Balancing at the Frontend

  • DNS is typically the first layer of load balancing; conceptually simple but many dragons exist
    • Very little control over client behavior; records selected randomly
    • DNS server acts as a caching layer: recursive resolution makes it difficult to find the optimal IP for a given user; responses are cached with TTL
    • DNS alone is insufficient; not the right solution for fine-grained control
  • Better approach: DNS combined with Virtual IP (VIP) addresses
    • Network Load Balancer sits in front; receives packets and routes to available backends
    • Consistent hashing: mapping algorithm that remains stable when backends are added or removed, minimizing disruption to existing connections
    • Simple connection tracking as default; fall back to consistent hashing under pressure
  • Packet forwarding strategies:
    • NAT: requires the balancer to track every connection, which precludes a completely stateless fallback mechanism
    • Direct Server Response (DSR) (layer 2 modification): all LBs and backends must be reachable at the data link layer; Google stopped using this
    • Packet encapsulation (GRE): Google started using this; introduces overhead (~24 bytes for IPv4+GRE) that can exceed MTU and require fragmentation
  • Load balancer should always prefer redirecting to the least loaded backend
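The consistent-hashing idea above can be sketched as a small hash ring (illustrative Python; the class name and the `vnodes` parameter are mine, not the book's):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash; any uniform hash function would do here.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Minimal hash ring with virtual nodes: removing one backend only
    remaps the connections that hashed to that backend."""
    def __init__(self, backends, vnodes=100):
        self._ring = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def backend_for(self, conn_key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping around).
        idx = bisect.bisect(self._keys, _hash(conn_key)) % len(self._ring)
        return self._ring[idx][1]
```

Dropping a backend leaves every key that was mapped to a surviving backend on the same backend, which is exactly the "minimal disruption to existing connections" property the notes describe.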

Chapter 20 — Load Balancing in the Datacenter

  • Backend service: typically 100–1000 processes; ideal goal is perfectly distributed load
  • Lame duck state: backend task is listening on its port and can serve, but explicitly asks clients to stop sending new requests; broadcasts this state to all active clients
    • Main advantage: simplifies clean shutdown; avoids serving errors to requests active on shutting-down tasks
    • Shutdown sequence: scheduler sends SIGTERM → task enters lame duck → clients redirect new requests elsewhere → ongoing requests complete → task exits cleanly (or is killed)
  • Traffic sinkholing: an unhealthy task that returns errors with very low latency appears least loaded to clients, so a client ends up directing very large amounts of traffic to it
    • Fix: tune load balancer to count recent errors as active requests
  • If outgoing request latency grows (e.g., competition for network resources from a noisy neighbor), active request count also grows, increasing memory pressure, which can in turn trigger GC in managed runtimes
  • When a task restarts, it often requires significantly more resources for a few minutes (initialization cost, cold cache, JIT warmup)
  • Subsetting: clients interact with a limited subset of backends (typically 20–100); reduces connection overhead while maintaining health checking
  • Subset selection algorithms: random (bad utilization) → round-robin (permuted order) → deterministic subsetting (each backend assigned to exactly one client per round)
  • Load balancing policies:
    • Round-robin: 2x difference observed between most and least loaded in practice
    • Least-loaded round-robin: rounds among least-loaded; load measured by connection count; still suboptimal since it's per-client, not global
    • Weighted round-robin: clients maintain capability scores per backend; backends report query rate, error rate, utilization in responses; clients adjust scores periodically; best distribution in practice — recommended
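Deterministic subsetting, described above, can be sketched roughly as follows (the book presents a similar algorithm; the function name and `subset_size` values here are illustrative):

```python
import random

def deterministic_subset(backends, client_id, subset_size):
    """Each 'round' of clients shares one shuffled copy of the backend
    list, sliced into disjoint subsets, so load spreads evenly and each
    backend gets roughly the same number of client connections."""
    subsets_per_round = len(backends) // subset_size
    round_id = client_id // subsets_per_round
    subset_id = client_id % subsets_per_round
    shuffled = list(backends)
    random.Random(round_id).shuffle(shuffled)  # same seed for the whole round
    start = subset_id * subset_size
    return shuffled[start:start + subset_size]
```

Within one round, the subsets are disjoint and jointly cover all backends; across rounds, the shuffle changes, so hot clients don't pile onto the same backends.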

Chapter 21 — Handling Overload

  • Gracefully handling overload is fundamental to running a reliable service
  • Strategy: redirect when possible, serve degraded results when necessary, handle resource errors transparently when all else fails
  • QPS is a poor capacity metric because different queries have vastly different resource costs; better to measure capacity in available resources (CPU time per request is a good normalized measure)
  • When global overload occurs: deliver errors to misbehaving customers, other customers remain unaffected; reject out-of-quota requests quickly
  • Client-side throttling: when most CPU is spent rejecting requests, throttle on the client side
  • Adaptive throttling: each client tracks two values over a 2-minute window: requests (attempted) and accepts (accepted by the backend); once requests exceeds K × accepts, the client starts rejecting new requests locally with probability (requests − K × accepts) / (requests + 1); this leads to stable overall request rates in practice
  • Request criticality (Google's four tiers):
    • CRITICAL_PLUS: reserved for most critical; serious user-visible impact if they fail
    • CRITICAL: default for production jobs; will cause user-visible impact; services must provision capacity for CRITICAL_PLUS + CRITICAL traffic
    • SHEDDABLE_PLUS: partial unavailability expected; default for batch jobs
    • SHEDDABLE: frequent partial and occasional full unavailability expected
    • Criticality propagates through RPC calls (same criticality level is used for all upstream calls)
    • Only reject requests of a given criticality if already rejecting all requests of lower criticalities
  • Overload protection at Google is based on utilization (CPU rate / total CPUs reserved, executor load average, combined target thresholds); as threshold is reached, requests are rejected based on criticality
  • Overload errors: if large subset of DC is overloaded, don't retry (errors should bubble up); if small subset is overloaded, prefer immediate retry
  • Request retries: from the load balancer's perspective, retries are indistinguishable from new requests; can be organic load balancing
    • Per-request retry budget (max 3 at Google)
    • Per-client retry budget (track retry ratio; only retry if below ~10%)
    • Return "overloaded; don't retry" error response when a histogram reveals significant retry volume
    • Consider having a server-wide retry budget
  • Handling burst load: expose load to cross-datacenter load balancing algorithm; use a separate proxy backend for batch jobs to shield fan-outs from user-facing services
  • Common mistake: assuming an overloaded service should turn down and stop accepting all traffic; instead, accept as much as possible and only shed load as a last resort
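The adaptive-throttling rule above amounts to a client-side rejection probability of max(0, (requests − K × accepts) / (requests + 1)); a minimal sketch:

```python
import random

def reject_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Client-side rejection probability for adaptive throttling: stay
    quiet while requests <= K * accepts, then reject locally with a
    probability that grows as the backend accepts a smaller fraction."""
    return max(0.0, (requests - k * accepts) / (requests + 1))

def should_send(requests: int, accepts: int, k: float = 2.0) -> bool:
    # Flip a biased coin per request instead of hard-stopping, so the
    # client keeps probing the backend as it recovers.
    return random.random() >= reject_probability(requests, accepts, k)
```

With K = 2 a healthy backend (accepts ≈ requests) is never throttled; a backend rejecting everything pushes the client's local rejection probability toward 1 while still letting occasional probes through.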

Chapter 22 — Addressing Cascading Failures

  • "If at first you don't succeed, back off exponentially." + "Why do people always forget to add a little jitter?"
  • Cascading failure: failure that grows over time as a result of positive feedback; can occur when part of a system fails, increasing the probability that other parts also fail
  • Most common cause: overload
  • Resource types that can be exhausted: CPU, memory, threads, file descriptors, dependencies among resources
  • CPU exhaustion secondary effects: increased in-flight requests, longer queues, thread starvation, reduced CPU cache benefits, health check failures
  • Memory exhaustion secondary effects: dying tasks, increased GC rate in Java (GC death spiral: less CPU → slower requests → increased RAM usage → more GC → even less CPU), reduced cache hit rates, thread starvation
  • Thread starvation: can directly cause errors, health check failures; if threads added without upper bound, thread overhead uses too much RAM; also risks running out of process IDs
  • File descriptor exhaustion: inability to initialize network connections → health check failures
  • Load balancing policies that avoid servers serving errors exacerbate problems (snowball effect on remaining servers)
  • Strategies for avoiding server overload: load test capacity limits and test failure mode for overload; serve degraded results; instrument to reject requests when overloaded; have higher-level systems reject requests (reverse proxy, load balancer, task); capacity planning
  • Queue management: keep queue size ≤50% of thread pool size for steady-load services; for bursty workloads, size based on thread count, processing time, and burst size and frequency; consider LIFO queuing or controlled delay (CoDel) algorithm
  • Load shedding: drop a proportion of traffic as server approaches overload; per-task throttling based on CPU, memory, or queue length; return 503 when too many requests are in-flight
  • Graceful degradation: reduce amount of work (search in-memory cache instead of DB); keep the degradation path simple; test it regularly (a code path you don't exercise will be broken); design a way to turn it off
  • Retry guidelines: always use randomized exponential backoff; limit retries per request; avoid retrying at multiple levels (amplifies load catastrophically); separate retriable from non-retriable errors; return a specific "overloaded; don't retry" status; server-wide retry budgets
  • RPC deadlines: essential to prevent zombie requests consuming resources; propagate deadlines top-down (all downstream RPCs share the same absolute deadline); set an upper bound on outgoing deadlines; deadlines several orders of magnitude longer than mean request latency are usually bad; check deadline before continuing at each processing stage
  • Propagate cancellations: notify servers in the call chain that their work is no longer needed; some systems "hedge" requests and cancel the rest when one responds
  • Cold start issues: processes are slower right after starting (initialization, JIT, deferred class loading, cold cache); when adding load to a cluster, increase gradually
  • Always go downward in the stack; avoid intra-layer communication; communications within a layer are susceptible to distributed deadlocks
  • Triggering conditions for cascading failures: process death (Query of Death, assertion failures), process updates, new rollouts (config or infra changes), organic growth (usage exceeded capacity estimate), planned drains or turndowns
  • Depending on slack CPU as a safety net is dangerous; ensure load tests stay within committed resource limits
  • Testing for cascading failures: test until it breaks; consider both gradual and impulse load patterns; test each component separately; track state between interactions; be careful testing in production
  • Immediate steps to address cascading failures: increase resources (may not be sufficient alone), stop health check failures, restart servers (especially GC death spirals or deadlocks), drop traffic (last resort — let 1% through only), enter degraded mode, eliminate batch load, remove bad traffic
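The retry guidelines above (randomized exponential backoff plus a per-client retry budget) can be sketched as follows (the ~10% ratio comes from the notes; other names and constants are illustrative):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full-jitter randomized exponential backoff: sleep a uniform
    amount up to min(cap, base * 2**attempt), so synchronized clients
    don't retry in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

class RetryBudget:
    """Per-client retry budget: only retry while retries stay under a
    fixed fraction (~10%) of total requests, so retries can't amplify
    an overload."""
    def __init__(self, ratio: float = 0.1):
        self.requests = 0
        self.retries = 0
        self.ratio = ratio

    def record_request(self):
        self.requests += 1

    def can_retry(self) -> bool:
        return self.retries < self.ratio * self.requests

    def record_retry(self):
        self.retries += 1
```

Full jitter (uniform over [0, cap]) is one common choice; the key property from the chapter is only that the backoff is randomized and exponential, and that retries are bounded both per request and per client.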

Chapter 23 — Managing Critical State: Distributed Consensus for Reliability

  • Distributed consensus problem: reaching agreement among a group of processes connected by an unreliable network — one of the most fundamental concepts in distributed computing
  • Questions requiring consensus: who is the leader? What processes are in the group? Has a message been committed to the queue? Does a process hold a lease? What is the value for a key?
  • Whenever you see leader election, critical shared state, or distributed locking: use distributed consensus systems that have been formally proven and tested — don't roll your own
  • CAP theorem: a distributed system cannot simultaneously have all three of: consistent views of data at each node, availability of data at each node, tolerance to network partitions
  • FLP impossibility: no asynchronous distributed consensus algorithm can guarantee progress in the presence of an unreliable network; mitigated by having sufficient healthy replicas and network connectivity (plus backoff jitter to avoid dueling proposers)
  • ACID vs BASE: BASE (Basically Available, Soft state, Eventual consistency) enables higher throughput at the cost of stronger consistency guarantees; eventual consistency can produce surprising results especially with clock drift or network partitions
  • Case study 1 (split-brain): heartbeats can't be used to solve leader election; slow or packet-dropping network can cause both nodes to issue kill commands to each other
  • Case study 2 (human intervention failover): human escalation scales poorly; if the network is so badly degraded that consensus can't elect a primary, a human is not better positioned to elect one either
  • Case study 3 (faulty group membership): gossip-protocol-based cluster formation can lead to split-brain in a network partition, with both sides electing a master and accepting writes
  • Consensus algorithms: Paxos (original), Multi-Paxos, Raft, Zab, Mencius
  • Paxos overview: sequence of proposals accepted by a majority; each proposal has a unique sequence number (strict ordering); acceptors agree only if no higher sequence number seen; proposer commits by sending value to acceptors when majority agrees; acceptors must journal to disk when accepting; two different values cannot be committed for the same proposal (any two majorities overlap at ≥1 node)
  • Replicated State Machine (RSM): executes the same set of operations in the same order on several processes; fundamental building block; any deterministic program can be implemented as a highly available replicated service by modeling it as an RSM
  • Timestamps are highly problematic in distributed systems; use distributed consensus for ordering instead
  • Barriers: primitives that block a group of processes from proceeding until a condition is met; split distributed computation into logical phases; can be implemented as an RSM (Zookeeper supports barriers)
  • Locks should be used with timeouts to prevent deadlocks; supported in RSM
  • Queueing-based systems: tolerate failure/loss of worker nodes; use lease systems to ensure claimed tasks are processed; implementing the queue as an RSM makes the system far more robust
  • Atomic broadcast: messages received by all participants reliably and in the same order — an incredibly powerful primitive
  • Multi-Paxos: strong leader process enables only 1 round trip to reach consensus; backoff jitter and timeouts necessary to avoid dueling proposers
  • For read-heavy workloads: read-only consensus operation, read from replica guaranteed to be most up-to-date (stable leader can provide this), or quorum leases (strongly consistent local reads at the cost of some write performance)
  • Two physical constraints on performance: network round-trip time and lead time for writing to persistent storage
  • Conventional wisdom that consensus algorithms can't be used for high-throughput low-latency is false — proven extremely effective in practice at Google
  • Minimum replicas for non-Byzantine failures: 3 (2-node quorums cannot tolerate any failure); adding a replica to a majority quorum can decrease availability
  • Monitor consensus systems closely: number and status of replicas, lagging replicas, whether a leader exists, rate of leader changes (too rapid = flappiness, sudden decrease = serious bug), consensus transaction number (is the system making progress?), throughput and latency
  • "If you remember nothing else from this chapter": know the problems that distributed consensus can solve and the types of problems that arise from using ad hoc methods like heartbeats
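The claim that any two majority quorums overlap at one or more nodes, which is the heart of the Paxos safety argument above, is easy to verify mechanically (illustrative Python):

```python
from itertools import combinations

def majorities(n: int):
    """All majority quorums of an n-replica group."""
    need = n // 2 + 1
    for size in range(need, n + 1):
        yield from combinations(range(n), size)

def any_two_overlap(n: int) -> bool:
    """Every pair of majority quorums shares at least one acceptor, so
    two different values can never both be committed for the same
    proposal: the overlapping acceptor would have to accept both."""
    qs = [set(q) for q in majorities(n)]
    return all(a & b for a in qs for b in qs)
```

This also makes the 3-replica minimum from the notes concrete: with 2 nodes, a "majority" is both nodes, so losing either one blocks all quorums.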

Chapter 24 — Distributed Periodic Scheduling with Cron

  • Cron: Unix tool for launching arbitrary periodic jobs at user-defined times or intervals
  • Simple cron failure domain is one machine; only state needing persistence across restarts is crontab configuration; launches are fire-and-forget so launch tracking is not needed (exception: anacron-style catch-up for missed launches)
  • Cron jobs vary in character: idempotent (GC, cleanups) vs. side-effectful (email newsletters), time-pressured or not; skipping a launch is generally better than risking a double launch
  • Hosting cron on a single machine is a reliability catastrophe; decouple the cron service from machines
  • Two options for tracking state in distributed cron: external distributed storage (better for large blobs, but small writes are expensive and high-latency) vs. small local state stored as part of the cron service (no extra dependencies, but risk of data loss)
  • Paxos for distributed cron: strong consistency guarantees; leader replica is the only one that actively launches jobs; completion of a launch synced to all replicas; leader election must complete within 1 minute to avoid missing launches
  • Every cron job launch has two sync points: when the launch happens and when it finishes — these delimit the launch
  • To reduce missed/double launches when a leader dies: all operations should be idempotent (use ahead-of-time known job names), or have observability to see if launch requests all succeeded
  • Log compaction: the state change log must be compacted (snapshots work well); can store locally (fast but possible data loss) or externally (not desirable due to small write cost); Paxos helps recover from single-machine log loss via replicas
  • Thundering herd problem: many concurrent cron jobs spawning HTTP calls can cause spikes; solution: allow ? in crontab schedule fields so the cron system picks the value randomly, effectively adding jitter
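The `?`-field jitter described above might be sketched like this (purely illustrative; the chapter doesn't prescribe an implementation, and seeding by job name is my assumption to keep each job's chosen slot stable across restarts):

```python
import random

def resolve_question_marks(schedule: str, job_name: str) -> str:
    """Replace '?' in the minute/hour fields of a crontab-style
    schedule with a per-job random value, spreading out jobs that
    would otherwise all fire at e.g. midnight.
    Only minute (0-59) and hour (0-23) are handled in this sketch."""
    fields = schedule.split()
    limits = [60, 24]  # minute, hour; remaining fields left untouched
    rng = random.Random(job_name)  # stable choice per job across restarts
    for i, limit in enumerate(limits):
        if fields[i] == "?":
            fields[i] = str(rng.randrange(limit))
    return " ".join(fields)
```

So `? ? * * *` for a given job resolves to one fixed minute and hour, different for different jobs, which is exactly the "system picks the value" jitter the notes describe.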

Chapter 25 — Data Processing Pipelines

  • Data pipeline pattern: read data → transform → output; historically: co-routines, DTSS communication files, Unix pipes, ETL pipelines
  • Simple one-phase pipelines: periodic or continuous transformation on big data; multiphase pipelines chain programs in series — organized for ease of understanding rather than operational efficiency
  • Periodic pipelines are generally stable when there are sufficient workers AND execution demand is within computation capacity; fragile when growth and changes strain resources
  • "Embarrassingly parallel" algorithms cut workloads into chunks per machine; end-to-end runtime is capped by the largest customer's runtime
  • Hanging chunk problem: uneven resource distribution across cluster machines; typical user code waits for total computation to complete — one slow chunk delays everything; responding after detection (e.g., killing the job) can make things worse by restarting from scratch
  • Excessive batch scheduler usage places jobs at risk of preemption when cluster load is high (other users starved of batch resources)
  • Moiré load pattern: two or more pipelines run simultaneously, their execution sequences occasionally overlap, simultaneously consuming a shared resource; less common when load arrives more evenly; best observed through shared resource usage
  • Thundering herd problem in pipelines: thousands of workers starting simultaneously, combined with misconfigured or problematic workers, can overwhelm shared cluster services and networking; naive retry logic compounds the problem; adding more workers when a job fails can compound it further
  • Buggy pipelines at scale (10k workers) are always hard on the infrastructure
  • Workflow as Model-View-Controller:
    • Task Master (Model): holds all job states in memory, synchronously journals mutations to persistent disk; can have task groups corresponding to pipeline stages
    • Workers (View): completely stateless and ephemeral; continually update system state transactionally with the master
    • Controller (optional): auxiliary activities — runtime scaling, snapshotting, workcycle state control, rolling back pipeline state
  • Correctness safeguards: config task barriers, mandatory worker leases, unique output naming, mutual validation via server tokens
  • Big data pipelines need to continue processing despite all types of failures
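The mutual-validation safeguard above (worker leases plus server tokens) can be sketched as a toy model (class and method names are mine, not Workflow's actual API):

```python
import uuid

class TaskMaster:
    """Toy Task Master: each lease carries a token, and an output
    commit is accepted only from the worker that still holds a valid
    lease, so stale or zombie workers can't clobber results."""
    def __init__(self, tasks):
        self._pending = set(tasks)
        self._leases = {}   # task -> lease token
        self._results = {}  # task -> committed result

    def lease(self, task):
        if task in self._pending:
            self._pending.discard(task)
            token = uuid.uuid4().hex
            self._leases[task] = token
            return token
        return None  # task already leased or done

    def revoke(self, task):
        # E.g., the lease expired: the task becomes claimable again.
        self._leases.pop(task, None)
        self._pending.add(task)

    def commit(self, task, token, result) -> bool:
        # A worker with a revoked lease (wrong token) is rejected.
        if self._leases.get(task) != token:
            return False
        del self._leases[task]
        self._results[task] = result
        return True
```

The same token check is what makes unique output naming safe: a worker whose lease was reassigned cannot finalize its (now stale) output.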

Chapter 26 — Data Integrity: What You Read Is What You Wrote

  • Data integrity: whatever users think it is; more formally, the measure of accessibility and accuracy of the datastores needed to provide users with adequate service
  • An interface bug causing Gmail to fail to display messages is the same as data being gone — from the user's perspective
  • Every service has independent uptime and data integrity requirements, explicit or implicit
  • Secret to superior data integrity: proactive detection and rapid repair
  • Failure mode dimensions:
    • Scope: narrow/directed vs. widespread
    • Rate: big-bang event vs. creeping (e.g., application logic gradually nulling out or corrupting data over months)
  • Study of 19 data recovery efforts at Google: most common user-visible data loss = deletion or referential integrity loss due to software bugs; hardest cases = low-grade corruption discovered weeks or months later
  • Replication and redundancy are not recoverability — replicas may contain the same corrupted data; media isolation (tapes) protects from media flaws
  • Backups vs. archives: backups can be loaded back into an application; archives safekeep data for auditing, discovery, and compliance (cannot be directly restored to the app)
  • When formulating backup strategy: how quickly must you recover (RTO)? How much recent data can you afford to lose (RPO)?
  • Defense layers:
    • 1st layer — soft/lazy deletion: delay permanent deletion for 15/30/45/60 days; architecture should prevent developers from circumventing it; also use revision history
    • 2nd layer — backups: focus is recovery, not just backup; questions: which methods, how frequently, where stored, how long retained, are they valid, does recovery complete in time, do you have monitoring for recovery state; always rehearse using automation
    • 3rd layer — data validation: bad data propagates; validate high-impact invariants (not super-strict validation that will be abandoned); ability to drill into validation audit logs is essential; out-of-band validation detects creeping data loss
    • Overarching layer — replication: choose a continuously battle-tested popular scheme; not always feasible for every storage instance
  • Cloud environment considerations: mixture of transactional and non-transactional backup solutions means recovered data won't necessarily be correct; services evolving without maintenance windows means different business logic versions may act on data simultaneously
  • General principles: have a beginner's mind (trust but verify, defense in depth); hope is not a strategy (prove recovery works via automation)
  • "Recognize that not just anything can go wrong, but everything will go wrong"

Chapter 27 — Reliable Product Launches at Scale

  • Google has a special team called Launch Coordination Engineers (LCEs) within SRE
    • Audit products for reliability compliance, liaise between teams, drive technical aspects, act as gatekeepers, educate developers on best practices
    • Expected to have strong communication and leadership skills; mediate between disparate parties toward a common goal
    • LCEs are incentivized to prioritize reliability over other concerns
  • A launch is any new code that introduces an externally visible change; up to 70 launches per week measured at Google
  • Advantages of an LCE team: breadth of experience (work across many products, great for knowledge transfer), cross-functional perspective (holistic view, important for complex multi-team/timezone launches), objectivity (nonpartisan advisors between SRE, product devs, PMs, marketing)
  • Launch process criteria: lightweight (easy on devs), robust (catches obvious errors), thorough (addresses important details consistently), scalable (handles both simple and complex launches), adaptable
  • Tactics to achieve these criteria: simplicity (get the basics right, don't plan every eventuality), high-touch approach (experienced engineers customize per launch), fast common paths (identify common launch patterns and provide simplified processes)
  • LCE launch checklist for "launch qualification":
    • Each entry answers a question and provides a concrete, practical, reasonable action item
    • Each question is there to prevent a past mistake; growth of the list is controlled by rigorous review (top leadership reviews); list is reviewed 1–2 times per year to remove obsolete items
    • Infrastructure/tool standardization (Kubernetes, unified logging) simplifies checklists
    • Checklist themes: architecture and dependencies, integration with internal ecosystem, capacity planning, failure modes, client behavior, processes and automation, development process, external dependencies, rollout planning
  • Selected techniques for reliable launches:
    • Gradual and staged rollouts: canary testing, rate-limited signups; almost all updates at Google done gradually
    • Feature flag frameworks: roll out to few servers/users, gradually increase to 1–10%, direct traffic by users/sessions/locations, automatically handle failures in new code paths, independently revert each change, measure user experience impact
    • Server-side client control: ability to force clients to download config from server; important tool against abusive client behavior
    • Overload behavior and load testing: bring the service to its limits; observe how the service AND surrounding services respond
  • Launch reviews (also called Production Reviews) became common practice days to weeks before launch
  • LCE team was Google's solution to achieving safety without impeding change

Part IV: Management

Chapter 28 — Accelerating SREs to On-Call and Beyond

  • Successful SRE teams are built on trust: trusting teammates to know the system, diagnose atypical behavior, reach out for help, and react under pressure

  • There is no single style of education that works best; you need to develop course content specific to your team's systems and culture

  • Recommended training patterns vs anti-patterns:

    • Concrete sequential learning experiences vs menial work and trial-by-fire
    • Encouraging reverse engineering, statistical thinking, first principles vs training strictly through manuals and checklists
    • Celebrating analysis of failure through postmortems vs treating outages as secrets
    • Contained but realistic breakages to fix vs encountering a problem for the first time during live on-call
    • Roleplaying disasters vs creating subject-matter-expert silos
    • Shadowing early in rotation vs pushing into primary before holistic understanding is achieved
  • Training activities should be appropriately paced; any type of structured training is better than random tickets and interrupts

  • Starting point for learning the stack: how does a request enter the system? How is the frontend served? How is load balancing/caching set up? What are typical debugging, escalation, and recovery procedures?

  • On-call learning checklist: lists frontend apps, backend dependencies, SRE experts, developer contacts, and critical knowledge to internalize (clusters, rollback procedures, critical paths); not a playbook — focuses on expert contacts and must-know knowledge

  • Tiered access model: start with read-only access, progress to write access ("powerups" on the route to on-call)

  • Good starter project patterns: make a trivial user-visible feature change end-to-end, add monitoring for a blind spot you found, automate a pain point

  • Five practices for aspiring on-callers:

    1. Read and share postmortems; collect best ones prominently for newbies; use them for Wheel of Misfortune rehearsals; "the most appreciative audience of a postmortem is an engineer who hasn't yet been hired"
    2. Disaster roleplaying (Wheel of Misfortune): 30–60 minute session, primary and secondary attempt root cause, GM can intervene with details to keep it moving
    3. Break real things, fix real things: divert one instance from live traffic, try to break it from a known good configuration, observe how upstream and downstream systems respond
    4. Documentation as apprenticeship: on-call checklist must be internalized before shadowing; establishes system boundaries and what's most important
    5. Shadow on-call early and often: copy alerts to newbie during business hours first; co-author postmortems; use reverse shadowing (senior watches newbie become primary)
  • Some teams conduct final exams before granting full access; on-call is a rite of passage and should be celebrated

Chapter 29 — Dealing with Interrupts

  • Operational load categories:
    • Pages: production alerts requiring immediate response; always have an associated SLO (minutes); managed by dedicated primary on-call
    • Tickets: customer requests requiring action; SLO measured in hours/days/weeks; should not be randomly assigned to team members; processing tickets is a full-time role
    • Ongoing operational activities: flag rollouts, answering support questions, time-sensitive inquiries
  • Metrics for managing interruptions: interrupt SLO / expected response time, number of backlogged interrupts, severity, frequency, number of people available to handle them
  • Most stressed-out on-call engineers are either dealing with pager volume or treating on-call as a constant interrupt — living in a state of constant interruptability is extremely stressful
  • Assign a real cost to context switches: a 20-minute interruption while on a project entails two context switches; realistically results in a loss of a couple hours of truly productive work
  • Polarize time: be clearly in "project mode" or "interrupt mode" — don't try to mix both simultaneously
  • For any interrupt class where volume is too high for one person, add another person
  • On-call principles:
    • Primary should focus only on on-call work; during quiet times, handle tickets or non-critical interrupt work
    • Primary doesn't progress project work; account for this in sprint planning; if there's important project work, don't put that person on-call
    • Secondary may do project work; could support primary in high-pager-volume situations by team agreement
  • Don't spread ticket load across the entire team — it creates context switches for everyone
  • Ticket handoffs should be done the same way as on-call handoffs
  • Regularly examine tickets to identify classes of interrupts with a common or root cause

Chapter 30 — Embedding an SRE to Recover from Operational Overload

  • One way to relieve burden on an overloaded team: temporarily transfer an SRE into the team
    • The embedded SRE focuses on improving practices, not just emptying the ticket queue
    • One SRE transfer usually suffices
  • "More tickets should not require more SREs" — remind teams of this; unless complexity rises, headcount should not scale with ticket volume
  • Identifying kindling (potential crises to address proactively):
    • Subsystems not designed to be self-managing
    • Knowledge gaps within the team
    • Services quietly increasing in importance without being recognized as such
    • Strong dependence on "the next big thing" ("the new architecture will change everything — better not do anything now")
    • Common alerts not diagnosed by either dev or SRE teams
    • Services with complaints but no formal SLO/SLA
    • Services where capacity planning always ends at "add more servers"
    • Postmortems where the only action items are rolling back the specific change
    • Services nobody wants to own (or that devs own one-sidedly)
  • Phases of embedded SRE engagement:
    • Phase 1 — Learn and get context: shadow on-call, understand what prevents the team from improving reliability, identify largest problems and potential emergencies
    • Phase 2 — Share context: write a blameless postmortem for the team; sort fires into toil vs. non-toil
    • Phase 3 — Drive change: start with the basics, e.g. write SLOs if they don't exist; resist the urge to fix the kindling yourself; instead, find work that anyone on the team can accomplish, explain why it's useful, review their code, and repeat; explain your reasoning aloud to build the team's mental models; ask leading questions
    • Final phase — After-action report: a "postvitam" explaining critical decisions at each step that led to success
  • "An SLO is probably the single most important lever for moving a team from reactive ops work to a healthy, long-term SRE focus"
  • Bad apple theory is flawed: systems with multiple interactions make errors inevitable; success requires establishing proper conditions and teaching sound decision-making principles
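The SLO-as-lever point can be made numeric: once a target exists, the team can reason about how much unreliability it has left to "spend" instead of arguing case by case. A minimal sketch of the request-based error-budget arithmetic (function and field names are illustrative, not from the book):

```python
def error_budget_consumed(slo_target, total_requests, failed_requests):
    """Fraction of the error budget used so far.

    With an availability SLO of `slo_target` (e.g. 0.999), the budget
    over `total_requests` is the number of requests allowed to fail:
    total_requests * (1 - slo_target).
    """
    allowed_failures = total_requests * (1.0 - slo_target)
    return failed_requests / allowed_failures

# A 99.9% SLO over 1,000,000 requests allows 1,000 failed requests.
consumed = error_budget_consumed(0.999, 1_000_000, 250)
print(f"{consumed:.0%} of the error budget consumed")  # 25%
```

A value above 1.0 means the budget is blown, which is the signal to slow launches and shift effort toward reliability work.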

Chapter 31 — Communication and Collaboration in SRE

  • Production meetings: articulate the state of services, boost org awareness; typical agenda covers upcoming production changes, performance metrics, past outages, paging events, issues requiring attention
  • SRE Tech Lead role: code review, quarterly presentations outlining team strategy, facilitating consensus-building; provides direction for the team
  • Tech lead vs. manager distinction: tech lead handles most technical management; manager adds performance evaluation and broader organizational responsibilities beyond technical oversight
  • Clear communication is an operational skill; good meetings need ownership, purpose, and output
  • Documentation and status writing are part of the job, not peripheral chores; a technically strong team can still be operationally weak if it communicates poorly

Chapter 32 — The Evolving SRE Engagement Model

  • Production Readiness Review (PRR) phases:

    1. Engagement: teams discuss SLOs/SLAs and plan modifications to enhance dependability
    2. Analysis: service evaluated against production standards and industry practices
    3. Improvements and refactoring: improvements prioritized and negotiated between dev and SRE
    4. Training: staff gain knowledge of system architecture and operational procedures
    5. Onboarding: progressively transfers responsibilities and ownership of various production aspects
    6. Continuous improvement: maintaining established reliability standards over time
  • PRR helps teams identify what reliability measures a specific service needs, based on its unique characteristics

  • Early engagement: bringing SRE participation earlier in development allows assessment of business importance and whether a service's scale justifies deep SRE involvement

  • Sustainable SRE-driven development: codified best practices, reusable components, standardized platforms, and automated systems that enable smarter infrastructure management

  • SRE capacity is finite; different products have different risk profiles, maturity levels, and engineering cultures; a one-size-fits-all engagement model either wastes scarce expertise or spreads it too thin

  • Engagement models should make it obvious what a team must do to earn deeper SRE support or graduate from it


Part V: Conclusions

Chapter 33 — Lessons Learned from Other Industries

  • Four core SRE concepts that parallel mature safety-critical industries: preparedness and disaster training, postmortem culture, automation and reduced operational overhead, structured and rational decision-making
  • Cross-industry practices relevant to SRE:
    • Unwavering organizational commitment to safety protocols
    • Meticulous attention to operational details
    • Maintaining excess capacity for contingencies
    • Regular simulation exercises and hands-on drills
    • Comprehensive staff development and credentialing programs
    • Thorough upfront investigation of system specifications and architectural planning
    • Layered protective measures against failures
  • Software culture is often too eager to believe it invented operational seriousness; fields like aviation, medicine, and manufacturing have been managing risk, human factors, and procedure for much longer
  • The best reliability thinking is interdisciplinary; steal aggressively from mature safety disciplines rather than reinventing operational ideas within the boundaries of tech culture

Chapter 34 — Conclusion

  • Reliability is not a collection of tricks; it is a way of making trade-offs visible, encoding operational knowledge in systems, and building organizations that can change safely
  • Error budgets, toil reduction, actionable alerting, blameless postmortems, graceful overload handling, and sustainable on-call are operating principles, not trends
  • The book rejects two bad extremes: the fantasy that reliability can be reduced to process theater alone, and the fantasy that brilliant engineers can improvise their way through production forever
  • The answer is engineering discipline, measurement, sane incentives, and a bias toward simplicity