João Pereira
Tags: sre, reliability, operations

Notes on Site Reliability Engineering

Site Reliability Engineering: How Google Runs Production Systems, by Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Richard Murphy

Part I: Introduction

Chapter 1 — Introduction

  • SRE is what happens when you ask a software engineer to design an operations team
  • Development teams want to launch features; ops teams want stability — because most outages come from changes, these goals are fundamentally in tension; SRE is the resolution
  • Google places a 50% cap on aggregate "ops" work (tickets, on-call, manual tasks) for all SREs; it's an upper bound, not a target
    • Excess ops work is redirected back to product teams until load drops below 50%
    • Time spent on ops is tracked; remaining time goes to project/engineering work
  • SREs should receive at most two events per 8–12 hour on-call shift, on average; consistently fewer than one per shift wastes the on-caller's time
  • DevOps vs SRE: DevOps is a generalization of SRE principles to a wider range of organizations; SRE is a specific implementation of DevOps with some idiosyncratic extensions
  • SRE team is responsible for: availability, latency, performance, efficiency, change management, monitoring, emergency response, capacity planning
  • Postmortems should be written for all significant incidents regardless of whether a page was triggered
    • Postmortem goal: document what happened, find all contributing causes, assign corrective actions
    • Google operates under a blame-free postmortem culture
  • 100% is the wrong reliability target for basically everything; no user can tell the difference between 100% and 99.999%
    • Once an availability target is set, the remaining tolerance is the error budget
    • Error budget can be spent on anything — features, experiments, rollouts
    • Error budget removes the structural conflict of incentives between dev and SRE; outages become expected parts of the innovation process, not catastrophes
  • Reliability is a function of MTTF and MTTR; Google observed 3x improvement in MTTR when playbooks were introduced vs improvisation
  • ~70% of outages are caused by changes to a live system; best practices: progressive rollouts, fast detection, safe rollbacks
  • Three valid outputs of a monitoring system: alerts (immediate human action needed), tickets (eventual action needed), logs (no action needed)
  • Capacity planning requires accurate organic demand forecasts, incorporation of inorganic demand, and regular load testing
  • A slowdown in a service equates to a loss of capacity; provision to meet a capacity target at a specific response speed
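The MTTF/MTTR relationship can be made concrete with the steady-state availability formula (a textbook model; the numbers below are illustrative, not Google's):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# The playbook effect: cutting MTTR ~3x at a fixed MTTF raises availability.
before = availability(mttf_hours=1000, mttr_hours=3)
after = availability(mttf_hours=1000, mttr_hours=1)
```

This is why playbooks pay off: MTTF is hard to move, but MTTR is largely a function of how prepared the responder is.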

Chapter 2 — The Production Environment at Google, from the Viewpoint of an SRE

  • DC topology: machines → racks → rows → clusters → datacenter → campus
  • Homogeneous environments, common deployment patterns, shared storage, and shared scheduling are what make reliability possible at scale
  • Every service depends on other services; those dependencies define the failure modes
  • Platform standardization is reliability work — it is one of the main ways SRE scales impact without scaling headcount

Part II: Principles

Chapter 3 — Embracing Risk

  • Cost of reliability has two dimensions: cost of redundant resources, and opportunity cost of engineering time spent on risk-reduction instead of features
  • SREs see risk as a continuum; a target is both a minimum and a maximum — you want to exceed it, but not by much
  • Time-based availability: uptime / (uptime + downtime); e.g., 99.99% target = max ~52 minutes downtime per year
  • Aggregate availability: successful requests / total requests; e.g., 2.5M reqs/day at 99.99% target = max 250 errors/day
  • Factors to consider when assessing a service's risk tolerance: required availability level, impact of different failure types, cost of the service, other relevant metrics (latency, etc.)
  • Things to consider when setting an availability target: what will users expect, is the service tied to revenue, is it paid or free, what do competitors offer, is it consumer or enterprise?
  • Cost calculation example: improving 99.9% → 99.99% on a $1M/year service has a value of ~$900; only worthwhile if the improvement costs less
  • Error budget formation: product management defines an SLO; actual uptime is measured by monitoring; the gap is the error budget
    • As long as budget remains, new releases can be made
    • Error budget can block deployments temporarily to pressure reliability focus
    • SREs must have authority to stop launches if budget runs out
    • Sometimes an SLO has to be loosened to allow more innovation
  • Typical factors causing dev/SRE tension: software fault tolerance, testing depth, push frequency, canary duration and size — error budgets make this balance data-driven instead of political
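The availability arithmetic above is easy to sketch; the `release_allowed` gate is an illustrative stand-in for an error-budget policy, not Google's actual tooling:

```python
def downtime_budget_minutes(target: float, window_days: int = 365) -> float:
    """Maximum tolerable downtime for a time-based availability target."""
    return window_days * 24 * 60 * (1 - target)

def error_budget_requests(target: float, total_requests: int) -> int:
    """Maximum tolerable failed requests for a request-based target."""
    return round(total_requests * (1 - target))

def release_allowed(measured_availability: float, slo: float) -> bool:
    """Illustrative error-budget gate: ship only while budget remains."""
    return measured_availability >= slo

# The chapter's examples: ~52 min/year at 99.99%, 250 errors/day at 2.5M req/day.
yearly_minutes = downtime_budget_minutes(0.9999)
daily_errors = error_budget_requests(0.9999, 2_500_000)
```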

Chapter 4 — Service Level Objectives

  • SLI (Service Level Indicator): quantitative measure of some aspect of service level (latency, error rate, throughput, availability, durability)
  • SLO (Service Level Objective): target value or range for an SLI; structure is SLI ≤ target or lower bound ≤ SLI ≤ upper bound
    • Without explicit SLOs, users form their own beliefs about performance — leading to over-reliance or under-reliance
  • SLA (Service Level Agreement): explicit or implicit contract with consequences for missing SLOs; easy test — "what happens if the SLO isn't met?" — if nothing, it's an SLO not an SLA
  • SLIs by system type:
    • User-facing: availability, latency, throughput
    • Storage: latency, availability, durability
    • Big data: throughput, end-to-end latency
    • All systems: correctness
  • Most metrics are better thought of as distributions, not averages
    • 99th percentile shows plausible worst-case; 50th percentile shows typical case
    • High variance in response times affects user experience disproportionately; some teams focus only on high percentile values
    • Metrics averaged per minute can hide bursts
  • Collect client-side metrics when possible; not measuring at the client misses problems that don't show up server-side
  • Chubby example: Chubby was so reliable that teams stopped designing for its absence; solution was to take it down deliberately when it was too far above its SLO for the quarter
  • Tips for choosing SLO targets:
    • Don't pick a target based on current performance — you might be supporting a system that requires heroic effort
    • Keep it simple; complicated aggregations obscure changes
    • Avoid absolutes (YAGNI)
    • Have as few SLOs as possible; defend each one you pick
    • Perfection can wait — start loose and tighten, not the other way around
  • Keep a safety margin: use internal SLOs (stricter) and external SLOs (looser); don't advertise your internal target externally
  • Don't overachieve: users become dependent on over-performing services; deliberately throttle or take the system offline occasionally to avoid over-reliance
  • SLOs should specify how they're measured and conditions under which they're valid, e.g., "99% of GET RPC calls will complete in < 100ms averaged over 1 minute across all backend servers"
  • Error budget is effectively an SLO for meeting other SLOs; track it daily/weekly; upper management typically looks at monthly or quarterly
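The distribution-over-average point can be demonstrated with a nearest-rank percentile over a made-up latency sample (values are invented):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(p/100 * N) via negation
    return ordered[int(rank) - 1]

latencies_ms = [12, 14, 15, 15, 16, 18, 22, 25, 480, 950]
typical = percentile(latencies_ms, 50)        # what most users see: 16 ms
tail = percentile(latencies_ms, 99)           # plausible worst case: 950 ms
mean = sum(latencies_ms) / len(latencies_ms)  # 156.7 ms, skewed by two outliers
```

Note how the mean (156.7 ms) describes no actual user's experience, while p50 and p99 describe the typical and worst plausible cases respectively.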

Chapter 5 — Eliminating Toil

  • Toil is work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth
    • Overhead (admin work not tied directly to running a service) is different from toil
    • If human judgment is essential, it might not be toil
    • If the service remains in the same state after you finish the task, it was probably toil
  • Typical SRE activities: software engineering, systems engineering, toil, overhead
  • Top sources of toil: non-urgent service-related messages/email, on-call response, releasing
  • Every SRE needs to spend at least 50% of their time on engineering work (when averaged over a few quarters — toil tends to be spiky)
  • If individual SREs report excessive toil, it's a signal for managers to redistribute load more evenly and help those SREs find engineering projects
  • Toil in small amounts can be tolerable; some people gravitate toward it
  • Toil becomes toxic in large quantities — leads to career stagnation, burnout, boredom, sets a precedent for loading more toil onto SREs, promotes attrition, and causes breach of faith with new hires promised project work

Chapter 6 — Monitoring Distributed Systems

  • Monitoring: collecting, processing, aggregating, and displaying real-time quantitative data about a system
  • White-box monitoring: based on metrics exposed by internals (logs, JVM profiling, HTTP handlers); Black-box monitoring: testing externally visible behavior as a user would see it
  • Why monitor: analyze long-term trends, compare over time or experiment groups, alerting, build dashboards, conduct ad hoc retrospective analysis
  • Never trigger an alert simply because "something seems a bit weird" (security auditing on very narrow scopes is an exception)
  • When pages occur too frequently, engineers second-guess, skim, or ignore alerts — including real ones masked by noise
  • Avoid "magic" systems that try to learn thresholds or automatically detect causality (rules detecting unexpected changes in end-user request rates are a valid counter-example)
  • Complex dependency hierarchies ("if DB is slow, alert for DB; otherwise alert for website") only work for very stable system parts
  • Four golden signals:
    • Latency: time to service a request; distinguish latency of successful requests vs. failed ones; increases in latency are an early indicator of saturation
    • Traffic: how much demand is being placed on the system (requests/s, broken out by request type)
    • Errors: explicit (500s), implicit (200 with wrong content), or by policy (response over 1s = error if you've committed to 1s)
    • Saturation: how "full" the service is; emphasizes the most constrained resource; many systems degrade before 100% utilization so having a utilization target is essential; also covers predictions of impending saturation ("DB will fill in 4 hours")
  • If you measure all four golden signals and page when one is problematic, your service will be at least decently covered
  • For tail latency: collect request counts bucketed by latencies (histograms) rather than raw latencies
  • Alerting rules for humans should be simple, predictable, reliable, and represent a clear failure
  • Questions to ask before creating an alert:
    • Does this detect an otherwise-undetected condition that is urgent, actionable, and user-visible?
    • Will I ever be able to ignore this alert knowing it's benign?
    • Can I take action in response? Is that action urgent, or can it wait until morning?
    • Are other people already being paged for this?
  • On pages: every page should be actionable; every page response should require intelligence; pages with rote algorithmic responses are a red flag; pages should be about novel problems
  • Spend more effort catching symptoms than causes; only worry about definite, imminent causes
  • In Google's experience: simple collection + aggregation + alerting + dashboards works well; add complexity only when needed
  • Periodic reviews of page frequency are done with management via quarterly reports (target: a couple of pages per shift)
  • Often, sheer heroic effort can achieve high availability short-term — but a controlled short-term hit is usually a better long-run trade than sustained burnout
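The histogram advice from the tail-latency bullet can be sketched like this; bucket boundaries and function names are arbitrary choices, not Borgmon's:

```python
import bisect

# Exponential bucket upper bounds in milliseconds (an illustrative choice).
BOUNDS = [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000]

def bucket_counts(latencies_ms):
    """Count requests per latency bucket instead of storing raw samples."""
    counts = [0] * (len(BOUNDS) + 1)  # final bucket is the +Inf overflow
    for ms in latencies_ms:
        counts[bisect.bisect_left(BOUNDS, ms)] += 1
    return counts

def approx_percentile(counts, p):
    """Upper bound of the bucket containing the p-th percentile."""
    target = sum(counts) * p / 100
    seen = 0
    for i, c in enumerate(counts):
        seen += c
        if seen >= target:
            return BOUNDS[i] if i < len(BOUNDS) else float("inf")
    return float("inf")
```

Counts per bucket are cheap to collect, aggregate across servers, and still recover percentiles to bucket precision, which is all alerting needs.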

Chapter 7 — The Evolution of Automation at Google

  • "Automation is a force multiplier, not a panacea"

  • Automation is meta-software: software to act on software

  • Doing automation thoughtlessly can create as many problems as it solves

  • Value of automation: consistency (very few humans act with equal consistency every time), platform extensibility, reduced MTTR, faster non-repair actions (failovers, traffic switching), time savings — decoupling the operator from the operation is powerful

  • Hierarchy of automation maturity:

    1. No automation
    2. Externally maintained, system-specific automation (script in an SRE's home folder)
    3. Externally maintained, generic automation (documented for the team)
    4. Internally maintained, system-specific automation (versioned to the system's repo)
    5. Systems that don't need any automation (the goal)
  • Infrequently run automation is fragile

  • Relieving teams from ops responsibility can remove their incentive to reduce tech debt; product managers not affected by low-quality automation will always prioritize new features

  • Automation failure risk: when automation covers more and more daily activities, human operators lose direct contact with the system; when automation fails, humans may no longer be able to operate it — this is unavoidable in sufficiently autonomous systems, and must be accounted for

  • Case study (Ads Database): failovers automated, outage no longer paged a human; total operational maintenance cost dropped ~95%; up to 60% of hardware utilization freed

  • Case study (Cluster turnups): early automation was an initial win, but free-form scripts became technical debt; Prodtest (Python unit test framework extended for real-world services) created a chain of tests that could validate a service's configuration across all clusters

Chapter 8 — Release Engineering

  • Release engineering is a distinct discipline: release engineers work with SWEs and SREs to define how software is released, from version control through compilation, testing, packaging, and deployment
  • High velocity models: some teams do hourly builds with deploy based on test results; others use "push on green" (every build that passes tests goes to production)
  • Hermetic builds: building the same revision always produces identical results; self-contained including the compiler version; allows cherry-picking fixes against old revisions to fix production software
  • All code lives in the main branch; releases are branched off; fixes flow from main to the release branch via cherry-pick; branches never merge back
  • CI creates an audit trail: tests ran, tests passed
  • Config management: deceptively simple, a major source of instability; all schemes should involve source control and strict review
    • Option: use mainline for config (decouples binary releases from config changes)
    • Option: include config files in the same package as the binary (simple, but tightly coupled)
    • Option: package config separately using the same hermetic principle as code
  • Six gated operations requiring approval: source code change, release process action definitions, new release creation, integration proposal approval, release deployment, build configuration modifications
  • Packages are named, versioned with a unique hash, and signed
  • Budget for release engineering resources at the beginning of the product development cycle — it's cheaper to do it now than later
  • Common questions every team needs to answer: how to handle package versioning, CI or CD, release cadence, config management policies, release metrics

Chapter 9 — Simplicity

  • "At the end of the day, our job is to keep agility and stability in balance in the system"
  • "The price of reliability is the pursuit of the utmost simplicity"
  • "Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code"
  • Reliable systems can increase agility: reliable rollouts make it easier to link changes to bugs
  • Essential vs accidental complexity: SREs should push back when accidental complexity is introduced
  • Code is a liability, not an asset; remove dead code and other bloat
  • Commented-out code is an anti-pattern; forever-gated feature flags are an anti-pattern (flags should be actively rehearsed and removed)
  • Smaller APIs are easier to test and more reliable; avoid misc/util classes
  • Small releases are easier to debug and measure; you can't tell what happened if 100 changes were released together
  • No monitoring: you're blind; SREs don't go on-call for the sake of it, they do it to stay in touch with how systems work and fail
  • Managing incidents effectively reduces impact and limits outage-induced anxiety; blameless postmortems are the first step to understanding what went wrong

Part III: Practices

Chapter 10 — Practical Alerting

  • Monitoring a very large distributed system presents challenges: vast number of components, need for low maintenance burden
  • Borgmon (Google's internal monitoring system, conceptually similar to Prometheus): a programmable calculator with syntactic sugar for generating alerts using a common data exposition format
  • Time-series data: conceptually a 2D array with time on one axis and items on the other; each series named by a unique set of labels (name=value)
    • Data points are (timestamp, value) stored in chronological lists
    • Data stored in-memory, checkpointed to disk; fixed-size allocation; oldest entries GC'd when full
  • Counters: monotonically non-decreasing variables (km driven, request count); preferred over gauges because they don't lose meaning when events occur between sampling intervals
  • Gauges: any value, doesn't have to be monotonically shifting (fuel remaining, current speed)
  • Labels serve three purposes: define breakdowns of the data itself, define the source of the data (service name, container), indicate locality or aggregation (zone, shard)
  • Alertmanager: can inhibit certain alerts when others are active, deduplicate alerts from multiple instances with same labelsets, fan-in or fan-out alerts based on labelsets
  • White-box monitoring (Borgmon/Prometheus): sees system internals; Black-box monitoring (Prober): looks at system from outside, monitors what the user sees
  • Page-worthy alerts go to on-call rotation; non-page-worthy alerts go to a separate processing queue or as informational data — this distinction is highlighted repeatedly as an Important Detail
  • Rules generating alerts for humans should be simple, represent clear failures, and require intelligence to respond to
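The counters-over-gauges argument can be sketched as a simplified rate over sampled counter values; real Borgmon/Prometheus rate functions are more careful, and the reset handling here is a naive assumption:

```python
def rate_per_second(samples):
    """Per-second rate from (timestamp, counter_value) samples.

    Counters only increase, so the rate over a window stays meaningful
    even when events happen between sampling intervals. A decrease means
    the process restarted; naively assume it restarted from zero.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        increase += (v1 - v0) if v1 >= v0 else v1  # reset: restarted near 0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# A requests counter sampled every 10 s, with a process restart after t=20:
samples = [(0, 100), (10, 160), (20, 220), (30, 10)]
```

A gauge sampled at the same points would have silently lost everything that happened between samples; the counter preserves it.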

Chapter 11 — Being On-Call

  • On-call = being available to step in, reacting within a specific time bound (minutes or hours depending on SLA)
  • Typical response times: 5 minutes for user-facing or time-critical tasks; 30 minutes for less time-sensitive
  • When a page arrives: acknowledge, triage, escalate if necessary
  • Non-paging events are less urgent but on-call engineers should vet them during business hours
  • Primary and secondary roles: primary handles pages; secondary is the fall-through when the primary misses a page, and also handles non-paging events and escalation support
  • Balancing on-call quantity: SREs should spend ≥50% of time on engineering; of remaining time, no more than 25% on on-call
  • Balancing on-call quality: incident handling (root cause, remediation, postmortem, bug fix) takes ~6 hours; therefore max ~2 incidents per 12-hour shift; strive for a very flat distribution of pages with median 0
  • Night shifts degrade health; multi-site teams eliminate night shifts by following the sun; caveat: significant communication and handoff overhead
  • SRE managers must keep quantity and quality balanced
  • Most important on-call resources: clear escalation paths, well-defined incident management procedures, blameless postmortem culture
  • All paging events should be actionable; silencing noisy non-actionable alerts reduces fatigue
  • Strive for a 1:1 alert-to-incident ratio; multiple alerts firing for the same incident are noise
  • If too many pages occur, give the pager to the developers owning those services and work with them until alert quality returns to standard; feature development halts until this is resolved
  • Compensation: extra pay or time-off for on-call shifts; capped at a proportion of salary to incentivize involvement while limiting burnout
  • Operational underload is also a problem: teams should be sized so every engineer is on-call once or twice a quarter, to stay in touch with production
  • Wheel of Misfortune helps hone SRE capabilities
  • Two modes of thinking under pressure: intuitive/rapid action vs rational/deliberate cognitive function; the latter leads to better outcomes during incidents
  • SRE teams can be loaned to overloaded teams temporarily; measure overload symptoms (daily tickets, paging events per shift) explicitly

Chapter 12 — Effective Troubleshooting

  • Troubleshooting is the application of the hypothetico-deductive method: iterate hypotheses until one holds
  • Troubleshooting is learnable; expertise comes from investigating failures, not just understanding normal operation
  • Ideally a problem report gives the top-level symptom; start drilling down into telemetry and logs, narrow down culprits, exclude parts of the system (bisection is a useful tactic), identify contributing factors
  • Two ways to test hypotheses: compare observed state against theory to find confirming/disconfirming evidence, or treat the system (change something in a controlled way and observe)
  • Common troubleshooting pitfalls:
    • Looking at irrelevant symptoms (wild goose chases)
    • Misunderstanding system dynamics (inputs, behavior, outputs)
    • Coming up with wildly improbable theories
    • Hunting down spurious correlations and coincidences
    • Confusing correlation with causation
  • Always prefer simple explanations; the four golden signals are useful scaffolding for building simple explanations
  • An effective problem report contains: expected behavior, reproduction steps, consistent form, and exists somewhere searchable
  • Stop the bleeding first — make the software work before investigating root cause; preserve earlier evidence of the incident for later
  • Structured logs are important for retrospective analysis; pass trace IDs using a common standard through all layers
  • Design systems with well-understood and observable interfaces between components; observability-driven engineering makes troubleshooting sessions dramatically shorter
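The bisection tactic amounts to a binary search over the ordered change history, assuming health flips exactly once; `is_healthy` stands in for whatever probe you have:

```python
def first_bad_change(changes, is_healthy):
    """Binary-search an ordered change list for the first change after
    which the system is unhealthy (assumes health flips exactly once)."""
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(changes[mid]):
            lo = mid + 1  # failure was introduced later
        else:
            hi = mid      # this change or an earlier one broke it
    return changes[lo]

# Hypothetical: system healthy through change 6, broken from change 7 onward.
culprit = first_bad_change(list(range(1, 11)), lambda c: c < 7)
```

Each probe halves the suspect set, so even hundreds of candidate changes need only a handful of checks.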

Chapter 13 — Emergency Response

  • Don't panic: you're not alone, the sky is not falling, nobody is dying; if you feel overwhelmed, pull in more people (sometimes everyone has to be paged)
  • Follow an incident response process; not following it is itself a contributing cause of incidents
  • Three types of emergencies:
    • Test-induced: planned, proactive ways to break production; failures are controlled and aborted when things go wrong
    • Change-induced: incident stems from deployment or configuration changes
    • Process-induced: incident caused by a process (usually automated) that wreaks havoc (e.g., automation that wipes hard drives)
  • Lessons from test-induced emergency example:
    • Nobody really understood how two systems interacted — review hadn't been good enough despite many eyes
    • Incident response process was not followed, which prevented wider awareness
    • Rollback procedures had not been rehearsed in a test environment — they were broken
    • Now: rollback procedures are tested before any large-scale test
  • "All problems have solutions — a solution exists, even if it's not obvious, especially to the person whose pager is screaming"
  • Involve the person whose actions triggered the incident; they know the most and change-induced emergencies are typically fixed faster with them involved
  • Keep a history of outages: ask hard questions, look for strategic (not just tactical) preventive actions, publish postmortems somewhere everyone can read them, hold people accountable to follow-up actions
  • Until a system has failed, you don't know how it, its upstream systems, or its users will react — don't assume

Chapter 14 — Managing Incidents

  • Recursive separation of responsibilities: delegate distinct roles with clear boundaries
    • Incident Commander: structures the response task force, assigns responsibilities, holds all undelegated roles; most important task is maintaining a living incident document
    • Ops Lead: works with incident command; only person modifying the system during the incident
    • Communications Lead: public face of the task force; provides periodic updates to the team and stakeholders
    • Planning Lead: handles longer-term issues — bug filing, arranging dinner, tracking handoffs, recording how the system has diverged from normal so it can be reverted
  • A single war room (physical or virtual) is recommended; incident command handoffs must be done loudly and explicitly with explicit acknowledgment from all participants
  • When to declare an incident (declare early rather than late): do you need a second team? Is the outage customer-visible? Has the issue gone unsolved after an hour of concentrated effort?
  • Incident management proficiency degrades when not in regular use
  • Best practices:
    • Prioritize: stop the bleeding, restore service, preserve evidence for the postmortem
    • Prepare: develop and document procedures in advance with incident participants
    • Trust: give full autonomy within each assigned role
    • Introspect: if you feel panicky or overwhelmed, get more support
    • Consider alternatives: periodically re-evaluate whether the current approach still makes sense
    • Practice: use the process routinely so it becomes second nature
    • Rotate roles: encourage every team member to gain familiarity with each role

Chapter 15 — Postmortem Culture: Learning from Failure

  • "The cost of failure is education"
  • Postmortem definition: a written record of an incident, its impact, the actions taken to mitigate or resolve it, the contributing causes, and follow-up actions to prevent recurrence
  • Primary goals: document the incident, understand all contributing causes, take preventive actions to reduce likelihood and/or impact of recurrence
  • Writing a postmortem is not a punishment; it's a learning opportunity; any stakeholder may request one
  • Blamelessness:
    • Must not indict any individual or team for bad or inappropriate behavior
    • Assumes everyone had good intentions and did the right thing with the information they had at the time
    • When done well, leads to investigating why individuals had incomplete/incorrect information
    • When done badly, leads to finger-pointing and shaming — and, critically, to people hiding information in future incidents
  • Postmortem review criteria used at Google: was key incident data collected? Are impact assessments complete? Is the action plan appropriate? Are resulting bug fixes at appropriate priority? Did we share the outcome with relevant stakeholders?
  • An unreviewed postmortem might as well never have existed
  • Tools for introducing postmortem culture: postmortem of the month newsletter, postmortem reading clubs (regular sessions where impactful postmortems are read aloud), Wheel of Misfortune (re-enact a previous postmortem with the original incident commander present)
  • Make writing effective postmortems a rewarded and celebrated practice; even senior leadership should acknowledge and participate (book mentions Larry Page talking about the value of postmortems)
  • Ask for feedback on effectiveness: is the culture supporting your work? Does writing one entail too much toil? What tools would you like to see?

Chapter 16 — Tracking Outages

  • Postmortems provide useful insights for individual services but can miss opportunities with small per-service impact but large horizontal impact
  • The Escalator: Google's in-house PagerDuty equivalent; a centralized system that tracks whether alerts are acknowledged and escalates to the next responder when they aren't
  • The Outalator: time-interleaved view of notifications for multiple queues at once; allows annotating incidents, marking annotations as important, silently saving email replies, and combining multiple escalating notifications into a single incident entity
  • A single event often triggers multiple alerts; the ability to group multiple alerts into a single incident is critical
  • Track outages with consistent definitions, user impact, duration, and cause categories; this makes reliability visible enough to influence prioritization

Chapter 17 — Testing for Reliability

  • "If you haven't tried it, assume it's broken"
  • Confidence comes from both past reliability and future reliability; for future predictions to hold, either the system remains completely unchanged or you can confidently describe all changes
  • Passing tests doesn't prove reliability; failing tests generally prove its absence
  • Zero MTTR: when a system-level test detects exactly the same problem monitoring would detect, the bug can be repaired by blocking the push; that is quick and convenient, so the effective MTTR is zero
  • The more bugs caught pre-production (zero MTTR), the higher the MTBF
  • Test types:
    • Unit tests: smallest/simplest; assess a single unit of software (class, function) independent of the larger system
    • Integration tests: assembled component verification; use dependency injection and mocks to test components in isolation
    • System tests: largest scale for undeployed systems; end-to-end functionality
      • Smoke tests: simple but critical behavior; short-circuit additional expensive testing
      • Performance tests: check performance stays acceptable over the lifecycle
      • Regression tests: prevent known bugs from sneaking back; gallery of rogue bugs
    • Stress tests: find the limits of a web service
    • Canary tests: a subset of servers upgraded to a new version/config and left in incubation; not really a test, it's structured user acceptance
  • With an exponential rollout, canary stages don't need equal-sized fractions of user traffic to give a fair signal
  • CI/CD: works best when engineers are notified the moment the build pipeline fails; unblocking the pipeline should always be the first priority
  • Config files that change more often than once per application release are a major reliability risk if those changes aren't treated the same as application releases
  • Config file contents are potentially hostile to the interpreter reading them — a potential security threat vector
  • Disaster recovery tools should work "offline" using checkpoint states; they're expected to work with instant consistency, not eventual consistency
  • Statistical techniques like fuzzing and chaos testing aren't necessarily repeatable; improve repeatability using seeded random number generators
  • Key element of site reliability: find each anticipated form of misbehavior and make sure some test reports it
  • SRE tools need to be tested too (tools that retrieve/propagate metrics, predict usage, plan capacity)
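The seeded-RNG point about repeatable fuzzing, as a sketch: log only the seed, and any failing input can be regenerated exactly (the `parse` target and case shape are placeholders):

```python
import random

def fuzz_case(seed: int) -> bytes:
    """Deterministically derive a fuzz input from a seed, so a failing
    case can be replayed exactly by logging only the seed."""
    rng = random.Random(seed)
    size = rng.randint(0, 64)
    return bytes(rng.randrange(256) for _ in range(size))

def run_fuzz(parse, n_cases=1000, start_seed=0):
    """Return the seeds whose generated inputs made `parse` raise."""
    failures = []
    for seed in range(start_seed, start_seed + n_cases):
        try:
            parse(fuzz_case(seed))
        except Exception:
            failures.append(seed)
    return failures
```

The run stays statistical (different seed ranges explore different inputs) but any individual failure becomes a repeatable regression test.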

Chapter 18 — Software Engineering in SRE

  • Growth rate of SRE-supported services exceeds the growth rate of the SRE organization; one SRE guiding principle is that team size should not scale directly with service growth
  • Team diversity is critical: a mix of traditional software development and systems engineering backgrounds helps prevent blind spots
  • Intent-based capacity planning: specify requirements (intent), not implementation; encode them, autogenerate the allocation plan
    • Ladder of increasingly intent-based planning:

      1. "I want 50 cores in clusters X, Y, Z" — why those?
      2. "I want 50 cores in any 3 clusters in region" — why 50, why 3?
      3. "I want to meet demand with N+2 redundancy" — why N+2?
      4. "I want 5 nines of reliability" — could find N+2 insufficient
    • Greatest gains from going to level 3; some sophisticated services go to level 4

  • Auxon case study (Google's intent-based capacity planning tool):
    • Built by an SRE who was managing capacity in spreadsheets, then formalized into a full product with backlog, SLA, team ownership
    • Inputs: requirements/intent, performance data, demand forecasts, resource supply and pricing
    • Uses a mixed-integer or linear programming solver
    • Key learnings:
      • Don't focus on perfection; launch and iterate
      • A single email doesn't drive adoption; it takes a consistent approach, user advocacy, and senior sponsorship
      • Small releases build confidence; don't over-customize for a few big users; don't chase a 100% adoption rate
      • A "seed team" of generalists plus deep-expertise engineers works well
  • SRE software must be designed for scalability, graceful degradation on failure, and easy integration with other infrastructure
  • Good candidate SRE projects: reduce toil, improve existing infrastructure, streamline a complex process, and must fit org-wide objectives
  • SREs who build products should continue working as SREs rather than becoming embedded developers — they dogfood the tools and bring an invaluable operational perspective
  • Guidelines for building SRE software: create a clear message and communicate benefits (SREs are skeptical), evaluate org capabilities, launch and iterate to establish credibility, hold yourself to the same standards as a product team

Chapter 19 — Load Balancing at the Frontend

  • DNS is typically the first layer of load balancing; conceptually simple but many dragons exist
    • Very little control over client behavior; records selected randomly
    • DNS server acts as a caching layer: recursive resolution makes it difficult to find the optimal IP for a given user; responses are cached with TTL
    • DNS alone is insufficient; not the right solution for fine-grained control
  • Better approach: DNS combined with Virtual IP (VIP) addresses
    • Network Load Balancer sits in front; receives packets and routes to available backends
    • Consistent hashing: mapping algorithm that remains stable when backends are added or removed, minimizing disruption to existing connections
    • Simple connection tracking as default; fall back to consistent hashing under pressure
  • Packet forwarding strategies:
    • NAT: requires the balancer to track every connection, which precludes a completely stateless fallback mechanism
    • Direct Server Response (DSR) (layer 2 modification): all LBs and backends must be reachable at the data link layer; Google stopped using this
    • Packet encapsulation (GRE): Google started using this; introduces overhead (~24 bytes for IPv4+GRE) that can exceed MTU and require fragmentation
  • Load balancer should always prefer redirecting to the least loaded backend
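The consistent-hashing idea above can be sketched as a small hash ring (illustrative Python; the class name and the `vnodes` parameter are mine, not the book's):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash; any uniform hash function would do here.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Minimal hash ring with virtual nodes: removing one backend only
    remaps the connections that hashed to that backend."""
    def __init__(self, backends, vnodes=100):
        self._ring = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def backend_for(self, conn_key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping around).
        idx = bisect.bisect(self._keys, _hash(conn_key)) % len(self._ring)
        return self._ring[idx][1]
```

Dropping a backend leaves every key that was mapped to a surviving backend on the same backend, which is exactly the "minimal disruption to existing connections" property the notes describe.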

Chapter 20 — Load Balancing in the Datacenter

  • Backend service: typically 100–1000 processes; ideal goal is perfectly distributed load
  • Lame duck state: backend task is listening on its port and can serve, but explicitly asks clients to stop sending new requests; broadcasts this state to all active clients
    • Main advantage: simplifies clean shutdown; avoids serving errors to requests active on shutting-down tasks
    • Shutdown sequence: scheduler sends SIGTERM → task enters lame duck → clients redirect new requests elsewhere → ongoing requests complete → task exits cleanly (or is killed)
  • Traffic sinkholing: an unhealthy task that returns errors with very low latency appears least loaded to clients, so a client ends up directing very large amounts of traffic to it
    • Fix: tune load balancer to count recent errors as active requests
  • If outgoing request latency grows (e.g., competition for network resources from a noisy neighbor), active request count also grows, increasing memory pressure, which can in turn trigger GC in managed runtimes
  • When a task restarts, it often requires significantly more resources for a few minutes (initialization cost, cold cache, JIT warmup)
  • Subsetting: clients interact with a limited subset of backends (typically 20–100); reduces connection overhead while maintaining health checking
  • Subset selection algorithms: random (bad utilization) → round-robin (permuted order) → deterministic subsetting (each backend assigned to exactly one client per round)
  • Load balancing policies:
    • Round-robin: 2x difference observed between most and least loaded in practice
    • Least-loaded round-robin: rounds among least-loaded; load measured by connection count; still suboptimal since it's per-client, not global
    • Weighted round-robin: clients maintain capability scores per backend; backends report query rate, error rate, utilization in responses; clients adjust scores periodically; best distribution in practice — recommended
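Deterministic subsetting, described above, can be sketched roughly as follows (the book presents a similar algorithm; the function name and `subset_size` values here are illustrative):

```python
import random

def deterministic_subset(backends, client_id, subset_size):
    """Each 'round' of clients shares one shuffled copy of the backend
    list, sliced into disjoint subsets, so load spreads evenly and each
    backend gets roughly the same number of client connections."""
    subsets_per_round = len(backends) // subset_size
    round_id = client_id // subsets_per_round
    subset_id = client_id % subsets_per_round
    shuffled = list(backends)
    random.Random(round_id).shuffle(shuffled)  # same seed for the whole round
    start = subset_id * subset_size
    return shuffled[start:start + subset_size]
```

Within one round, the subsets are disjoint and jointly cover all backends; across rounds, the shuffle changes, so hot clients don't pile onto the same backends.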

Chapter 21 — Handling Overload

  • Gracefully handling overload is fundamental to running a reliable service
  • Strategy: redirect when possible, serve degraded results when necessary, handle resource errors transparently when all else fails
  • QPS is a poor capacity metric because different queries have vastly different resource costs; better to measure capacity in available resources (CPU time per request is a good normalized measure)
  • When global overload occurs: deliver errors to misbehaving customers, other customers remain unaffected; reject out-of-quota requests quickly
  • Client-side throttling: when most CPU is spent rejecting requests, throttle on the client side
  • Adaptive throttling: each client tracks two values over a 2-minute window: requests (attempted) and accepts (accepted by the backend); once requests exceeds K × accepts, the client starts rejecting new requests locally with probability (requests − K × accepts) / (requests + 1); this leads to stable overall request rates in practice
  • Request criticality (Google's four tiers):
    • CRITICAL_PLUS: reserved for most critical; serious user-visible impact if they fail
    • CRITICAL: default for production jobs; will cause user-visible impact; services must provision capacity for CRITICAL_PLUS + CRITICAL traffic
    • SHEDDABLE_PLUS: partial unavailability expected; default for batch jobs
    • SHEDDABLE: frequent partial and occasional full unavailability expected
    • Criticality propagates through RPC calls (same criticality level is used for all upstream calls)
    • Only reject requests of a given criticality if already rejecting all requests of lower criticalities
  • Overload protection at Google is based on utilization (CPU rate / total CPUs reserved, executor load average, combined target thresholds); as threshold is reached, requests are rejected based on criticality
  • Overload errors: if large subset of DC is overloaded, don't retry (errors should bubble up); if small subset is overloaded, prefer immediate retry
  • Request retries: from the load balancer's perspective, retries are indistinguishable from new requests; can be organic load balancing
    • Per-request retry budget (max 3 at Google)
    • Per-client retry budget (track retry ratio; only retry if below ~10%)
    • Return "overloaded; don't retry" error response when a histogram reveals significant retry volume
    • Consider having a server-wide retry budget
  • Handling burst load: expose load to cross-datacenter load balancing algorithm; use a separate proxy backend for batch jobs to shield fan-outs from user-facing services
  • Common mistake: assuming an overloaded service should turn down and stop accepting all traffic; instead, accept as much as possible and only shed load as a last resort
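The adaptive-throttling rule above amounts to a client-side rejection probability of max(0, (requests − K × accepts) / (requests + 1)); a minimal sketch:

```python
import random

def reject_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Client-side rejection probability for adaptive throttling: stay
    quiet while requests <= K * accepts, then reject locally with a
    probability that grows as the backend accepts a smaller fraction."""
    return max(0.0, (requests - k * accepts) / (requests + 1))

def should_send(requests: int, accepts: int, k: float = 2.0) -> bool:
    # Flip a biased coin per request instead of hard-stopping, so the
    # client keeps probing the backend as it recovers.
    return random.random() >= reject_probability(requests, accepts, k)
```

With K = 2 a healthy backend (accepts ≈ requests) is never throttled; a backend rejecting everything pushes the client's local rejection probability toward 1 while still letting occasional probes through.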

Chapter 22 — Addressing Cascading Failures

  • "If at first you don't succeed, back off exponentially." + "Why do people always forget to add a little jitter?"
  • Cascading failure: failure that grows over time as a result of positive feedback; can occur when part of a system fails, increasing the probability that other parts also fail
  • Most common cause: overload
  • Resource types that can be exhausted: CPU, memory, threads, file descriptors, dependencies among resources
  • CPU exhaustion secondary effects: increased in-flight requests, longer queues, thread starvation, reduced CPU cache benefits, health check failures
  • Memory exhaustion secondary effects: dying tasks, increased GC rate in Java (GC death spiral: less CPU → slower requests → increased RAM usage → more GC → even less CPU), reduced cache hit rates, thread starvation
  • Thread starvation: can directly cause errors, health check failures; if threads added without upper bound, thread overhead uses too much RAM; also risks running out of process IDs
  • File descriptor exhaustion: inability to initialize network connections → health check failures
  • Load balancing policies that avoid servers serving errors exacerbate problems (snowball effect on remaining servers)
  • Strategies for avoiding server overload: load test capacity limits and test failure mode for overload; serve degraded results; instrument to reject requests when overloaded; have higher-level systems reject requests (reverse proxy, load balancer, task); capacity planning
  • Queue management: keep queue size ≤50% of thread pool size for steady-load services; for bursty workloads, size based on thread count, processing time, and burst size and frequency; consider LIFO queuing or controlled delay (CoDel) algorithm
  • Load shedding: drop a proportion of traffic as server approaches overload; per-task throttling based on CPU, memory, or queue length; return 503 when too many requests are in-flight
  • Graceful degradation: reduce amount of work (search in-memory cache instead of DB); keep the degradation path simple; test it regularly (a code path you don't exercise will be broken); design a way to turn it off
  • Retry guidelines: always use randomized exponential backoff; limit retries per request; avoid retrying at multiple levels (amplifies load catastrophically); separate retriable from non-retriable errors; return a specific "overloaded; don't retry" status; server-wide retry budgets
  • RPC deadlines: essential to prevent zombie requests consuming resources; propagate deadlines top-down (all downstream RPCs share the same absolute deadline); set an upper bound on outgoing deadlines; deadlines several orders of magnitude longer than mean request latency are usually bad; check deadline before continuing at each processing stage
  • Propagate cancellations: notify servers in the call chain that their work is no longer needed; some systems "hedge" requests and cancel the rest when one responds
  • Cold start issues: processes are slower right after starting (initialization, JIT, deferred class loading, cold cache); when adding load to a cluster, increase gradually
  • Always go downward in the stack; avoid intra-layer communication; communications within a layer are susceptible to distributed deadlocks
  • Triggering conditions for cascading failures: process death (Query of Death, assertion failures), process updates, new rollouts (config or infra changes), organic growth (usage exceeded capacity estimate), planned drains or turndowns
  • Depending on slack CPU as a safety net is dangerous; ensure load tests stay within committed resource limits
  • Testing for cascading failures: test until it breaks; consider both gradual and impulse load patterns; test each component separately; track state between interactions; be careful testing in production
  • Immediate steps to address cascading failures: increase resources (may not be sufficient alone), stop health check failures, restart servers (especially GC death spirals or deadlocks), drop traffic (last resort — let 1% through only), enter degraded mode, eliminate batch load, remove bad traffic
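The retry guidelines above (randomized exponential backoff plus a per-client retry budget) can be sketched as follows (the ~10% ratio comes from the notes; other names and constants are illustrative):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full-jitter randomized exponential backoff: sleep a uniform
    amount up to min(cap, base * 2**attempt), so synchronized clients
    don't retry in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

class RetryBudget:
    """Per-client retry budget: only retry while retries stay under a
    fixed fraction (~10%) of total requests, so retries can't amplify
    an overload."""
    def __init__(self, ratio: float = 0.1):
        self.requests = 0
        self.retries = 0
        self.ratio = ratio

    def record_request(self):
        self.requests += 1

    def can_retry(self) -> bool:
        return self.retries < self.ratio * self.requests

    def record_retry(self):
        self.retries += 1
```

Full jitter (uniform over [0, cap]) is one common choice; the key property from the chapter is only that the backoff is randomized and exponential, and that retries are bounded both per request and per client.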

Chapter 23 — Managing Critical State: Distributed Consensus for Reliability

  • Distributed consensus problem: reaching agreement among a group of processes connected by an unreliable network — one of the most fundamental concepts in distributed computing
  • Questions requiring consensus: who is the leader? What processes are in the group? Has a message been committed to the queue? Does a process hold a lease? What is the value for a key?
  • Whenever you see leader election, critical shared state, or distributed locking: use distributed consensus systems that have been formally proven and tested — don't roll your own
  • CAP theorem: a distributed system cannot simultaneously have all three of: consistent views of data at each node, availability of data at each node, tolerance to network partitions
  • FLP impossibility: no asynchronous distributed consensus algorithm can guarantee progress in the presence of an unreliable network; mitigated by having sufficient healthy replicas and network connectivity (plus backoff jitter to avoid dueling proposers)
  • ACID vs BASE: BASE (Basically Available, Soft state, Eventual consistency) enables higher throughput at the cost of stronger consistency guarantees; eventual consistency can produce surprising results especially with clock drift or network partitions
  • Case study 1 (split-brain): heartbeats can't be used to solve leader election; slow or packet-dropping network can cause both nodes to issue kill commands to each other
  • Case study 2 (human intervention failover): human escalation scales poorly; if the network is so badly degraded that consensus can't elect a primary, a human is not better positioned to elect one either
  • Case study 3 (faulty group membership): gossip-protocol-based cluster formation can lead to split-brain in a network partition, with both sides electing a master and accepting writes
  • Consensus algorithms: Paxos (original), Multi-Paxos, Raft, Zab, Mencius
  • Paxos overview: sequence of proposals accepted by a majority; each proposal has a unique sequence number (strict ordering); acceptors agree only if no higher sequence number seen; proposer commits by sending value to acceptors when majority agrees; acceptors must journal to disk when accepting; two different values cannot be committed for the same proposal (any two majorities overlap at ≥1 node)
  • Replicated State Machine (RSM): executes the same set of operations in the same order on several processes; fundamental building block; any deterministic program can be implemented as a highly available replicated service by modeling it as an RSM
  • Timestamps are highly problematic in distributed systems; use distributed consensus for ordering instead
  • Barriers: primitives that block a group of processes from proceeding until a condition is met; split distributed computation into logical phases; can be implemented as an RSM (Zookeeper supports barriers)
  • Locks should be used with timeouts to prevent deadlocks; supported in RSM
  • Queueing-based systems: tolerate failure/loss of worker nodes; use lease systems to ensure claimed tasks are processed; implementing the queue as an RSM makes the system far more robust
  • Atomic broadcast: messages received by all participants reliably and in the same order — an incredibly powerful primitive
  • Multi-Paxos: strong leader process enables only 1 round trip to reach consensus; backoff jitter and timeouts necessary to avoid dueling proposers
  • For read-heavy workloads: read-only consensus operation, read from replica guaranteed to be most up-to-date (stable leader can provide this), or quorum leases (strongly consistent local reads at the cost of some write performance)
  • Two physical constraints on performance: network round-trip time and lead time for writing to persistent storage
  • Conventional wisdom that consensus algorithms can't be used for high-throughput low-latency is false — proven extremely effective in practice at Google
  • Minimum replicas for non-Byzantine failures: 3 (2-node quorums cannot tolerate any failure); adding a replica to a majority quorum can decrease availability
  • Monitor consensus systems closely: number and status of replicas, lagging replicas, whether a leader exists, rate of leader changes (too rapid = flappiness, sudden decrease = serious bug), consensus transaction number (is the system making progress?), throughput and latency
  • "If you remember nothing else from this chapter": know the problems that distributed consensus can solve and the types of problems that arise from using ad hoc methods like heartbeats
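The claim that any two majority quorums overlap at one or more nodes, which is the heart of the Paxos safety argument above, is easy to verify mechanically (illustrative Python):

```python
from itertools import combinations

def majorities(n: int):
    """All majority quorums of an n-replica group."""
    need = n // 2 + 1
    for size in range(need, n + 1):
        yield from combinations(range(n), size)

def any_two_overlap(n: int) -> bool:
    """Every pair of majority quorums shares at least one acceptor, so
    two different values can never both be committed for the same
    proposal: the overlapping acceptor would have to accept both."""
    qs = [set(q) for q in majorities(n)]
    return all(a & b for a in qs for b in qs)
```

This also makes the 3-replica minimum from the notes concrete: with 2 nodes, a "majority" is both nodes, so losing either one blocks all quorums.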

Chapter 24 — Distributed Periodic Scheduling with Cron

  • Cron: Unix tool for launching arbitrary periodic jobs at user-defined times or intervals
  • Simple cron failure domain is one machine; only state needing persistence across restarts is crontab configuration; launches are fire-and-forget so launch tracking is not needed (exception: anacron-style catch-up for missed launches)
  • Cron jobs vary in character: idempotent (GC, cleanups) vs. side-effectful (email newsletters), time-pressured or not; skipping a launch is generally better than risking a double launch
  • Hosting cron on a single machine is a reliability catastrophe; decouple the cron service from machines
  • Two options for tracking state in distributed cron: external distributed storage (better for large blobs, but small writes are expensive and high-latency) vs. small local state stored as part of the cron service (no extra dependencies, but risk of data loss)
  • Paxos for distributed cron: strong consistency guarantees; leader replica is the only one that actively launches jobs; completion of a launch synced to all replicas; leader election must complete within 1 minute to avoid missing launches
  • Every cron job launch has two sync points: when the launch happens and when it finishes — these delimit the launch
  • To reduce missed/double launches when a leader dies: all operations should be idempotent (use ahead-of-time known job names), or have observability to see if launch requests all succeeded
  • Log compaction: the state change log must be compacted (snapshots work well); can store locally (fast but possible data loss) or externally (not desirable due to small write cost); Paxos helps recover from single-machine log loss via replicas
  • Thundering herd problem: many concurrent cron jobs spawning HTTP calls can cause spikes; solution: allow ? in crontab schedule fields so the cron system picks the value randomly, effectively adding jitter
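The `?`-field jitter described above might be sketched like this (purely illustrative; the chapter doesn't prescribe an implementation, and seeding by job name is my assumption to keep each job's chosen slot stable across restarts):

```python
import random

def resolve_question_marks(schedule: str, job_name: str) -> str:
    """Replace '?' in the minute/hour fields of a crontab-style
    schedule with a per-job random value, spreading out jobs that
    would otherwise all fire at e.g. midnight.
    Only minute (0-59) and hour (0-23) are handled in this sketch."""
    fields = schedule.split()
    limits = [60, 24]  # minute, hour; remaining fields left untouched
    rng = random.Random(job_name)  # stable choice per job across restarts
    for i, limit in enumerate(limits):
        if fields[i] == "?":
            fields[i] = str(rng.randrange(limit))
    return " ".join(fields)
```

So `? ? * * *` for a given job resolves to one fixed minute and hour, different for different jobs, which is exactly the "system picks the value" jitter the notes describe.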

Chapter 25 — Data Processing Pipelines

  • Data pipeline pattern: read data → transform → output; historically: co-routines, DTSS communication files, Unix pipes, ETL pipelines
  • Simple one-phase pipelines: periodic or continuous transformation on big data; multiphase pipelines chain programs in series — organized for ease of understanding rather than operational efficiency
  • Periodic pipelines are generally stable when there are sufficient workers AND execution demand is within computation capacity; fragile when growth and changes strain resources
  • "Embarrassingly parallel" algorithms cut workloads into chunks per machine; end-to-end runtime is capped by the largest customer's runtime
  • Hanging chunk problem: uneven resource distribution across cluster machines; typical user code waits for total computation to complete — one slow chunk delays everything; responding after detection (e.g., killing the job) can make things worse by restarting from scratch
  • Excessive batch scheduler usage places jobs at risk of preemption when cluster load is high (other users starved of batch resources)
  • Moiré load pattern: two or more pipelines run simultaneously, their execution sequences occasionally overlap, simultaneously consuming a shared resource; less common when load arrives more evenly; best observed through shared resource usage
  • Thundering herd problem in pipelines: thousands of workers starting simultaneously, combined with misconfigured or problematic workers, can overwhelm shared cluster services and networking; naive retry logic compounds the problem; adding more workers when a job fails can compound it further
  • Buggy pipelines at scale (10k workers) are always hard on the infrastructure
  • Workflow as Model-View-Controller:
    • Task Master (Model): holds all job states in memory, synchronously journals mutations to persistent disk; can have task groups corresponding to pipeline stages
    • Workers (View): completely stateless and ephemeral; continually update system state transactionally with the master
    • Controller (optional): auxiliary activities — runtime scaling, snapshotting, workcycle state control, rolling back pipeline state
  • Correctness safeguards: config task barriers, mandatory worker leases, unique output naming, mutual validation via server tokens
  • Big data pipelines need to continue processing despite all types of failures
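The mutual-validation safeguard above (worker leases plus server tokens) can be sketched as a toy model (class and method names are mine, not Workflow's actual API):

```python
import uuid

class TaskMaster:
    """Toy Task Master: each lease carries a token, and an output
    commit is accepted only from the worker that still holds a valid
    lease, so stale or zombie workers can't clobber results."""
    def __init__(self, tasks):
        self._pending = set(tasks)
        self._leases = {}   # task -> lease token
        self._results = {}  # task -> committed result

    def lease(self, task):
        if task in self._pending:
            self._pending.discard(task)
            token = uuid.uuid4().hex
            self._leases[task] = token
            return token
        return None  # task already leased or done

    def revoke(self, task):
        # E.g., the lease expired: the task becomes claimable again.
        self._leases.pop(task, None)
        self._pending.add(task)

    def commit(self, task, token, result) -> bool:
        # A worker with a revoked lease (wrong token) is rejected.
        if self._leases.get(task) != token:
            return False
        del self._leases[task]
        self._results[task] = result
        return True
```

The same token check is what makes unique output naming safe: a worker whose lease was reassigned cannot finalize its (now stale) output.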

Chapter 26 — Data Integrity: What You Read Is What You Wrote

  • Data integrity: whatever users think it is; more formally, the measure of accessibility and accuracy of the datastores needed to provide users with adequate service
  • An interface bug causing Gmail to fail to display messages is the same as data being gone — from the user's perspective
  • Every service has independent uptime and data integrity requirements, explicit or implicit
  • Secret to superior data integrity: proactive detection and rapid repair
  • Failure mode dimensions:
    • Scope: narrow/directed vs. widespread
    • Rate: big-bang event vs. creeping (e.g., application logic gradually nulling out or corrupting data over months)
  • Study of 19 data recovery efforts at Google: most common user-visible data loss = deletion or referential integrity loss due to software bugs; hardest cases = low-grade corruption discovered weeks or months later
  • Replication and redundancy are not recoverability — replicas may contain the same corrupted data; media isolation (tapes) protects from media flaws
  • Backups vs. archives: backups can be loaded back into an application; archives safekeep data for auditing, discovery, and compliance (cannot be directly restored to the app)
  • When formulating backup strategy: how quickly must you recover (RTO)? How much recent data can you afford to lose (RPO)?
  • Defense layers:
    • 1st layer — soft/lazy deletion: delay permanent deletion for 15/30/45/60 days; architecture should prevent developers from circumventing it; also use revision history
    • 2nd layer — backups: focus is recovery, not just backup; questions: which methods, how frequently, where stored, how long retained, are they valid, does recovery complete in time, do you have monitoring for recovery state; always rehearse using automation
    • 3rd layer — data validation: bad data propagates; validate high-impact invariants (not super-strict validation that will be abandoned); ability to drill into validation audit logs is essential; out-of-band validation detects creeping data loss
    • Overarching layer — replication: choose a continuously battle-tested popular scheme; not always feasible for every storage instance
  • Cloud environment considerations: mixture of transactional and non-transactional backup solutions means recovered data won't necessarily be correct; services evolving without maintenance windows means different business logic versions may act on data simultaneously
  • General principles: have a beginner's mind (trust but verify, defense in depth); hope is not a strategy (prove recovery works via automation)
  • "Recognize that not just anything can go wrong, but everything will go wrong"

Chapter 27 — Reliable Product Launches at Scale

  • Google has a special team called Launch Coordination Engineers (LCEs) within SRE
    • Audit products for reliability compliance, liaise between teams, drive technical aspects, act as gatekeepers, educate developers on best practices
    • Expected to have strong communication and leadership skills; mediate between disparate parties toward a common goal
    • LCEs are incentivized to prioritize reliability over other concerns
  • A launch is any new code that introduces an externally visible change; up to 70 launches per week measured at Google
  • Advantages of an LCE team: breadth of experience (work across many products, great for knowledge transfer), cross-functional perspective (holistic view, important for complex multi-team/timezone launches), objectivity (nonpartisan advisors between SRE, product devs, PMs, marketing)
  • Launch process criteria: lightweight (easy on devs), robust (catches obvious errors), thorough (addresses important details consistently), scalable (handles both simple and complex launches), adaptable
  • Tactics to achieve these criteria: simplicity (get the basics right, don't plan every eventuality), high-touch approach (experienced engineers customize per launch), fast common paths (identify common launch patterns and provide simplified processes)
  • LCE launch checklist for "launch qualification":
    • Each entry answers a question and provides a concrete, practical, reasonable action item
    • Each question is there to prevent a past mistake; growth of the list is controlled by rigorous review (top leadership reviews); list is reviewed 1–2 times per year to remove obsolete items
    • Infrastructure/tool standardization (Kubernetes, unified logging) simplifies checklists
    • Checklist themes: architecture and dependencies, integration with internal ecosystem, capacity planning, failure modes, client behavior, processes and automation, development process, external dependencies, rollout planning
  • Selected techniques for reliable launches:
    • Gradual and staged rollouts: canary testing, rate-limited signups; almost all updates at Google done gradually
    • Feature flag frameworks: roll out to few servers/users, gradually increase to 1–10%, direct traffic by users/sessions/locations, automatically handle failures in new code paths, independently revert each change, measure user experience impact
    • Server-side client control: ability to force clients to download config from server; important tool against abusive client behavior
    • Overload behavior and load testing: bring the service to its limits; observe how the service AND surrounding services respond
  • Launch reviews (also called Production Reviews) became common practice days to weeks before launch
  • LCE team was Google's solution to achieving safety without impeding change

Part IV: Management

Chapter 28 — Accelerating SREs to On-Call and Beyond

  • Successful SRE teams are built on trust: trusting teammates to know the system, diagnose atypical behavior, reach out for help, and react under pressure

  • There is no single style of education that works best; you need to develop course content specific to your team's systems and culture

  • Recommended training patterns vs anti-patterns:

    • Concrete sequential learning experiences vs menial work and trial-by-fire
    • Encouraging reverse engineering, statistical thinking, first principles vs training strictly through manuals and checklists
    • Celebrating analysis of failure through postmortems vs treating outages as secrets
    • Contained but realistic breakages to fix vs encountering a problem for the first time during live on-call
    • Roleplaying disasters vs creating subject-matter-expert silos
    • Shadowing early in rotation vs pushing into primary before holistic understanding is achieved
  • Training activities should be appropriately paced; any type of structured training is better than random tickets and interrupts

  • Starting point for learning the stack: how does a request enter the system? How is the frontend served? How is load balancing/caching set up? What are typical debugging, escalation, and recovery procedures?

  • On-call learning checklist: lists frontend apps, backend dependencies, SRE experts, developer contacts, and critical knowledge to internalize (clusters, rollback procedures, critical paths); not a playbook — focuses on expert contacts and must-know knowledge

  • Tiered access model: start with read-only access, progress to write access ("powerups" on the route to on-call)

  • Good starter project patterns: make a trivial user-visible feature change end-to-end, add monitoring for a blind spot you found, automate a pain point

  • Five practices for aspiring on-callers:

    1. Read and share postmortems; collect best ones prominently for newbies; use them for Wheel of Misfortune rehearsals; "the most appreciative audience of a postmortem is an engineer who hasn't yet been hired"
    2. Disaster roleplaying (Wheel of Misfortune): 30–60 minute session, primary and secondary attempt root cause, GM can intervene with details to keep it moving
    3. Break real things, fix real things: divert one instance from live traffic, try to break it from a known good configuration, observe how upstream and downstream systems respond
    4. Documentation as apprenticeship: on-call checklist must be internalized before shadowing; establishes system boundaries and what's most important
    5. Shadow on-call early and often: copy alerts to newbie during business hours first; co-author postmortems; use reverse shadowing (senior watches newbie become primary)
  • Some teams conduct final exams before granting full access; on-call is a rite of passage and should be celebrated

Chapter 29 — Dealing with Interrupts

  • Operational load categories:
    • Pages: production alerts requiring immediate response; always have an associated SLO (minutes); managed by dedicated primary on-call
    • Tickets: customer requests requiring action; SLO measured in hours/days/weeks; should not be randomly assigned to team members; processing tickets is a full-time role
    • Ongoing operational activities: flag rollouts, answering support questions, time-sensitive inquiries
  • Metrics for managing interruptions: interrupt SLO / expected response time, number of backlogged interrupts, severity, frequency, number of people available to handle them
  • Most stressed-out on-call engineers are either dealing with pager volume or treating on-call as a constant interrupt — living in a state of constant interruptability is extremely stressful
  • Assign a real cost to context switches: a 20-minute interruption while on a project entails two context switches; realistically results in a loss of a couple hours of truly productive work
  • Polarize time: be clearly in "project mode" or "interrupt mode" — don't try to mix both simultaneously
  • For any interrupt class where volume is too high for one person, add another person
  • On-call principles:
    • Primary should focus only on on-call work; during quiet times, handle tickets or non-critical interrupt work
    • Primary doesn't progress project work; account for this in sprint planning; if there's important project work, don't put that person on-call
    • Secondary may do project work; could support primary in high-pager-volume situations by team agreement
  • Don't spread ticket load across the entire team — it creates context switches for everyone
  • Ticket handoffs should be done the same way as on-call handoffs
  • Regularly examine tickets to identify classes of interrupts with a common or root cause

Chapter 30 — Embedding an SRE to Recover from Operational Overload

  • One way to relieve burden on an overloaded team: temporarily transfer an SRE into the team
    • The embedded SRE focuses on improving practices, not just emptying the ticket queue
    • One SRE transfer usually suffices
  • "More tickets should not require more SREs" — remind teams of this; unless complexity rises, headcount should not scale with ticket volume
  • Identifying kindling (potential crises to address proactively):
    • Subsystems not designed to be self-managing
    • Knowledge gaps within the team
    • Services quietly increasing in importance without being recognized as such
    • Strong dependence on "the next big thing" ("the new architecture will change everything — better not do anything now")
    • Common alerts not diagnosed by either dev or SRE teams
    • Services with complaints but no formal SLO/SLA
    • Services where capacity planning always ends at "add more servers"
    • Postmortems where the only action items are rolling back the specific change
    • Services nobody wants to own (or that devs own one-sidedly)
  • Phases of embedded SRE engagement:
    • Phase 1 — Learn and get context: shadow on-call, understand what prevents the team from improving reliability, identify largest problems and potential emergencies
    • Phase 2 — Share context: write a blameless postmortem for the team; sort fires into toil vs. non-toil
    • Phase 3 — Drive change: start with the basics, e.g. write SLOs if they don't exist; resist the urge to fix the kindling yourself; instead, find work that anyone on the team can accomplish, explain why it's useful, review their code, and repeat; explain your reasoning aloud to build the team's mental models; ask leading questions
    • Final phase — After-action report: a "postvitam" explaining critical decisions at each step that led to success
  • "An SLO is probably the single most important lever for moving a team from reactive ops work to a healthy, long-term SRE focus"
  • Bad apple theory is flawed: systems with multiple interactions make errors inevitable; success requires establishing proper conditions and teaching sound decision-making principles
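The SLO-as-lever point can be made numeric: once a target exists, the team can reason about how much unreliability it has left to "spend" instead of arguing case by case. A minimal sketch of the request-based error-budget arithmetic (function and field names are illustrative, not from the book):

```python
def error_budget_consumed(slo_target, total_requests, failed_requests):
    """Fraction of the error budget used so far.

    With an availability SLO of `slo_target` (e.g. 0.999), the budget
    over `total_requests` is the number of requests allowed to fail:
    total_requests * (1 - slo_target).
    """
    allowed_failures = total_requests * (1.0 - slo_target)
    return failed_requests / allowed_failures

# A 99.9% SLO over 1,000,000 requests allows 1,000 failed requests.
consumed = error_budget_consumed(0.999, 1_000_000, 250)
print(f"{consumed:.0%} of the error budget consumed")  # 25%
```

A value above 1.0 means the budget is blown, which is the signal to slow launches and shift effort toward reliability work.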

Chapter 31 — Communication and Collaboration in SRE

  • Production meetings: articulate the state of services, boost org awareness; typical agenda covers upcoming production changes, performance metrics, past outages, paging events, issues requiring attention
  • SRE Tech Lead role: code review, quarterly presentations outlining team strategy, facilitating consensus-building; provides direction for the team
  • Tech lead vs. manager distinction: tech lead handles most technical management; manager adds performance evaluation and broader organizational responsibilities beyond technical oversight
  • Clear communication is an operational skill; good meetings need ownership, purpose, and output
  • Documentation and status writing are part of the job, not peripheral chores; a technically strong team can still be operationally weak if it communicates poorly

Chapter 32 — The Evolving SRE Engagement Model

  • Production Readiness Review (PRR) phases:

    1. Engagement: teams discuss SLOs/SLAs and plan modifications to enhance dependability
    2. Analysis: service evaluated against production standards and industry practices
    3. Improvements and refactoring: improvements prioritized and negotiated between dev and SRE
    4. Training: staff gain knowledge of system architecture and operational procedures
    5. Onboarding: progressively transfers responsibilities and ownership of various production aspects
    6. Continuous improvement: maintaining established reliability standards over time
  • PRR helps teams identify what reliability measures a specific service needs, based on its unique characteristics

  • Early engagement: bringing SRE participation earlier in development allows assessment of business importance and whether a service's scale justifies deep SRE involvement

  • Sustainable SRE-driven development: codified best practices, reusable components, standardized platforms, and automated systems that enable smarter infrastructure management

  • SRE capacity is finite; different products have different risk profiles, maturity levels, and engineering cultures; a one-size-fits-all engagement model either wastes scarce expertise or spreads it too thin

  • Engagement models should make it obvious what a team must do to earn deeper SRE support or graduate from it


Part V: Conclusions

Chapter 33 — Lessons Learned from Other Industries

  • Four core SRE concepts that parallel mature safety-critical industries: preparedness and disaster training, postmortem culture, automation and reduced operational overhead, structured and rational decision-making
  • Cross-industry practices relevant to SRE:
    • Unwavering organizational commitment to safety protocols
    • Meticulous attention to operational details
    • Maintaining excess capacity for contingencies
    • Regular simulation exercises and hands-on drills
    • Comprehensive staff development and credentialing programs
    • Thorough upfront investigation of system specifications and architectural planning
    • Layered protective measures against failures
  • Software culture is often too eager to believe it invented operational seriousness; fields like aviation, medicine, and manufacturing have been managing risk, human factors, and procedure for much longer
  • The best reliability thinking is interdisciplinary; steal aggressively from mature safety disciplines rather than reinventing operational ideas within the boundaries of tech culture

Chapter 34 — Conclusion

  • Reliability is not a collection of tricks; it is a way of making trade-offs visible, encoding operational knowledge in systems, and building organizations that can change safely
  • Error budgets, toil reduction, actionable alerting, blameless postmortems, graceful overload handling, and sustainable on-call are operating principles, not trends
  • The book rejects two bad extremes: the fantasy that reliability can be reduced to process theater alone, and the fantasy that brilliant engineers can improvise their way through production forever
  • The answer is engineering discipline, measurement, sane incentives, and a bias toward simplicity