Part I: Introduction
Chapter 1 — Introduction
- SRE is what happens when you ask a software engineer to design an operations team
- Development teams want to launch features; ops teams want stability — because most outages come from changes, these goals are fundamentally in tension; SRE is the resolution
- Google places a 50% cap on aggregate "ops" work (tickets, on-call, manual tasks) for all SREs; it's an upper bound, not a target
- Excess ops work is redirected back to product teams until load drops below 50%
- Time spent on ops is tracked; remaining time goes to project/engineering work
- SREs should receive on average a maximum of two events per 8–12 hour on-call shift; consistently fewer than one per shift is also a waste
- DevOps vs SRE: DevOps is a generalization of SRE principles to a wider range of organizations; SRE is a specific implementation of DevOps with some idiosyncratic extensions
- SRE team is responsible for: availability, latency, performance, efficiency, change management, monitoring, emergency response, capacity planning
- Postmortems should be written for all significant incidents regardless of whether a page was triggered
- Postmortem goal: document what happened, find all contributing causes, assign corrective actions
- Google operates under a blame-free postmortem culture
- 100% is the wrong reliability target for basically everything; no user can tell the difference between 100% and 99.999%
- Once an availability target is set, the remaining tolerance is the error budget
- Error budget can be spent on anything — features, experiments, rollouts
- Error budget removes the structural conflict of incentives between dev and SRE; outages become expected parts of the innovation process, not catastrophes
- Reliability is a function of MTTF and MTTR; Google observed 3x improvement in MTTR when playbooks were introduced vs improvisation
- ~70% of outages are caused by changes to a live system; best practices: progressive rollouts, fast detection, safe rollbacks
- Three valid outputs of a monitoring system: alerts (immediate human action needed), tickets (eventual action needed), logs (no action needed)
- Capacity planning requires accurate organic demand forecasts, incorporation of inorganic demand, and regular load testing
- A slowdown in a service equates to a loss of capacity; provision to meet a capacity target at a specific response speed
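The MTTF/MTTR relationship above reduces to simple arithmetic; a minimal sketch (the numbers are illustrative, not from the book):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability as a function of mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Cutting MTTR 3x (e.g., by replacing improvisation with playbooks) directly
# improves availability without touching the failure rate.
before = availability(mttf_hours=1000, mttr_hours=3)
after = availability(mttf_hours=1000, mttr_hours=1)
print(f"MTTR 3h: {before:.5f}  MTTR 1h: {after:.5f}")
```

The same arithmetic explains why fast detection and safe rollbacks pay off: they shrink MTTR.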
Chapter 2 — The Production Environment at Google, from the Viewpoint of an SRE
- DC topology: machines → racks → rows → clusters → datacenter → campus
- Homogeneous environments, common deployment patterns, shared storage, and shared scheduling are what make reliability possible at scale
- Every service depends on other services; those dependencies define the failure modes
- Platform standardization is reliability work — it is one of the main ways SRE scales impact without scaling headcount
Part II: Principles
Chapter 3 — Embracing Risk
- Cost of reliability has two dimensions: cost of redundant resources, and opportunity cost of engineering time spent on risk-reduction instead of features
- SREs see risk as a continuum; a target is both a minimum and a maximum — you want to exceed it, but not by much
- Time-based availability: uptime / (uptime + downtime); e.g., a 99.99% target = max ~52 minutes downtime per year
- Aggregate availability: successful requests / total requests; e.g., 2.5M reqs/day at a 99.99% target = max 250 errors/day
- Factors to consider when assessing a service's risk tolerance: required availability level, impact of different failure types, cost of the service, other relevant metrics (latency, etc.)
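The two availability formulas turn directly into budget arithmetic; a small sketch (the 2.5M requests/day figure is the chapter's own example, the rest is plain arithmetic):

```python
def max_downtime_minutes_per_year(target: float) -> float:
    """Time-based budget: the unavailable fraction of a 365-day year, in minutes."""
    return (1 - target) * 365 * 24 * 60

def max_errors_per_day(target: float, requests_per_day: int) -> float:
    """Request-based budget: the unavailable fraction of daily traffic."""
    return (1 - target) * requests_per_day

print(max_downtime_minutes_per_year(0.9999))   # ~52.6 minutes/year
print(max_errors_per_day(0.9999, 2_500_000))   # ~250 errors/day
```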
- Things to consider when setting an availability target: what will users expect, is the service tied to revenue, is it paid or free, what do competitors offer, is it consumer or enterprise?
- Cost calculation example: improving 99.9% → 99.99% on a $1M/year service has a value of ~$900; only worthwhile if the improvement costs less
- Error budget formation: product management defines an SLO; actual uptime is measured by monitoring; the gap is the error budget
- As long as budget remains, new releases can be made
- Error budget can block deployments temporarily to pressure reliability focus
- SREs must have authority to stop launches if budget runs out
- Sometimes an SLO has to be loosened to allow more innovation
- Typical factors causing dev/SRE tension: software fault tolerance, testing depth, push frequency, canary duration and size — error budgets make this balance data-driven instead of political
Chapter 4 — Service Level Objectives
- SLI (Service Level Indicator): quantitative measure of some aspect of service level (latency, error rate, throughput, availability, durability)
- SLO (Service Level Objective): target value or range for an SLI; structure is SLI ≤ target, or lower bound ≤ SLI ≤ upper bound
- Without explicit SLOs, users form their own beliefs about performance — leading to over-reliance or under-reliance
- SLA (Service Level Agreement): explicit or implicit contract with consequences for missing SLOs; easy test — "what happens if the SLO isn't met?" — if nothing, it's an SLO not an SLA
- SLIs by system type:
- User-facing: availability, latency, throughput
- Storage: latency, availability, durability
- Big data: throughput, end-to-end latency
- All systems: correctness
- Most metrics are better thought of as distributions, not averages
- 99th percentile shows plausible worst-case; 50th percentile shows typical case
- High variance in response times affects user experience disproportionately; some teams focus only on high percentile values
- Metrics averaged per minute can hide bursts
- Collect client-side metrics when possible; not measuring at the client misses problems that don't show up server-side
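To see why averages mislead (the per-minute-averaging point above), a hedged sketch with synthetic latencies and a nearest-rank percentile:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value covering p percent of samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, int(round(p / 100 * len(ordered))) - 1))
    return ordered[k]

# 98 fast requests and 2 very slow ones: the mean looks healthy, the tail does not.
latencies_ms = [10] * 98 + [2000] * 2
mean = sum(latencies_ms) / len(latencies_ms)
print(mean)                          # 49.8 -- hides the outliers
print(percentile(latencies_ms, 50))  # 10   -- the typical case
print(percentile(latencies_ms, 99))  # 2000 -- the plausible worst case
```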
- Chubby example: Chubby was so reliable that teams stopped designing for its absence; solution was to take it down deliberately when it was too far above its SLO for the quarter
- Tips for choosing SLO targets:
- Don't pick a target based on current performance — you might be supporting a system that requires heroic effort
- Keep it simple; complicated aggregations obscure changes
- Avoid absolutes (YAGNI)
- Have as few SLOs as possible; defend each one you pick
- Perfection can wait — start loose and tighten, not the other way around
- Keep a safety margin: use internal SLOs (stricter) and external SLOs (looser); don't advertise your internal target externally
- Don't overachieve: users become dependent on over-performing services; deliberately throttle or take the system offline occasionally to avoid over-reliance
- SLOs should specify how they're measured and conditions under which they're valid, e.g., "99% of GET RPC calls will complete in < 100 ms averaged over 1 minute across all backend servers"
- Error budget is effectively an SLO for meeting other SLOs; track it daily/weekly; upper management typically looks at monthly or quarterly
Chapter 5 — Eliminating Toil
- Toil is work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth
- Overhead (admin work not tied directly to running a service) is different from toil
- If human judgment is essential, it might not be toil
- If the service remains in the same state after you finish the task, it was probably toil
- Typical SRE activities: software engineering, systems engineering, toil, overhead
- Top sources of toil: non-urgent service-related messages/email, on-call response, releasing
- Every SRE needs to spend at least 50% of their time on engineering work (when averaged over a few quarters — toil tends to be spiky)
- If individual SREs report excessive toil, it's a signal for managers to redistribute load more evenly and help those SREs find engineering projects
- Toil in small amounts can be tolerable; some people gravitate toward it
- Toil becomes toxic in large quantities — leads to career stagnation, burnout, boredom, sets a precedent for loading more toil onto SREs, promotes attrition, and causes breach of faith with new hires promised project work
Chapter 6 — Monitoring Distributed Systems
- Monitoring: collecting, processing, aggregating, and displaying real-time quantitative data about a system
- White-box monitoring: based on metrics exposed by internals (logs, JVM profiling, HTTP handlers); Black-box monitoring: testing externally visible behavior as a user would see it
- Why monitor: analyze long-term trends, compare over time or experiment groups, alerting, build dashboards, conduct ad hoc retrospective analysis
- Never trigger an alert simply because "something seems a bit weird" (security auditing on very narrow scopes is an exception)
- When pages occur too frequently, engineers second-guess, skim, or ignore alerts — including real ones masked by noise
- Avoid "magic" systems that try to learn thresholds or automatically detect causality (rules detecting unexpected changes in end-user request rates are a valid counter-example)
- Complex dependency hierarchies ("if DB is slow, alert for DB; otherwise alert for website") only work for very stable system parts
- Four golden signals:
- Latency: time to service a request; distinguish latency of successful requests vs. failed ones; increases in latency are an early indicator of saturation
- Traffic: how much demand is being placed on the system (requests/s, broken out by request type)
- Errors: explicit (500s), implicit (200 with wrong content), or by policy (response over 1s = error if you've committed to 1s)
- Saturation: how "full" the service is; emphasizes the most constrained resource; many systems degrade before 100% utilization so having a utilization target is essential; also covers predictions of impending saturation ("DB will fill in 4 hours")
- If you measure all four golden signals and page when one is problematic, your service will be at least decently covered
- For tail latency: collect request counts bucketed by latencies (histograms) rather than raw latencies
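The histogram advice can be sketched as a tiny cumulative bucket counter (the bucket boundaries here are arbitrary choices, not from the book):

```python
from bisect import bisect_left

class LatencyHistogram:
    """Counts requests per latency bucket instead of storing raw latencies."""
    def __init__(self, bounds_ms):
        self.bounds = sorted(bounds_ms)              # upper bounds of finite buckets
        self.counts = [0] * (len(self.bounds) + 1)   # last bucket is overflow

    def observe(self, latency_ms):
        self.counts[bisect_left(self.bounds, latency_ms)] += 1

    def fraction_within(self, bound_ms):
        """Fraction of requests at or below a bucket boundary (for SLO checks)."""
        i = self.bounds.index(bound_ms)
        total = sum(self.counts)
        return sum(self.counts[: i + 1]) / total if total else 0.0

h = LatencyHistogram([10, 30, 100, 300])
for ms in [5, 8, 25, 90, 250, 900]:
    h.observe(ms)
print(h.fraction_within(100))  # 4 of 6 requests completed in <= 100 ms
```

Storing counts per bucket keeps memory constant no matter how many requests arrive, while still letting you read off tail behavior.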
- Alerting rules for humans should be simple, predictable, reliable, and represent a clear failure
- Questions to ask before creating an alert:
- Does this detect an otherwise-undetected condition that is urgent, actionable, and user-visible?
- Will I ever be able to ignore this alert knowing it's benign?
- Can I take action in response? Is that action urgent, or can it wait until morning?
- Are other people already being paged for this?
- On pages: every page should be actionable; every page response should require intelligence; pages with rote algorithmic responses are a red flag; pages should be about novel problems
- Spend more effort catching symptoms than causes; only worry about definite, imminent causes
- In Google's experience: simple collection + aggregation + alerting + dashboards works well; add complexity only when needed
- Periodic reviews of page frequency are done with management in quarterly reports (target: a couple of pages per shift)
- Often, sheer heroic effort can achieve high availability short-term — but a controlled short-term hit is usually a better long-run trade than sustained burnout
Chapter 7 — The Evolution of Automation at Google
- "Automation is a force multiplier, not a panacea"
- Automation is meta-software: software to act on software
- Doing automation thoughtlessly can create as many problems as it solves
- Value of automation: consistency (very few humans act with equal consistency every time), platform extensibility, reduced MTTR, faster non-repair actions (failovers, traffic switching), time savings — decoupling the operator from the operation is powerful
- Hierarchy of automation maturity:
  - No automation
  - Externally maintained, system-specific automation (script in an SRE's home folder)
  - Externally maintained, generic automation (documented for the team)
  - Internally maintained, system-specific automation (versioned in the system's repo)
  - Systems that don't need any automation (the goal)
- Infrequently run automation is fragile
- Relieving teams of ops responsibility can remove their incentive to reduce tech debt; product managers not affected by low-quality automation will always prioritize new features
- Automation failure risk: when automation covers more and more daily activities, human operators lose direct contact with the system; when automation fails, humans may no longer be able to operate it — this is unavoidable in sufficiently autonomous systems, and must be accounted for
- Case study (Ads Database): failovers automated; outages no longer paged a human; total operational maintenance cost dropped ~95%; up to 60% of hardware utilization freed
- Case study (Cluster turnups): early automation was an initial win, but free-form scripts became technical debt; Prodtest (a Python unit test framework extended for real-world services) created a chain of tests that could validate a service's configuration across all clusters
Chapter 8 — Release Engineering
- Release engineering is a distinct discipline: release engineers work with SWEs and SREs to define how software is released, from version control through compilation, testing, packaging, and deployment
- High velocity models: some teams do hourly builds with deploy based on test results; others use "push on green" (every build that passes tests goes to production)
- Hermetic builds: building the same revision always produces identical results; self-contained including the compiler version; allows cherry-picking fixes against old revisions to fix production software
- All code lives in the main branch; releases are branched off; fixes flow from main to the release branch via cherry-pick; branches never merge back
- CI creates an audit trail: tests ran, tests passed
- Config management: deceptively simple, a major source of instability; all schemes should involve source control and strict review
- Option: use mainline for config (decouples binary releases from config changes)
- Option: include config files in the same package as the binary (simple, but tightly coupled)
- Option: package config separately using the same hermetic principle as code
- Six gated operations requiring approval: source code change, release process action definitions, new release creation, integration proposal approval, release deployment, build configuration modifications
- Packages are named, versioned with a unique hash, and signed
- Budget for release engineering resources at the beginning of the product development cycle — it's cheaper to do it now than later
- Common questions every team needs to answer: how to handle package versioning, CI or CD, release cadence, config management policies, release metrics
Chapter 9 — Simplicity
- "At the end of the day, our job is to keep agility and stability in balance in the system"
- "The price of reliability is the pursuit of the utmost simplicity"
- "Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code"
- Reliable systems can increase agility: reliable rollouts make it easier to link changes to bugs
- Essential vs accidental complexity: SREs should push back when accidental complexity is introduced
- Code is a liability, not an asset; remove dead code and other bloat
- Commented-out code is an anti-pattern; forever-gated feature flags are an anti-pattern (flags should be actively rehearsed and removed)
- Smaller APIs are easier to test and more reliable; avoid misc/util classes
- Small releases are easier to debug and measure; you can't tell what happened if 100 changes were released together
- No monitoring: you're blind; SREs don't go on-call for the sake of it, they do it to stay in touch with how systems work and fail
- Managing incidents effectively reduces impact and limits outage-induced anxiety; blameless postmortems are the first step to understanding what went wrong
Part III: Practices
Chapter 10 — Practical Alerting
- Monitoring a very large distributed system presents challenges: vast number of components, need for low maintenance burden
- Borgmon (Google's internal monitoring system, conceptually similar to Prometheus): a programmable calculator with syntactic sugar for generating alerts using a common data exposition format
- Time-series data: conceptually a 2D array with time on one axis and items on the other; each series named by a unique set of labels (name=value)
- Data points are (timestamp, value) pairs stored in chronological lists
- Data stored in-memory, checkpointed to disk; fixed-size allocation; oldest entries GC'd when full
- Counters: monotonically non-decreasing variables (km driven, request count); preferred over gauges because they don't lose meaning when events occur between sampling intervals
- Gauges: any value, doesn't have to be monotonically shifting (fuel remaining, current speed)
- Labels serve three purposes: define breakdowns of the data itself, define the source of the data (service name, container), indicate locality or aggregation (zone, shard)
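A toy version of this label-keyed data model; it assumes nothing about Borgmon's actual internals:

```python
import collections

class TimeSeriesStore:
    """Maps a frozen set of name=value labels to a chronological list of points."""
    def __init__(self):
        self.series = collections.defaultdict(list)

    def record(self, labels: dict, timestamp: float, value: float):
        key = frozenset(labels.items())  # the label set names the series
        self.series[key].append((timestamp, value))

    def rate(self, labels: dict):
        """Per-second rate between the first and last point of a counter series."""
        points = self.series[frozenset(labels.items())]
        (t0, v0), (t1, v1) = points[0], points[-1]
        return (v1 - v0) / (t1 - t0)

store = TimeSeriesStore()
labels = {"name": "http_requests", "service": "web", "zone": "us-east"}
store.record(labels, timestamp=0, value=100)    # counter: monotonically non-decreasing
store.record(labels, timestamp=60, value=1600)
print(store.rate(labels))  # 25.0 requests/second over the window
```

Because the series holds a counter rather than a gauge, the rate stays meaningful even when events occur between sampling intervals.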
- Alertmanager: can inhibit certain alerts when others are active, deduplicate alerts from multiple instances with same labelsets, fan-in or fan-out alerts based on labelsets
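Deduplication and inhibition can be pictured with plain sets; a hypothetical sketch, not Alertmanager's real API:

```python
def process_alerts(alerts, inhibitions):
    """Deduplicate alerts sharing a labelset; drop alerts inhibited by an active one.

    alerts: list of label dicts; inhibitions: {inhibiting_alertname: inhibited_alertname}.
    """
    active = {frozenset(a.items()) for a in alerts}      # dedup identical labelsets
    names = {dict(k)["alertname"] for k in active}
    return [dict(k) for k in active
            if not any(src in names and dict(k)["alertname"] == dst
                       for src, dst in inhibitions.items())]

alerts = [
    {"alertname": "DatacenterDown", "zone": "us-east"},
    {"alertname": "TaskDown", "zone": "us-east"},
    {"alertname": "TaskDown", "zone": "us-east"},   # duplicate, collapsed
]
# If the whole datacenter is down, individual task alerts are noise.
kept = process_alerts(alerts, inhibitions={"DatacenterDown": "TaskDown"})
print([a["alertname"] for a in kept])  # ['DatacenterDown']
```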
- White-box monitoring (Borgmon/Prometheus): sees system internals; Black-box monitoring (Prober): looks at system from outside, monitors what the user sees
- Page-worthy alerts go to the on-call rotation; non-page-worthy alerts go to a separate processing queue or serve as informational data — this distinction is repeatedly emphasized as an important detail
- Rules generating alerts for humans should be simple, represent clear failures, and require intelligence to respond to
Chapter 11 — Being On-Call
- On-call = being available to step in, reacting within a specific time bound (minutes or hours depending on SLA)
- Typical response times: 5 minutes for user-facing or time-critical tasks; 30 minutes for less time-sensitive
- When a page arrives: acknowledge, triage, escalate if necessary
- Non-paging events are less urgent but on-call engineers should vet them during business hours
- Primary and secondary roles: the primary handles pages; the secondary catches fall-through from the primary, handles non-paging events, and provides escalation support
- Balancing on-call quantity: SREs should spend ≥50% of time on engineering; of remaining time, no more than 25% on on-call
- Balancing on-call quality: incident handling (root cause, remediation, postmortem, bug fix) takes ~6 hours; therefore max ~2 incidents per 12-hour shift; strive for a very flat distribution of pages with median 0
- Night shifts degrade health; multi-site teams eliminate night shifts by following the sun; caveat: significant communication and handoff overhead
- SRE managers must keep quantity and quality balanced
- Most important on-call resources: clear escalation paths, well-defined incident management procedures, blameless postmortem culture
- All paging events should be actionable; silencing noisy non-actionable alerts reduces fatigue
- Strive for a 1:1 alert-to-incident ratio; multiple alerts for a single incident add noise
- If too many pages occur, give the pager to the developers owning those services and work with them until alert quality returns to standard; feature development halts until this is resolved
- Compensation: extra pay or time-off for on-call shifts; capped at a proportion of salary to incentivize involvement while limiting burnout
- Operational underload is also a problem: teams should be sized so every engineer is on-call once or twice a quarter, to stay in touch with production
- Wheel of Misfortune helps hone SRE capabilities
- Two modes of thinking under pressure: intuitive/rapid action vs rational/deliberate cognitive function; the latter leads to better outcomes during incidents
- SRE teams can be loaned to overloaded teams temporarily; measure overload symptoms (daily tickets, paging events per shift) explicitly
Chapter 12 — Effective Troubleshooting
- Troubleshooting is the application of the hypothetico-deductive method: iterate hypotheses until one holds
- Troubleshooting is learnable; expertise comes from investigating failures, not just understanding normal operation
- Ideally a problem report gives the top-level symptom; start drilling down into telemetry and logs, narrow down culprits, exclude parts of the system (bisection is a useful tactic), identify contributing factors
- Two ways to test hypotheses: compare observed state against theory to find confirming/disconfirming evidence, or treat the system (change something in a controlled way and observe)
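The bisection tactic mentioned above, sketched over an ordered list of changes; `is_broken` stands in for whatever probe you run against the system:

```python
def find_first_bad(changes, is_broken):
    """Binary-search an ordered list of changes for the first one that breaks things.

    Assumes the system was good before the first change and is bad at the last.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(changes[mid]):
            hi = mid          # fault introduced at mid or earlier
        else:
            lo = mid + 1      # fault introduced after mid
    return changes[lo]

# Hypothetical example: change 7 introduced the fault.
changes = list(range(1, 11))
print(find_first_bad(changes, is_broken=lambda c: c >= 7))  # 7
```

With n changes this needs only O(log n) probes, which is why bisection is worth reaching for early.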
- Common troubleshooting pitfalls:
- Looking at irrelevant symptoms (wild goose chases)
- Misunderstanding system dynamics (inputs, behavior, outputs)
- Coming up with wildly improbable theories
- Hunting down spurious correlations and coincidences
- Confusing correlation with causation
- Always prefer simple explanations; the four golden signals are useful scaffolding for building simple explanations
- An effective problem report contains: expected behavior, reproduction steps, consistent form, and exists somewhere searchable
- Stop the bleeding first — make the software work before investigating root cause; preserve earlier evidence of the incident for later
- Structured logs are important for retrospective analysis; pass trace IDs using a common standard through all layers
- Design systems with well-understood and observable interfaces between components; observability-driven engineering makes troubleshooting sessions dramatically shorter
Chapter 13 — Emergency Response
- Don't panic: you're not alone, the sky is not falling, nobody is dying; if you feel overwhelmed, pull in more people (sometimes everyone has to be paged)
- Follow an incident response process; not following it is itself a contributing cause of incidents
- Three types of emergencies:
- Test-induced: planned, proactive ways to break production; failures are controlled and aborted when things go wrong
- Change-induced: incident stems from deployment or configuration changes
- Process-induced: incident caused by a process (usually automated) that wreaks havoc (e.g., automation that wipes hard drives)
- Lessons from test-induced emergency example:
- Nobody really understood how two systems interacted — review hadn't been good enough despite many eyes
- Incident response process was not followed, which prevented wider awareness
- Rollback procedures had not been rehearsed in a test environment — they were broken
- Now: rollback procedures are tested before any large-scale test
- "All problems have solutions — a solution exists, even if it's not obvious, especially to the person whose pager is screaming"
- Involve the person whose actions triggered the incident; they know the most and change-induced emergencies are typically fixed faster with them involved
- Keep a history of outages: ask hard questions, look for strategic (not just tactical) preventive actions, publish postmortems somewhere everyone can read them, hold people accountable to follow-up actions
- Until a system has failed, you don't know how it, its upstream systems, or its users will react — don't assume
Chapter 14 — Managing Incidents
- Recursive separation of responsibilities: delegate distinct roles with clear boundaries
- Incident Commander: structures the response task force, assigns responsibilities, and holds all undelegated roles; their most important task is maintaining a living incident document (kept in the war room)
- Ops Lead: works with incident command; only person modifying the system during the incident
- Communications Lead: public face of the task force; provides periodic updates to the team and stakeholders
- Planning Lead: handles longer-term issues — bug filing, arranging dinner, tracking handoffs, recording how the system has diverged from normal so it can be reverted
- A single war room (physical or virtual) is recommended; incident command handoffs must be done loudly and explicitly with explicit acknowledgment from all participants
- When to declare an incident (declare early rather than late): do you need a second team? Is the outage customer-visible? Has the issue gone unsolved after an hour of concentrated effort?
- Incident management proficiency degrades when not in regular use
- Best practices:
- Prioritize: stop the bleeding, restore service, preserve evidence for the postmortem
- Prepare: develop and document procedures in advance with incident participants
- Trust: give full autonomy within each assigned role
- Introspect: if you feel panicky or overwhelmed, get more support
- Consider alternatives: periodically re-evaluate whether the current approach still makes sense
- Practice: use the process routinely so it becomes second nature
- Rotate roles: encourage every team member to gain familiarity with each role
Chapter 15 — Postmortem Culture: Learning from Failure
- "The cost of failure is education"
- Postmortem definition: a written record of an incident, its impact, the actions taken to mitigate or resolve it, the contributing causes, and follow-up actions to prevent recurrence
- Primary goals: document the incident, understand all contributing causes, take preventive actions to reduce likelihood and/or impact of recurrence
- Writing a postmortem is not a punishment; it's a learning opportunity; any stakeholder may request one
- Blamelessness:
- Must not indict any individual or team for bad or inappropriate behavior
- Assumes everyone had good intentions and did the right thing with the information they had at the time
- When done well, leads to investigating why individuals had incomplete/incorrect information
- When done badly, leads to finger-pointing and shaming — and, critically, to people hiding information in future incidents
- Postmortem review criteria used at Google: was key incident data collected? Are impact assessments complete? Is the action plan appropriate? Are resulting bug fixes at appropriate priority? Did we share the outcome with relevant stakeholders?
- An unreviewed postmortem might as well never have existed
- Tools for introducing postmortem culture: postmortem of the month newsletter, postmortem reading clubs (regular sessions where impactful postmortems are read aloud), Wheel of Misfortune (re-enact a previous postmortem with the original incident commander present)
- Make writing effective postmortems a rewarded and celebrated practice; even senior leadership should acknowledge and participate (book mentions Larry Page talking about the value of postmortems)
- Ask for feedback on effectiveness: is the culture supporting your work? Does writing one entail too much toil? What tools would you like to see?
Chapter 16 — Tracking Outages
- Postmortems provide useful insights for individual services but can miss opportunities with small per-service impact but large horizontal impact
- The Escalator: Google's in-house PagerDuty equivalent; centralized system tracking ACKs to alerts, notifies others if necessary
- The Outalator: time-interleaved view of notifications for multiple queues at once; allows annotating incidents, marking annotations as important, silently saving email replies, and combining multiple escalating notifications into a single incident entity
- A single event often triggers multiple alerts; the ability to group multiple alerts into a single incident is critical
- Track outages with consistent definitions, user impact, duration, and cause categories; this makes reliability visible enough to influence prioritization
Chapter 17 — Testing for Reliability
- "If you haven't tried it, assume it's broken"
- Confidence comes from both past reliability and future reliability; for future predictions to hold, either the system remains completely unchanged or you can confidently describe all changes
- Passing tests doesn't prove reliability; failing tests generally prove its absence
- Zero MTTR: a system-level test that detects exactly the same problem monitoring would detect; repairing these bugs by blocking a push is quick and convenient
- The more bugs caught pre-production (zero MTTR), the higher the MTBF
- Test types:
- Unit tests: smallest/simplest; assess a single unit of software (class, function) independent of the larger system
- Integration tests: assembled component verification; use dependency injection and mocks to test components in isolation
- System tests: largest scale for undeployed systems; end-to-end functionality
- Smoke tests: simple but critical behavior; short-circuit additional expensive testing
- Performance tests: check performance stays acceptable over the lifecycle
- Regression tests: prevent known bugs from sneaking back; gallery of rogue bugs
- Stress tests: find the limits of a web service
- Canary tests: a subset of servers upgraded to a new version/config and left in incubation; not really a test, it's structured user acceptance
- Canary tests: not necessary to achieve fairness among fractions of user traffic when using exponential rollout
- CI/CD: works best when engineers are notified as soon as the build pipeline fails; unblocking broken pipelines should always be the first priority
- Config files that change more often than once per application release are a major reliability risk if those changes aren't treated the same as application releases
- Config file contents are potentially hostile to the interpreter reading them — a potential security threat vector
- Disaster recovery tools should work "offline" using checkpoint states; they're expected to work with instant consistency, not eventual consistency
- Statistical techniques like fuzzing and chaos testing aren't necessarily repeatable; improve repeatability using seeded random number generators
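The seeding advice in practice, using Python's `random` module; the fuzz-input generator itself is purely illustrative:

```python
import random

def fuzz_once(rng: random.Random) -> bytes:
    """Generate one random test input from a dedicated, seeded RNG."""
    return bytes(rng.randrange(256) for _ in range(rng.randrange(1, 64)))

def fuzz_campaign(seed: int, iterations: int):
    """Reproducible fuzzing: the same seed always yields the same inputs."""
    rng = random.Random(seed)  # never use the global RNG; it isn't replayable
    return [fuzz_once(rng) for _ in range(iterations)]

# When a fuzzed input triggers a failure, log the seed; the run can be replayed.
assert fuzz_campaign(seed=42, iterations=100) == fuzz_campaign(seed=42, iterations=100)
```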
- Key element of site reliability: find each anticipated form of misbehavior and make sure some test reports it
- SRE tools need to be tested too (tools that retrieve/propagate metrics, predict usage, plan capacity)
Chapter 18 — Software Engineering in SRE
- Growth rate of SRE-supported services exceeds the growth rate of the SRE organization; one SRE guiding principle is that team size should not scale directly with service growth
- Team diversity is critical: a mix of traditional software development and systems engineering backgrounds helps prevent blind spots
- Intent-based capacity planning: specify requirements (intent), not implementation; encode them, autogenerate the allocation plan
- Ladder of increasingly intent-based planning:
  - "I want 50 cores in clusters X, Y, Z" — why those?
  - "I want 50 cores in any 3 clusters in region" — why 50, why 3?
  - "I want to meet demand with N+2 redundancy" — why N+2?
  - "I want 5 nines of reliability" — could find N+2 insufficient
- Greatest gains from going to level 3; some sophisticated services go to level 4
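Level 3 of the ladder ("meet demand with N+2") reduces to simple arithmetic; a hedged sketch with invented numbers:

```python
import math

def clusters_needed(demand_qps: float, per_cluster_qps: float, redundancy: int = 2) -> int:
    """N+2: enough clusters to serve peak demand even with two clusters down."""
    n = math.ceil(demand_qps / per_cluster_qps)  # minimum to meet demand
    return n + redundancy

# 5,000 QPS of demand at 1,200 QPS per cluster -> N = 5, provision 7 clusters.
print(clusters_needed(demand_qps=5000, per_cluster_qps=1200))  # 7
```

The point of the ladder is that the caller states only the intent (demand, redundancy) and the tooling derives the cluster count.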
- Auxon case study (Google's intent-based capacity planning tool):
- Built by an SRE who was managing capacity in spreadsheets, then formalized into a full product with backlog, SLA, team ownership
- Inputs: requirements/intent, performance data, demand forecasts, resource supply and pricing
- Uses a mixed-integer or linear programming solver
- Key learnings: don't focus on perfection, launch and iterate; a single email doesn't drive adoption — it needs consistent approach, user advocacy, and senior sponsorship; small releases build confidence; don't over-customize for a few big users; avoid 100% adoption rate; "seed team" of generalists + deep-expertise engineers works well
- SRE software must be designed for scalability, graceful degradation on failure, and easy integration with other infrastructure
- Good candidate SRE projects: reduce toil, improve existing infrastructure, streamline a complex process, and must fit org-wide objectives
- SREs who build products should continue working as SREs rather than becoming embedded developers — they dogfood the tools and bring an invaluable operational perspective
- Guidelines for building SRE software: create a clear message and communicate benefits (SREs are skeptical), evaluate org capabilities, launch and iterate to establish credibility, hold yourself to the same standards as a product team
Chapter 19 — Load Balancing at the Frontend
- DNS is typically the first layer of load balancing; conceptually simple but many dragons exist
- Very little control over client behavior; records selected randomly
- DNS server acts as a caching layer: recursive resolution makes it difficult to find the optimal IP for a given user; responses are cached with TTL
- DNS alone is insufficient; not the right solution for fine-grained control
- Better approach: DNS combined with Virtual IP (VIP) addresses
- Network Load Balancer sits in front; receives packets and routes to available backends
- Consistent hashing: mapping algorithm that remains stable when backends are added or removed, minimizing disruption to existing connections
- Simple connection tracking as default; fall back to consistent hashing under pressure
- Packet forwarding strategies:
- NAT: assumes a completely stateless fallback mechanism
- Direct Server Response (DSR) (layer 2 modification): all LBs and backends must be reachable at the data link layer; Google stopped using this
- Packet encapsulation (GRE): Google started using this; introduces overhead (~24 bytes for IPv4+GRE) that can exceed MTU and require fragmentation
- Load balancer should always prefer redirecting to the least loaded backend
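The stability property of consistent hashing (removing a backend only remaps the connections that were on it) can be illustrated with a minimal hash ring; this is an illustrative sketch, not Maglev or any specific Google implementation:

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Minimal consistent-hash ring: adding or removing a backend only
    remaps the keys that fell on its arcs, not the whole keyspace."""

    def __init__(self, backends, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (point, backend)
        for b in backends:
            self.add(b)

    def _points(self, backend):
        # Several virtual nodes per backend smooth out the distribution.
        for i in range(self.vnodes):
            digest = hashlib.md5(f"{backend}#{i}".encode()).hexdigest()
            yield int(digest, 16)

    def add(self, backend):
        for p in self._points(backend):
            self.ring.append((p, backend))
        self.ring.sort()

    def remove(self, backend):
        self.ring = [(p, b) for p, b in self.ring if b != backend]

    def lookup(self, key):
        # First virtual node clockwise from the key's hash point.
        point = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect_right(self.ring, (point, "")) % len(self.ring)
        return self.ring[idx][1]
```

The key property to check: connections that were not on the removed backend keep their assignment, which is what minimizes disruption during backend churn.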
Chapter 20 — Load Balancing in the Datacenter
- Backend service: typically 100–1000 processes; ideal goal is perfectly distributed load
- Lame duck state: backend task is listening on its port and can serve, but explicitly asks clients to stop sending new requests; broadcasts this state to all active clients
- Main advantage: simplifies clean shutdown; avoids serving errors to requests active on shutting-down tasks
- Shutdown sequence: scheduler sends SIGTERM → task enters lame duck → clients redirect new requests elsewhere → ongoing requests complete → task exits cleanly (or is killed)
- Traffic sinkholing: a client sends very large amounts of traffic to an unhealthy task because the unhealthy task returns errors with very low latency, causing the client to increase request rate
- Fix: tune load balancer to count recent errors as active requests
- If outgoing request latency grows (e.g., competition for network resources from a noisy neighbor), active request count also grows — can trigger GC
- When a task restarts, it often requires significantly more resources for a few minutes (initialization cost, cold cache, JIT warmup)
- Subsetting: clients interact with a limited subset of backends (typically 20–100); reduces connection overhead while maintaining health checking
- Subset selection algorithms: random (bad utilization) → round-robin (permuted order) → deterministic subsetting (each backend assigned to exactly one client per round)
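Deterministic subsetting can be sketched as grouping clients into "rounds" that share one seeded shuffle of the backend list, with each client taking a disjoint slice; a sketch in the spirit of the algorithm named above, not a verbatim reproduction:

```python
import random

def deterministic_subset(backends, client_id, subset_size):
    """Each round of `subset_count` clients shares one shuffled view of
    the backend list and takes a disjoint slice of it, so one round of
    clients covers every backend exactly once."""
    backends = list(backends)             # don't mutate the caller's list
    subset_count = len(backends) // subset_size
    round_id = client_id // subset_count  # clients in a round share a seed
    random.Random(round_id).shuffle(backends)
    subset_id = client_id % subset_count
    start = subset_id * subset_size
    return backends[start:start + subset_size]
```

Because the shuffle is keyed by round rather than by client, utilization stays even: within a round, every backend is assigned to exactly one client.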
- Load balancing policies:
- Round-robin: 2x difference observed between most and least loaded in practice
- Least-loaded round-robin: rounds among least-loaded; load measured by connection count; still suboptimal since it's per-client, not global
- Weighted round-robin: clients maintain capability scores per backend; backends report query rate, error rate, utilization in responses; clients adjust scores periodically; best distribution in practice — recommended
Chapter 21 — Handling Overload
- Gracefully handling overload is fundamental to running a reliable service
- Strategy: redirect when possible, serve degraded results when necessary, handle resource errors transparently when all else fails
- QPS is a poor capacity metric because different queries have vastly different resource costs; better to measure capacity in available resources (CPU time per request is a good normalized measure)
- When global overload occurs: deliver errors to misbehaving customers, other customers remain unaffected; reject out-of-quota requests quickly
- Client-side throttling: when most CPU is spent rejecting requests, throttle on the client side
- Adaptive throttling: each client tracks two values over a 2-minute window: requests (attempted) and accepts (accepted by the backend); once requests exceeds K × accepts, the client starts rejecting new requests locally with increasing probability; this leads to stable overall request rates in practice
- Request criticality (Google's four tiers):
- CRITICAL_PLUS: reserved for most critical; serious user-visible impact if they fail
- CRITICAL: default for production jobs; will cause user-visible impact; services must provision capacity for CRITICAL_PLUS + CRITICAL traffic
- SHEDDABLE_PLUS: partial unavailability expected; default for batch jobs
- SHEDDABLE: frequent partial and occasional full unavailability expected
- Criticality propagates through RPC calls (same criticality level is used for all upstream calls)
- Only reject requests of a given criticality if already rejecting all requests of lower criticalities
- Overload protection at Google is based on utilization (CPU rate / total CPUs reserved, executor load average, combined target thresholds); as threshold is reached, requests are rejected based on criticality
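The rule "only reject a criticality once everything below it is rejected" falls out naturally if each tier sheds above a utilization threshold that increases with criticality; the thresholds here are invented for illustration:

```python
# Google's four tiers, least to most critical (thresholds are hypothetical).
SHED_THRESHOLDS = {
    "SHEDDABLE": 0.80,
    "SHEDDABLE_PLUS": 0.90,
    "CRITICAL": 0.97,
    "CRITICAL_PLUS": 1.01,  # effectively never shed below full saturation
}

def admit(criticality, utilization):
    """Admit a request unless utilization has crossed its tier's
    shedding threshold; higher tiers survive deeper into overload."""
    return utilization < SHED_THRESHOLDS[criticality]
```

At any utilization level, the set of admitted tiers is an upper segment of the criticality ordering, so lower tiers are always shed first.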
- Overload errors: if large subset of DC is overloaded, don't retry (errors should bubble up); if small subset is overloaded, prefer immediate retry
- Request retries: from the load balancer's perspective, retries are indistinguishable from new requests; can be organic load balancing
- Per-request retry budget (max 3 at Google)
- Per-client retry budget (track retry ratio; only retry if below ~10%)
- Return "overloaded; don't retry" error response when a histogram reveals significant retry volume
- Consider having a server-wide retry budget
- Handling burst load: expose load to cross-datacenter load balancing algorithm; use a separate proxy backend for batch jobs to shield fan-outs from user-facing services
- Common mistake: assuming an overloaded service should turn down and stop accepting all traffic; instead, accept as much as possible and only shed load as a last resort
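The client-side adaptive throttling described earlier in the chapter can be sketched as follows. The rejection rule is the book's formula, max(0, (requests − K × accepts) / (requests + 1)); the sliding-window bookkeeping is simplified here to plain counters:

```python
import random

class AdaptiveThrottle:
    """Client-side adaptive throttling: track requests attempted vs.
    accepted by the backend, and reject new requests locally with
    probability max(0, (requests - K*accepts) / (requests + 1)).
    Real implementations age these counters over a ~2-minute window;
    plain counters are used here for brevity."""

    def __init__(self, k=2.0):
        self.k = k
        self.requests = 0
        self.accepts = 0

    def reject_probability(self):
        return max(0.0, (self.requests - self.k * self.accepts)
                        / (self.requests + 1))

    def allow(self):
        return random.random() >= self.reject_probability()

    def record(self, accepted):
        self.requests += 1
        if accepted:
            self.accepts += 1
```

With a healthy backend the rejection probability stays at zero; as the backend starts refusing traffic, the client sheds load locally before it ever reaches the network.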
Chapter 22 — Addressing Cascading Failures
- "If at first you don't succeed, back off exponentially." + "Why do people always forget to add a little jitter?"
- Cascading failure: failure that grows over time as a result of positive feedback; can occur when part of a system fails, increasing the probability that other parts also fail
- Most common cause: overload
- Resource types that can be exhausted: CPU, memory, threads, file descriptors, dependencies among resources
- CPU exhaustion secondary effects: increased in-flight requests, longer queues, thread starvation, reduced CPU cache benefits, health check failures
- Memory exhaustion secondary effects: dying tasks, increased GC rate in Java (GC death spiral: less CPU → slower requests → increased RAM usage → more GC → even less CPU), reduced cache hit rates, thread starvation
- Thread starvation: can directly cause errors, health check failures; if threads added without upper bound, thread overhead uses too much RAM; also risks running out of process IDs
- File descriptor exhaustion: inability to initialize network connections → health check failures
- Load balancing policies that avoid servers serving errors exacerbate problems (snowball effect on remaining servers)
- Strategies for avoiding server overload: load test capacity limits and test failure mode for overload; serve degraded results; instrument to reject requests when overloaded; have higher-level systems reject requests (reverse proxy, load balancer, task); capacity planning
- Queue management: keep queue size ≤50% of thread pool size for steady-load services; for bursty workloads, size based on thread count, processing time, and burst size and frequency; consider LIFO queuing or controlled delay (CoDel) algorithm
- Load shedding: drop a proportion of traffic as server approaches overload; per-task throttling based on CPU, memory, or queue length; return 503 when too many requests are in-flight
- Graceful degradation: reduce amount of work (search in-memory cache instead of DB); keep the degradation path simple; test it regularly (a code path you don't exercise will be broken); design a way to turn it off
- Retry guidelines: always use randomized exponential backoff; limit retries per request; avoid retrying at multiple levels (amplifies load catastrophically); separate retriable from non-retriable errors; return a specific "overloaded; don't retry" status; server-wide retry budgets
- RPC deadlines: essential to prevent zombie requests consuming resources; propagate deadlines top-down (all downstream RPCs share the same absolute deadline); set an upper bound on outgoing deadlines; deadlines several orders of magnitude longer than mean request latency are usually bad; check deadline before continuing at each processing stage
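Deadline propagation can be sketched as passing one absolute expiry down the call chain and re-checking it at each stage (a generic illustration, not Google's RPC framework):

```python
import time

class Deadline:
    """One absolute deadline shared by a whole RPC chain; every stage
    derives its remaining budget from it instead of a fixed timeout."""

    def __init__(self, timeout_s):
        self.expires = time.monotonic() + timeout_s

    def remaining(self):
        return self.expires - time.monotonic()

    def check(self):
        if self.remaining() <= 0:
            raise TimeoutError("deadline exceeded")

def stage(deadline, work_s):
    """A processing stage: check before working, bound the work by the
    remaining budget, and check again before handing off downstream."""
    deadline.check()
    time.sleep(min(work_s, max(deadline.remaining(), 0.0)))
    deadline.check()
```

The point of the second `check()` is that a stage which burned the whole budget fails fast rather than forwarding a zombie request downstream.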
- Propagate cancellations: notify servers in the call chain that their work is no longer needed; some systems "hedge" requests and cancel the rest when one responds
- Cold start issues: processes are slower right after starting (initialization, JIT, deferred class loading, cold cache); when adding load to a cluster, increase gradually
- Always go downward in the stack; avoid intra-layer communication; communications within a layer are susceptible to distributed deadlocks
- Triggering conditions for cascading failures: process death (Query of Death, assertion failures), process updates, new rollouts (config or infra changes), organic growth (usage exceeded capacity estimate), planned drains or turndowns
- Depending on slack CPU as a safety net is dangerous; ensure load tests stay within committed resource limits
- Testing for cascading failures: test until it breaks; consider both gradual and impulse load patterns; test each component separately; track state between interactions; be careful testing in production
- Immediate steps to address cascading failures: increase resources (may not be sufficient alone), stop health check failures, restart servers (especially GC death spirals or deadlocks), drop traffic (last resort — let 1% through only), enter degraded mode, eliminate batch load, remove bad traffic
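The retry guidelines above combine into a small helper: full-jitter exponential backoff, a per-request cap of three retries, and retrying only errors marked transient. `TransientError` and the timing constants are illustrative:

```python
import random
import time

class TransientError(Exception):
    """An error the server marked retriable (not 'overloaded; don't retry')."""

def backoff_with_jitter(attempt, base=0.01, cap=1.0):
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)], so that
    # synchronized clients desynchronize instead of retrying in lockstep.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, max_retries=3):
    """Per-request retry budget: at most `max_retries` retries, with
    randomized exponential backoff between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise
            time.sleep(backoff_with_jitter(attempt))
```

Non-transient errors propagate immediately, which is the "separate retriable from non-retriable errors" guideline in code form.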
Chapter 23 — Managing Critical State: Distributed Consensus for Reliability
- Distributed consensus problem: reaching agreement among a group of processes connected by an unreliable network — one of the most fundamental concepts in distributed computing
- Questions requiring consensus: who is the leader? What processes are in the group? Has a message been committed to the queue? Does a process hold a lease? What is the value for a key?
- Whenever you see leader election, critical shared state, or distributed locking: use distributed consensus systems that have been formally proven and tested — don't roll your own
- CAP theorem: a distributed system cannot simultaneously have all three of: consistent views of data at each node, availability of data at each node, tolerance to network partitions
- FLP impossibility: no asynchronous distributed consensus algorithm can guarantee progress in the presence of an unreliable network; mitigated by having sufficient healthy replicas and network connectivity (plus backoff jitter to avoid dueling proposers)
- ACID vs BASE: BASE (Basically Available, Soft state, Eventual consistency) enables higher throughput at the cost of stronger consistency guarantees; eventual consistency can produce surprising results especially with clock drift or network partitions
- Case study 1 (split-brain): heartbeats can't be used to solve leader election; slow or packet-dropping network can cause both nodes to issue kill commands to each other
- Case study 2 (human intervention failover): human escalation scales poorly; if the network is so badly degraded that consensus can't elect a primary, a human is not better positioned to elect one either
- Case study 3 (faulty group membership): gossip-protocol-based cluster formation can lead to split-brain in a network partition, with both sides electing a master and accepting writes
- Consensus algorithms: Paxos (original), Multi-Paxos, Raft, Zab, Mencius
- Paxos overview: sequence of proposals accepted by a majority; each proposal has a unique sequence number (strict ordering); acceptors agree only if no higher sequence number seen; proposer commits by sending value to acceptors when majority agrees; acceptors must journal to disk when accepting; two different values cannot be committed for the same proposal (any two majorities overlap at ≥1 node)
- Replicated State Machine (RSM): executes the same set of operations in the same order on several processes; a fundamental building block; any deterministic program can be turned into a highly available replicated service by implementing it as an RSM
- Timestamps are highly problematic in distributed systems; use distributed consensus for ordering instead
- Barriers: primitives that block a group of processes from proceeding until a condition is met; split distributed computation into logical phases; can be implemented as an RSM (Zookeeper supports barriers)
- Locks should be used with timeouts to prevent deadlocks; supported in RSM
- Queueing-based systems: tolerate failure/loss of worker nodes; use lease systems to ensure claimed tasks are processed; implementing the queue as an RSM makes the system far more robust
- Atomic broadcast: messages received by all participants reliably and in the same order — an incredibly powerful primitive
- Multi-Paxos: strong leader process enables only 1 round trip to reach consensus; backoff jitter and timeouts necessary to avoid dueling proposers
- For read-heavy workloads: read-only consensus operation, read from replica guaranteed to be most up-to-date (stable leader can provide this), or quorum leases (strongly consistent local reads at the cost of some write performance)
- Two physical constraints on performance: network round-trip time and lead time for writing to persistent storage
- Conventional wisdom that consensus algorithms are too slow for high-throughput, low-latency systems is false — they have proven extremely effective in practice at Google
- Minimum replicas for non-Byzantine failures: 3 (2-node quorums cannot tolerate any failure); adding a replica to a majority quorum can decrease availability
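The counterintuitive point that adding a replica can hurt is easy to check numerically: with independent per-replica availability p, majority-quorum availability is a binomial tail, and going from 3 to 4 replicas raises the quorum from 2 to 3:

```python
from math import comb

def quorum_availability(n, p):
    """P(a majority quorum of n replicas is reachable), assuming each
    replica is independently up with probability p."""
    quorum = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(quorum, n + 1))

# With p = 0.9: 3 replicas -> 0.972, but 4 replicas -> ~0.948,
# because a 4-replica system needs 3 of 4 up.
```

Going from 3 to 5 replicas helps again (quorum 3 of 5), which is why consensus groups are usually sized with odd replica counts.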
- Monitor consensus systems closely: number and status of replicas, lagging replicas, whether a leader exists, rate of leader changes (too rapid = flappiness, sudden decrease = serious bug), consensus transaction number (is the system making progress?), throughput and latency
- "If you remember nothing else from this chapter": know the problems that distributed consensus can solve and the types of problems that arise from using ad hoc methods like heartbeats
Chapter 24 — Distributed Periodic Scheduling with Cron
- Cron: Unix tool for launching arbitrary periodic jobs at user-defined times or intervals
- Simple cron failure domain is one machine; only state needing persistence across restarts is crontab configuration; launches are fire-and-forget so launch tracking is not needed (exception: anacron-style catch-up for missed launches)
- Cron jobs come in shapes: idempotent (GC, cleanups), side-effectful (email newsletters), time-pressured or not; skipping a cron job is generally better than risking a double run
- Hosting cron on a single machine is a reliability catastrophe; decouple the cron service from machines
- Two options for tracking state in distributed cron: external distributed storage (better for large blobs, but small writes are expensive and high-latency) vs. small local state stored as part of the cron service (no extra dependencies, but risk of data loss)
- Paxos for distributed cron: strong consistency guarantees; leader replica is the only one that actively launches jobs; completion of a launch synced to all replicas; leader election must complete within 1 minute to avoid missing launches
- Every cron job launch has two sync points: when the launch happens and when it finishes — these delimit the launch
- To reduce missed/double launches when a leader dies: all operations should be idempotent (use ahead-of-time known job names), or have observability to see if launch requests all succeeded
- Log compaction: the state change log must be compacted (snapshots work well); can store locally (fast but possible data loss) or externally (not desirable due to small write cost); Paxos helps recover from single-machine log loss via replicas
- Thundering herd problem: many concurrent cron jobs spawning HTTP calls can cause spikes; solution: allow "?" in crontab schedule fields so the cron system picks the value randomly, effectively adding jitter
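The "?" extension can be sketched as a field resolver that picks a stable random value per job, so a schedule like "? ? * * *" spreads daily jobs over the whole day (a toy parser, not Google's full cron syntax):

```python
import random

def resolve_schedule(spec, job_name):
    """Resolve '?' in the minute and hour fields of a crontab line to a
    value chosen randomly but deterministically per job, adding jitter
    while keeping each job's schedule stable across restarts."""
    rng = random.Random(job_name)      # stable choice per job name
    fields = spec.split()
    bounds = [(0, 59), (0, 23)]        # minute, hour; other fields left as-is
    resolved = []
    for i, field in enumerate(fields):
        if field == "?" and i < len(bounds):
            lo, hi = bounds[i]
            resolved.append(str(rng.randint(lo, hi)))
        else:
            resolved.append(field)
    return " ".join(resolved)
```

Seeding on the job name is one way to keep each job's launch time fixed while different jobs land at different times, which is what defuses the herd.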
Chapter 25 — Data Processing Pipelines
- Data pipeline pattern: read data → transform → output; historically: co-routines, DTSS communication files, Unix pipes, ETL pipelines
- Simple one-phase pipelines: periodic or continuous transformation on big data; multiphase pipelines chain programs in series — organized for ease of understanding rather than operational efficiency
- Periodic pipelines are generally stable when there are sufficient workers AND execution demand is within computation capacity; fragile when growth and changes strain resources
- "Embarrassingly parallel" algorithms cut workloads into chunks distributed across machines; end-to-end runtime is then capped by the runtime of the slowest chunk (e.g., the largest customer's)
- Hanging chunk problem: uneven resource distribution across cluster machines; typical user code waits for total computation to complete — one slow chunk delays everything; responding after detection (e.g., killing the job) can make things worse by restarting from scratch
- Excessive batch scheduler usage places jobs at risk of preemption when cluster load is high (other users starved of batch resources)
- Moiré load pattern: two or more pipelines run simultaneously, their execution sequences occasionally overlap, simultaneously consuming a shared resource; less common when load arrives more evenly; best observed through shared resource usage
- Thundering herd problem in pipelines: thousands of workers starting simultaneously, combined with misconfigured or problematic workers, can overwhelm shared cluster services and networking; naive retry logic compounds the problem; adding more workers when a job fails can compound it further
- Buggy pipelines at scale (10k workers) are always hard on the infrastructure
- Workflow as Model-View-Controller:
- Task Master (Model): holds all job states in memory, synchronously journals mutations to persistent disk; can have task groups corresponding to pipeline stages
- Workers (View): completely stateless and ephemeral; continually update system state transactionally with the master
- Controller (optional): auxiliary activities — runtime scaling, snapshotting, workcycle state control, rolling back pipeline state
- Correctness safeguards: config task barriers, mandatory worker leases, unique output naming, mutual validation via server tokens
- Big data pipelines need to continue processing despite all types of failures
Chapter 26 — Data Integrity: What You Read Is What You Wrote
- Data integrity: whatever users think it is; more formally, the measure of accessibility and accuracy of the datastores needed to provide users with adequate service
- An interface bug causing Gmail to fail to display messages is the same as data being gone — from the user's perspective
- Every service has independent uptime and data integrity requirements, explicit or implicit
- Secret to superior data integrity: proactive detection and rapid repair
- Failure mode dimensions:
- Scope: narrow/directed vs. widespread
- Rate: big-bang event vs. creeping (distributed application logic contributing to a gradual null value over months)
- Study of 19 data recovery efforts at Google: most common user-visible data loss = deletion or referential integrity loss due to software bugs; hardest cases = low-grade corruption discovered weeks or months later
- Replication and redundancy are not recoverability — replicas may contain the same corrupted data; media isolation (tapes) protects from media flaws
- Backups vs. archives: backups can be loaded back into an application; archives safekeep data for auditing, discovery, and compliance (cannot be directly restored to the app)
- When formulating backup strategy: how quickly must you recover (RTO)? How much recent data can you afford to lose (RPO)?
- Defense layers:
- 1st layer — soft/lazy deletion: delay permanent deletion for 15/30/45/60 days; architecture should prevent developers from circumventing it; also use revision history
- 2nd layer — backups: focus is recovery, not just backup; questions: which methods, how frequently, where stored, how long retained, are they valid, does recovery complete in time, do you have monitoring for recovery state; always rehearse using automation
- 3rd layer — data validation: bad data propagates; validate high-impact invariants (not super-strict validation that will be abandoned); ability to drill into validation audit logs is essential; out-of-band validation detects creeping data loss
- Overarching layer — replication: choose a continuously battle-tested popular scheme; not always feasible for every storage instance
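The first defense layer can be sketched as a store where delete only moves data to a trash area that a purge job empties after the recovery window; the 60-day window and API names are illustrative:

```python
class SoftDeleteStore:
    """Soft/lazy deletion: 'deleted' data is tombstoned and recoverable
    until a purge job removes it after the window expires."""

    def __init__(self, window_days=60):
        self.window = window_days
        self.live = {}
        self.trash = {}  # key -> (value, deleted_at_day)

    def put(self, key, value):
        self.live[key] = value

    def get(self, key):
        return self.live.get(key)

    def delete(self, key, today):
        if key in self.live:
            self.trash[key] = (self.live.pop(key), today)

    def undelete(self, key):
        if key in self.trash:
            value, _ = self.trash.pop(key)
            self.live[key] = value

    def purge(self, today):
        # Permanent deletion happens only here, well after the delete call.
        expired = [k for k, (_, day) in self.trash.items()
                   if today - day >= self.window]
        for k in expired:
            del self.trash[k]
```

Keeping `purge` as the only code path that destroys data is what makes it hard for application bugs (or developers) to circumvent the protection.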
- Cloud environment considerations: mixture of transactional and non-transactional backup solutions means recovered data won't necessarily be correct; services evolving without maintenance windows means different business logic versions may act on data simultaneously
- General principles: have a beginner's mind (trust but verify, defense in depth); hope is not a strategy (prove recovery works via automation)
- "Recognize that not just anything can go wrong, but everything will go wrong"
Chapter 27 — Reliable Product Launches at Scale
- Google has a special team called Launch Coordination Engineers (LCEs) within SRE
- Audit products for reliability compliance, liaise between teams, drive technical aspects, act as gatekeepers, educate developers on best practices
- Expected to have strong communication and leadership skills; mediate between disparate parties toward a common goal
- LCEs are incentivized to prioritize reliability over other concerns
- A launch is any new code that introduces an externally visible change; up to 70 launches per week measured at Google
- Advantages of an LCE team: breadth of experience (work across many products, great for knowledge transfer), cross-functional perspective (holistic view, important for complex multi-team/timezone launches), objectivity (nonpartisan advisors between SRE, product devs, PMs, marketing)
- Launch process criteria: lightweight (easy on devs), robust (catches obvious errors), thorough (addresses important details consistently), scalable (handles both simple and complex launches), adaptable
- Tactics to achieve these criteria: simplicity (get the basics right, don't plan every eventuality), high-touch approach (experienced engineers customize per launch), fast common paths (identify common launch patterns and provide simplified processes)
- LCE launch checklist for "launch qualification":
- Each entry answers a question and provides a concrete, practical, reasonable action item
- Each question is there to prevent a past mistake; growth of the list is controlled by rigorous review (top leadership reviews); list is reviewed 1–2 times per year to remove obsolete items
- Infrastructure/tool standardization (Kubernetes, unified logging) simplifies checklists
- Checklist themes: architecture and dependencies, integration with internal ecosystem, capacity planning, failure modes, client behavior, processes and automation, development process, external dependencies, rollout planning
- Selected techniques for reliable launches:
- Gradual and staged rollouts: canary testing, rate-limited signups; almost all updates at Google done gradually
- Feature flag frameworks: roll out to few servers/users, gradually increase to 1–10%, direct traffic by users/sessions/locations, automatically handle failures in new code paths, independently revert each change, measure user experience impact
- Server-side client control: ability to force clients to download config from server; important tool against abusive client behavior
- Overload behavior and load testing: bring the service to its limits; observe how the service AND surrounding services respond
- Launch reviews (also called Production Reviews) became common practice days to weeks before launch
- LCE team was Google's solution to achieving safety without impeding change
Part IV: Management
Chapter 28 — Accelerating SREs to On-Call and Beyond
- Successful SRE teams are built on trust: trusting teammates to know the system, diagnose atypical behavior, reach out for help, and react under pressure
- There is no single style of education that works best; you need to develop course content specific to your team's systems and culture
- Recommended training patterns vs anti-patterns:
- Concrete sequential learning experiences vs menial work and trial-by-fire
- Encouraging reverse engineering, statistical thinking, first principles vs training strictly through manuals and checklists
- Celebrating analysis of failure through postmortems vs treating outages as secrets
- Contained but realistic breakages to fix vs encountering a problem for the first time during live on-call
- Roleplaying disasters vs creating subject-matter-expert silos
- Shadowing early in rotation vs pushing into primary before holistic understanding is achieved
- Training activities should be appropriately paced; any type of structured training is better than random tickets and interrupts
- Starting point for learning the stack: how does a request enter the system? How is the frontend served? How is load balancing/caching set up? What are typical debugging, escalation, and recovery procedures?
- On-call learning checklist: lists frontend apps, backend dependencies, SRE experts, developer contacts, and critical knowledge to internalize (clusters, rollback procedures, critical paths); not a playbook — focuses on expert contacts and must-know knowledge
- Tiered access model: start with read-only access, progress to write access ("powerups" on the route to on-call)
- Good starter project patterns: make a trivial user-visible feature change end-to-end, add monitoring for a blind spot you found, automate a pain point
- Five practices for aspiring on-callers:
- Read and share postmortems; collect best ones prominently for newbies; use them for Wheel of Misfortune rehearsals; "the most appreciative audience of a postmortem is an engineer who hasn't yet been hired"
- Disaster roleplaying (Wheel of Misfortune): 30–60 minute session, primary and secondary attempt root cause, GM can intervene with details to keep it moving
- Break real things, fix real things: divert one instance from live traffic, try to break it from a known good configuration, observe how upstream and downstream systems respond
- Documentation as apprenticeship: on-call checklist must be internalized before shadowing; establishes system boundaries and what's most important
- Shadow on-call early and often: copy alerts to newbie during business hours first; co-author postmortems; use reverse shadowing (senior watches newbie become primary)
- Some teams conduct final exams before granting full access; on-call is a rite of passage and should be celebrated
Chapter 29 — Dealing with Interrupts
- Operational load categories:
- Pages: production alerts requiring immediate response; always have an associated SLO (minutes); managed by dedicated primary on-call
- Tickets: customer requests requiring action; SLO measured in hours/days/weeks; should not be randomly assigned to team members; processing tickets is a full-time role
- Ongoing operational activities: flag rollouts, answering support questions, time-sensitive inquiries
- Metrics for managing interruptions: interrupt SLO / expected response time, number of backlogged interrupts, severity, frequency, number of people available to handle them
- Most stressed-out on-call engineers are either dealing with pager volume or treating on-call as a constant interrupt — living in a state of constant interruptability is extremely stressful
- Assign a real cost to context switches: a 20-minute interruption while on a project entails two context switches; realistically results in a loss of a couple hours of truly productive work
- Polarize time: be clearly in "project mode" or "interrupt mode" — don't try to mix both simultaneously
- For any interrupt class where volume is too high for one person, add another person
- On-call principles:
- Primary should focus only on on-call work; during quiet times, handle tickets or non-critical interrupt work
- Primary doesn't progress project work; account for this in sprint planning; if there's important project work, don't put that person on-call
- Secondary may do project work; could support primary in high-pager-volume situations by team agreement
- Don't spread ticket load across the entire team — it creates context switches for everyone
- Ticket handoffs should be done the same way as on-call handoffs
- Regularly examine tickets to identify classes of interrupts with a common or root cause
Chapter 30 — Embedding an SRE to Recover from Operational Overload
- One way to relieve burden on an overloaded team: temporarily transfer an SRE into the team
- The embedded SRE focuses on improving practices, not just emptying the ticket queue
- One SRE transfer usually suffices
- "More tickets should not require more SREs" — remind teams of this; unless complexity rises, headcount should not scale with ticket volume
- Identifying kindling (potential crises to address proactively):
- Subsystems not designed to be self-managing
- Knowledge gaps within the team
- Services quietly increasing in importance without being recognized as such
- Strong dependence on "the next big thing" ("the new architecture will change everything — better not do anything now")
- Common alerts not diagnosed by either dev or SRE teams
- Services with complaints but no formal SLO/SLA
- Services where capacity planning always ends at "add more servers"
- Postmortems where the only action items are rolling back the specific change
- Services nobody wants to own (or that devs own one-sidedly)
- Phases of embedded SRE engagement:
- Phase 1 — Learn and get context: shadow on-call, understand what prevents the team from improving reliability, identify largest problems and potential emergencies
- Phase 2 — Share context: write a blameless postmortem for the team; sort fires into toil vs. non-toil
- Phase 3 — Drive change: start with the basics — write SLOs if they don't exist; resist the urge to fix kindling yourself; instead, find accomplishable work for anyone on the team, explain usefulness, review their code, repeat; explain your reasoning to build mental models; ask leading questions
- Final phase — After-action report: a "postvitam" explaining critical decisions at each step that led to success
- "An SLO is probably the single most important lever for moving a team from reactive ops work to a healthy, long-term SRE focus"
- Bad apple theory is flawed: systems with multiple interactions make errors inevitable; success requires establishing proper conditions and teaching sound decision-making principles
Chapter 31 — Communication and Collaboration in SRE
- Production meetings: articulate the state of services, boost org awareness; typical agenda covers upcoming production changes, performance metrics, past outages, paging events, issues requiring attention
- SRE Tech Lead role: code review, quarterly presentations outlining team strategy, facilitating consensus-building; provides direction for the team
- Tech lead vs. manager distinction: tech lead handles most technical management; manager adds performance evaluation and broader organizational responsibilities beyond technical oversight
- Clear communication is an operational skill; good meetings need ownership, purpose, and output
- Documentation and status writing are part of the job, not peripheral chores; a technically strong team can still be operationally weak if it communicates poorly
Chapter 32 — The Evolving SRE Engagement Model
- Production Readiness Review (PRR) phases:
- Engagement: dev and SRE discuss SLOs/SLAs and plan the changes needed to improve reliability
- Analysis: the service is evaluated against production standards and established best practices
- Improvements and refactoring: fixes are prioritized and negotiated between dev and SRE
- Training: SREs learn the system's architecture and operational procedures
- Onboarding: responsibility and ownership of production aspects are transferred progressively
- Continuous improvement: established reliability standards are maintained over time
- PRR helps teams identify which reliability measures a specific service needs, based on its unique characteristics
- Early engagement: bringing SRE participation earlier in development allows assessment of business importance and whether a service's scale justifies deep SRE involvement
- Sustainable SRE-driven development: codified best practices, reusable components, standardized platforms, and automated systems that enable smarter infrastructure management
- SRE capacity is finite; different products have different risk profiles, maturity levels, and engineering cultures; a one-size-fits-all engagement model either wastes scarce expertise or spreads it too thin
- Engagement models should make it clear what a team must do to earn deeper SRE support or graduate from it
Part V: Conclusions
Chapter 33 — Lessons Learned from Other Industries
- Four core SRE concepts that parallel mature safety-critical industries: preparedness and disaster training, postmortem culture, automation and reduced operational overhead, structured and rational decision-making
- Cross-industry practices relevant to SRE:
- Unwavering organizational commitment to safety protocols
- Meticulous attention to operational details
- Maintaining excess capacity for contingencies
- Regular simulation exercises and hands-on drills
- Comprehensive staff development and credentialing programs
- Thorough upfront investigation of system specifications and architectural planning
- Layered protective measures against failures
- Software culture is often too eager to believe it invented operational seriousness; fields like aviation, medicine, and manufacturing have been managing risk, human factors, and procedure for much longer
- The best reliability thinking is interdisciplinary; steal aggressively from mature safety disciplines rather than reinventing operational ideas within the boundaries of tech culture
Chapter 34 — Conclusion
- Reliability is not a collection of tricks; it is a way of making trade-offs visible, encoding operational knowledge in systems, and building organizations that can change safely
- Error budgets, toil reduction, actionable alerting, blameless postmortems, graceful overload handling, and sustainable on-call are operating principles, not trends
- The book rejects two bad extremes: the fantasy that reliability can be achieved through process theater, and the fantasy that brilliant engineers can improvise their way through production forever
- The answer is engineering discipline, measurement, sane incentives, and a bias toward simplicity
