Joao Pereira

Introduction to Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations problems. It was born at Google in the early 2000s, and since then it has fundamentally changed how organizations think about running production systems.

But SRE is more than a job title or a team name. It's a set of principles, practices, and cultural norms that help engineering organizations balance the tension between shipping fast and keeping things running.

In this post, I'll walk through what SRE is, where it came from, its core principles, how it compares to DevOps, and common misconceptions that trip people up.

The Origin Story

In 2003, Ben Treynor Sloss joined Google to lead a team responsible for running production systems. Instead of staffing the team with traditional system administrators, he hired software engineers and gave them the mandate to automate their way out of operational work.

The idea was simple but radical: treat operations as a software problem. If running a system requires manual, repetitive work, write code to eliminate it. If reliability is important, measure it with the same rigor you'd apply to any engineering metric.

Treynor described SRE as "what happens when you ask a software engineer to design an operations function." That framing is still the clearest definition I've encountered.

Google published the Site Reliability Engineering book in 2016, making the discipline accessible to the broader industry. Two follow-up books — the SRE Workbook and Building Secure & Reliable Systems — expanded on the practices.

Core Principles

SRE is built on a handful of interconnected principles. Understanding them is essential before diving into specific practices.

SLIs, SLOs, and SLAs

These three acronyms form the foundation of reliability measurement in SRE.

Service Level Indicators (SLIs) are quantitative measures of a service's behavior. They answer the question: "How is the service performing right now?" Common SLIs include:

  • Availability — the proportion of requests that succeed
  • Latency — the time it takes to serve a request (usually measured at a percentile, like p99)
  • Throughput — the rate of requests the system handles
  • Error rate — the proportion of requests that fail
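
To make these concrete, here's a minimal sketch of computing the availability and latency SLIs from raw request records. The (success, latency_ms) data shape is an assumption — in practice these numbers come from your metrics pipeline, not a Python list.

```python
# Sketch: deriving two SLIs from raw request records.
# Each record is a (success, latency_ms) pair — an assumed shape.
import math

def availability(requests):
    """Proportion of requests that succeeded."""
    return sum(1 for ok, _ in requests if ok) / len(requests)

def latency_percentile(requests, p):
    """Latency at percentile p (0-100), nearest-rank method."""
    latencies = sorted(ms for _, ms in requests)
    rank = max(0, math.ceil(p / 100 * len(latencies)) - 1)
    return latencies[rank]

# Fabricated example data
requests = [(True, 120), (True, 180), (False, 950), (True, 90)]
print(availability(requests))            # 0.75
print(latency_percentile(requests, 99))  # 950
```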

Service Level Objectives (SLOs) are targets for your SLIs. An SLO says: "We want this SLI to meet this threshold over this time window." For example:

SLO: 99.9% of requests should return successfully within 200ms
      measured over a rolling 30-day window.
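
Checking an SLO like this is mechanical: count the events that were both successful and fast enough, and compare the ratio to the target. A sketch, with an illustrative helper name and assumed (success, latency_ms) records:

```python
# Sketch: evaluating the example SLO above over a batch of requests.
def slo_met(requests, target=0.999, threshold_ms=200):
    """True if enough requests both succeeded and met the latency bar."""
    good = sum(1 for ok, ms in requests if ok and ms <= threshold_ms)
    return good / len(requests) >= target

# 999 good events out of 1000 exactly meets a 99.9% target
window = [(True, 100)] * 999 + [(False, 900)]
print(slo_met(window))  # True
```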

Service Level Agreements (SLAs) are contractual commitments — often with financial consequences — made to external customers. An SLA is typically backed by an SLO, but the SLO should be stricter than the SLA. If your SLA promises 99.9% availability, your internal SLO might target 99.95%, giving you a buffer.

The relationship looks like this:

SLI  →  "What are we measuring?"
SLO  →  "What target are we aiming for?"
SLA  →  "What did we promise the customer?"

Error Budgets

An error budget is the inverse of your SLO. If your SLO is 99.9% availability over 30 days, your error budget is 0.1% — roughly 43 minutes of downtime per month.
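
The arithmetic is simple enough to sketch directly (the function name is illustrative):

```python
# Sketch: error budget as the complement of an availability SLO.
def error_budget_minutes(slo, window_days):
    """Downtime allowed by an availability SLO over the window."""
    return round((1 - slo) * window_days * 24 * 60, 1)

print(error_budget_minutes(0.999, 30))   # 43.2 minutes
print(error_budget_minutes(0.9995, 30))  # 21.6 minutes
```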

This might seem like a small detail, but it's one of the most powerful concepts in SRE. Error budgets reframe reliability as a resource to be spent, not an absolute to be maximized.

When you have error budget remaining, you can:

  • Ship new features aggressively
  • Run experiments and migrations
  • Accept more risk in deployments

When the error budget is exhausted, you shift focus:

  • Slow down feature releases
  • Prioritize reliability improvements
  • Invest in automation and testing

This creates a natural feedback loop between development velocity and operational stability. Instead of development and operations teams fighting about risk, they share a concrete, measurable budget.

Toil

Google's SRE book defines toil as work that is:

  • Manual — a human has to do it
  • Repetitive — it happens over and over
  • Automatable — a machine could do it
  • Tactical — it's reactive, not strategic
  • Without enduring value — doing it once doesn't prevent it from recurring
  • Scales linearly with service growth — more traffic means more toil

Examples of toil include manually restarting a crashed service, running a script to rotate credentials every month, or hand-editing configuration files for each deployment.

SRE teams aim to keep toil below 50% of their time. The other 50% should go toward engineering work that reduces future toil or improves system reliability. This isn't just an aspirational target — it's an operational contract. If toil exceeds 50%, something needs to change: either the team automates the work, pushes back on the source, or the organization needs to invest more in tooling.

Automation

Automation is the primary weapon against toil. But not all automation is equal. SRE favors automation that is:

  • Reliable — the automation itself must not become a source of incidents
  • Observable — you need to know when automation runs, succeeds, or fails
  • Idempotent — running it twice should produce the same result as running it once
  • Gradual — automate the most painful work first; don't try to boil the ocean

A common progression looks like:

  1. Document the manual process
  2. Script it so a human can run it with one command
  3. Schedule it to run automatically
  4. Self-heal — the system detects the problem and fixes itself

Each step reduces human involvement and increases reliability.
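
The "idempotent" property above deserves a tiny illustration: an ensure-style operation converges on a desired state, so running it twice produces the same result as running it once. The dict standing in for real infrastructure is, of course, an assumption:

```python
# Sketch of idempotent automation: act only when actual state
# differs from desired state, so repeat runs are safe no-ops.
def ensure_replicas(service, desired):
    """Scale only if the actual state differs from the desired state."""
    if service["replicas"] != desired:
        service["replicas"] = desired
        return "changed"
    return "unchanged"  # no-op on repeat runs

svc = {"name": "checkout", "replicas": 2}
print(ensure_replicas(svc, 5))  # changed
print(ensure_replicas(svc, 5))  # unchanged — idempotent
```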

SRE vs DevOps

This is one of the most common questions in the field, and the answer is nuanced.

DevOps is a cultural movement focused on breaking down silos between development and operations teams. It emphasizes collaboration, shared responsibility, CI/CD, and infrastructure as code. DevOps is broad — it's a philosophy more than a prescriptive set of practices.

SRE is a specific implementation of DevOps principles. It provides concrete practices, metrics, and organizational structures for achieving the goals DevOps describes.

Google's own framing makes the relationship explicit: "class SRE implements DevOps" — SRE as a concrete class implementing the DevOps interface.

Aspect           | DevOps                           | SRE
-----------------|----------------------------------|-------------------------------------
Focus            | Culture and collaboration        | Reliability and engineering
Metrics          | Deployment frequency, lead time  | SLIs, SLOs, error budgets
Approach to risk | "Ship fast, fix fast"            | "Spend your error budget wisely"
Toil             | Reduce via automation            | Measured and capped at 50%
Team structure   | Embedded or cross-functional     | Dedicated SRE teams or embedded SREs
Scope            | Broad (build, deploy, run)       | Deep (run, measure, improve)

You don't have to choose between them. Most organizations benefit from DevOps culture broadly and SRE practices where reliability is critical.

Real-World Examples

Error Budgets in Practice

Imagine a payments service with a 99.95% availability SLO (roughly 22 minutes of downtime per month). The team wants to roll out a major database migration.

With error budgets, this becomes a data-driven conversation:

  • "We've used 5 minutes of our 22-minute budget this month."
  • "The migration might cause 3-5 minutes of degraded service."
  • "We have enough budget — let's proceed with a rollback plan ready."

Without error budgets, it's a gut-feel debate: "Is it safe enough?" "Maybe we should wait." The error budget turns vague anxiety into concrete risk management.
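
That conversation reduces to a go/no-go check. A sketch, with numbers mirroring the payments example and an illustrative helper name:

```python
# Sketch: a data-driven release gate based on remaining error budget.
def migration_allowed(budget_min, used_min, worst_case_min):
    """Proceed only if the worst-case impact fits the remaining budget."""
    return used_min + worst_case_min <= budget_min

# 22-minute budget, 5 minutes used, migration may cost up to 5 more
print(migration_allowed(22, 5, 5))   # True
print(migration_allowed(22, 20, 5))  # False
```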

Toil Reduction

A team spends 4 hours per week manually scaling their service up before anticipated traffic spikes (marketing campaigns, seasonal peaks). This is textbook toil — manual, repetitive, automatable.

The SRE approach:

  1. Instrument the service to report current load and capacity headroom
  2. Define scaling thresholds based on historical data
  3. Implement autoscaling with a Horizontal Pod Autoscaler
  4. Add alerting for when autoscaling fails or approaches limits

The 4 hours per week of toil becomes an afternoon of engineering work, plus a few minutes of monitoring going forward.
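
The core of steps 2-3 can be sketched as the calculation an autoscaler performs: replicas needed is load divided by per-replica capacity, clamped to configured bounds. The capacities and bounds here are illustrative:

```python
# Sketch: the replica-count calculation behind autoscaling.
import math

def desired_replicas(current_load, capacity_per_replica,
                     min_replicas=2, max_replicas=20):
    """Scale to fit current load, clamped to configured bounds."""
    needed = math.ceil(current_load / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(4500, 1000))  # 5
print(desired_replicas(100, 1000))   # 2 (floor protects availability)
```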

Common Misconceptions

"SRE means 100% uptime." No. SRE explicitly rejects the pursuit of 100% uptime. Perfection is infinitely expensive and actively harmful — if you never have downtime, you're not shipping fast enough. The goal is to find the right reliability target and spend your error budget wisely.

"SRE is just ops with a fancier title." An SRE team that only does operations isn't doing SRE. The 50% engineering / 50% operations split is fundamental. If your "SRE team" is just a renamed ops team with no engineering mandate, you're doing it wrong.

"You need Google's scale to benefit from SRE." SRE principles are scale-independent. A three-person startup benefits from having SLOs and error budgets just as much as a hyperscaler. The practices scale down — you don't need to adopt everything at once.

"SRE replaces developers' responsibility for reliability." SRE is a shared responsibility model. Development teams still own the reliability of their code. SRE teams provide expertise, tooling, and practices — they don't absorb all operational accountability.

Tools Commonly Used in SRE

SRE is tool-agnostic, but certain tools have become standard in the ecosystem:

  • Monitoring & Observability — Prometheus, Grafana, Datadog, OpenTelemetry
  • Incident Management — PagerDuty, Opsgenie, incident.io
  • Infrastructure — Kubernetes, Terraform, Ansible
  • CI/CD — ArgoCD, GitHub Actions, Jenkins
  • Chaos Engineering — Chaos Monkey, Litmus, Gremlin
  • SLO Management — Sloth, Nobl9, Google Cloud SLO monitoring

The tools matter less than the practices behind them. You can implement effective SLOs with a spreadsheet and a cron job. The important thing is that you're measuring, setting targets, and making decisions based on data.

Where to Go from Here

If you're new to SRE, start with two things:

  1. Pick one service and define an SLO for it. Choose a meaningful SLI (availability or latency are good starting points), set a target, and start measuring. You'll learn more from this exercise than from reading any book.

  2. Identify your top source of toil. What repetitive operational task consumes the most time on your team? Start there.

In the next post in this series, we'll dive deep into observability — specifically metrics and OpenTelemetry — which is the foundation that makes everything else in SRE possible. You can't set SLOs if you can't measure SLIs, and you can't measure SLIs without solid instrumentation.