Introduction
Alert fatigue is killing our industry. We’ve all been there—woken up at 3 AM by a false positive, spending precious sleep hours investigating a “critical” alert that turns out to be a minor blip. Meanwhile, actual production issues slip through because we’ve learned to ignore the noise.
The fundamental challenge in Site Reliability Engineering isn’t just keeping systems running—it’s building alerting and on-call practices that are both effective and sustainable. Too many organizations treat on-call duty as a necessary evil, implementing ad-hoc processes that burn out engineers and create more problems than they solve.
This post provides a comprehensive framework for building alerting strategies and on-call practices that actually work. We’ll cover everything from alert design principles to escalation matrices, accountability structures to burnout prevention. Most importantly, we’ll give you actionable templates and checklists you can implement immediately.
The Alert Quality Framework
Defining High-Quality Alerts
Before diving into on-call practices, we need to establish what makes an alert worth waking someone up for. Every alert should satisfy these criteria:
The Four Pillars of Alert Quality:
- Actionable - The alert provides clear information about what needs to be done
- Relevant - The alert indicates a real problem affecting users or business operations
- Urgent - The issue requires immediate human intervention
- Contextual - The alert includes enough information to begin troubleshooting
Let’s break each of these down:
Actionable Alerts
An actionable alert tells you not just what’s wrong, but gives you enough information to start fixing it. Compare these two alerts:
Bad: “High CPU usage detected”

Good: “API server CPU >80% for 5 minutes. Current: 92%. Check /health endpoint, review recent deployments, consider scaling API pods.”
The good alert includes:
- Specific threshold and duration
- Current value for context
- Suggested investigation steps
- Potential remediation actions
Relevant Alerts
Relevant alerts indicate problems that actually impact users or business operations. This means distinguishing between symptoms (what users experience) and causes (underlying technical issues).
Symptom-based alerting focuses on user-facing problems:
- Response time degradation
- Error rate increases
- Feature availability issues
Cause-based alerting focuses on infrastructure problems:
- Disk space usage
- Memory consumption
- Network connectivity
Golden Rule: Alert on symptoms first, causes second. Users don’t care if your disk is 85% full—they care if your application is slow.
Urgent Alerts
Urgent alerts require immediate human intervention. If something can wait until business hours, it shouldn’t page someone at 2 AM. Implement alert severity levels:
- P0 (Critical): Complete service outage, data loss risk, security breach
- P1 (High): Significant service degradation, partial outage affecting >50% of users
- P2 (Medium): Minor service issues, single component failures with redundancy
- P3 (Low): Maintenance items, capacity planning, non-urgent optimization opportunities
Only P0 and P1 alerts should trigger immediate pages. P2 and P3 alerts should go to Slack channels or email for business hours review.
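The severity-to-destination mapping above can be sketched in a few lines. This is a minimal illustration, not tied to any particular paging tool; the `"page"`/`"ticket"` destinations are placeholder names.

```python
from enum import Enum

class Severity(Enum):
    P0 = 0  # critical: complete outage, data loss risk, security breach
    P1 = 1  # high: significant degradation, >50% of users affected
    P2 = 2  # medium: minor issues, redundant-component failures
    P3 = 3  # low: maintenance, capacity planning, optimization

def route_alert(severity: Severity) -> str:
    """Only P0/P1 page a human immediately; P2/P3 wait for business hours."""
    if severity in (Severity.P0, Severity.P1):
        return "page"    # wake up the primary on-call
    return "ticket"      # Slack channel / email for business-hours review
```

Encoding this in one routing function (rather than per-alert configuration) keeps the paging policy auditable in a single place.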
Contextual Alerts
Contextual alerts provide the information needed to begin troubleshooting without requiring extensive investigation. Include:
- Runbook links: Direct links to troubleshooting procedures
- Dashboard links: Quick access to relevant monitoring dashboards
- Recent changes: Deployment history, configuration changes, infrastructure modifications
- Impact assessment: How many users/services are affected
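A contextual alert is easiest to enforce if every alert is built from a payload that requires those fields. Here is one possible shape; the field names and URLs are hypothetical examples, not a real tool's schema.

```python
def build_alert(title, current, threshold, runbook_url, dashboard_url,
                recent_changes, affected_users):
    """Assemble an alert payload that always carries troubleshooting context."""
    return {
        "title": title,
        "current_value": current,
        "threshold": threshold,
        "runbook": runbook_url,          # direct link to procedures
        "dashboard": dashboard_url,      # relevant monitoring view
        "recent_changes": recent_changes,  # deploys, config changes
        "impact": f"{affected_users} users affected",
    }

alert = build_alert(
    title="API server CPU >80% for 5 minutes",
    current="92%",
    threshold="80% for 5m",
    runbook_url="https://wiki.example.com/runbooks/api-cpu",      # placeholder
    dashboard_url="https://grafana.example.com/d/api-overview",   # placeholder
    recent_changes=["Deployed API v2.3.1 at 13:00"],
    affected_users=1200,
)
```

Because the constructor takes runbook, dashboard, changes, and impact as required parameters, an alert without context simply cannot be created.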
Alert Tuning Methodology
Alert tuning is an ongoing process, not a one-time setup. Here’s a systematic approach:
1. Baseline Establishment (Weeks 1-2)
Start by collecting baseline metrics for all your key services:
- Service Response Time P95: 250ms ± 50ms
- Error Rate: 0.1% ± 0.05%
- Throughput: 1000 RPS ± 200 RPS
Set initial alert thresholds at 2-3 standard deviations from normal operating ranges. These initial thresholds will be noisy, but they generate the data you need for tuning.
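Computing an initial threshold from baseline samples is straightforward with the standard library. A minimal sketch, assuming a metric where higher is worse (e.g. P95 latency) and illustrative sample values:

```python
from statistics import mean, stdev

def initial_threshold(samples, sigmas=3):
    """Initial alert threshold at `sigmas` standard deviations above the
    observed baseline, for metrics where high values are bad."""
    return mean(samples) + sigmas * stdev(samples)

# e.g. a week of daily P95 response-time readings, in milliseconds
baseline = [210, 240, 255, 260, 230, 250, 245]
threshold_ms = initial_threshold(baseline)  # roughly 290ms for this data
```

For metrics where low values are bad (e.g. throughput), subtract instead of add; and remember this is only a starting point for the audit-driven tuning below.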
2. Alert Audit (Weeks 3-4)
For every alert that fires, ask these questions:
- Did this alert represent a real problem?
- Was immediate action required?
- Did we have enough context to resolve it quickly?
- Would a user have noticed this issue?
Track your answers in a simple spreadsheet:
| Alert Name | Fired | Real Problem? | Action Required? | User Impact? | Resolution Time |
|---|---|---|---|---|---|
| API Response Time | 2024-01-15 03:22 | Yes | Yes | Yes | 15 min |
| Disk Space Warning | 2024-01-15 07:45 | No | No | No | N/A |
3. Threshold Adjustment (Weeks 5-6)
Based on your audit data:
- Increase thresholds for alerts with high false positive rates
- Decrease thresholds for alerts that miss real problems
- Add context to alerts that require extensive investigation
- Remove or downgrade alerts that don’t require immediate action
4. Continuous Improvement
Establish a weekly alert review process:
- Review all alerts from the previous week
- Calculate signal-to-noise ratio (real problems / total alerts)
- Update thresholds and alert content based on findings
- Share learnings with the broader engineering team
Target Metrics:
- Signal-to-noise ratio: >80%
- Mean time to acknowledge: <5 minutes
- Mean time to resolution: <30 minutes for P0, <2 hours for P1
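The weekly review metrics fall directly out of the audit spreadsheet. A small sketch, using the two rows from the audit table above plus two hypothetical ones to make the arithmetic visible:

```python
def weekly_alert_review(audit_rows):
    """Summarize one week of alert audit data.

    Each row is a dict with 'real_problem' (bool) and 'resolution_min'
    (minutes to resolve, or None if no action was needed)."""
    total = len(audit_rows)
    real = sum(1 for row in audit_rows if row["real_problem"])
    resolved = [row["resolution_min"] for row in audit_rows
                if row["resolution_min"] is not None]
    return {
        "signal_to_noise": real / total if total else 0.0,
        "mean_resolution_min": sum(resolved) / len(resolved) if resolved else None,
    }

rows = [
    {"real_problem": True,  "resolution_min": 15},   # API Response Time
    {"real_problem": False, "resolution_min": None},  # Disk Space Warning
    {"real_problem": True,  "resolution_min": 25},    # hypothetical
    {"real_problem": True,  "resolution_min": 20},    # hypothetical
]
metrics = weekly_alert_review(rows)
```

Here the signal-to-noise ratio is 3/4 = 75%, just under the 80% target, which would flag the Disk Space Warning alert as a tuning candidate.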
On-Call Rotation Models
Choosing the Right Rotation Model
The best on-call rotation depends on your team size, service complexity, and business requirements. Here are the most effective models:
Primary/Secondary Model (Recommended for teams of 4-8)
Structure:
- Primary on-call handles all initial alerts
- Secondary on-call serves as backup for escalations
- Each rotation lasts 1 week
- Minimum 4 people in rotation to prevent burnout
Advantages:
- Clear ownership and accountability
- Built-in redundancy for coverage
- Reasonable workload distribution
- Easy to understand and implement
Implementation:
Week 1: Alice (Primary), Bob (Secondary)
Week 2: Carol (Primary), Dave (Secondary)
Week 3: Bob (Primary), Alice (Secondary)
Week 4: Dave (Primary), Carol (Secondary)
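The four-week schedule above follows a simple pattern: engineers are grouped into pairs, each pair serves as (primary, secondary), and then the roles swap. A sketch of that generator, assuming an even number of engineers:

```python
def pairwise_rotation(engineers):
    """Generate a primary/secondary schedule: group engineers into pairs,
    run each pair once, then run them again with roles swapped."""
    assert len(engineers) >= 4 and len(engineers) % 2 == 0, \
        "needs an even rotation of at least 4 to prevent burnout"
    pairs = [engineers[i:i + 2] for i in range(0, len(engineers), 2)]
    schedule, week = [], 1
    for swapped in (False, True):
        for a, b in pairs:
            primary, secondary = (b, a) if swapped else (a, b)
            schedule.append((week, primary, secondary))
            week += 1
    return schedule

for week, primary, secondary in pairwise_rotation(["Alice", "Bob", "Carol", "Dave"]):
    print(f"Week {week}: {primary} (Primary), {secondary} (Secondary)")
```

With four engineers this reproduces the schedule shown above; other interleavings (e.g. offsetting the secondary by one position) work equally well as long as no one serves back-to-back primary weeks.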
Follow-the-Sun Model (For global teams)
Structure:
- Different teams handle on-call for their local business hours
- Seamless handoffs at shift boundaries
- Regional expertise for local infrastructure
Advantages:
- No night/weekend pages for most engineers
- Local knowledge of regional infrastructure
- Better work-life balance
Requirements:
- Distributed team across time zones
- Strong handoff processes
- Shared tooling and documentation
Tier-Based Model (For complex services)
Structure:
- Tier 1: Application/service-specific issues
- Tier 2: Platform/infrastructure issues
- Tier 3: Vendor escalations and complex debugging
Advantages:
- Appropriate expertise for different problem types
- Prevents complex issues from blocking simple fixes
- Clear escalation paths
Disadvantages:
- More complex to manage
- Potential delays in escalation
- Requires well-defined service boundaries
On-Call Handoff Procedures
Effective handoffs are critical for maintaining context and preventing issues from falling through the cracks. Implement these procedures:
Pre-Handoff Checklist (30 minutes before shift end)
- Review all open incidents and their current status
- Check monitoring dashboards for any developing issues
- Update incident documentation with current findings
- Prepare handoff notes for ongoing investigations
- Verify secondary on-call is available and prepared
Handoff Communication Template
**On-Call Handoff - [Date] [Time]**
**Ongoing Incidents:**
- INC-2024-001: API latency spike affecting checkout (Started: 14:30, ETA: 16:00)
- INC-2024-002: Database connection pool exhaustion (Started: 15:45, Investigation ongoing)
**Monitoring Status:**
- All systems green except payment service (degraded performance)
- Upcoming maintenance: Database backup at 02:00
**Recent Changes:**
- Deployed API v2.3.1 at 13:00 (no issues observed)
- Configuration change to load balancer at 14:15
**Watch Items:**
- Memory usage trending upward on app-server-03
- Increased error rate from external payment provider
**Action Items for Next Shift:**
- Monitor resolution of INC-2024-001
- Follow up on database performance investigation
- Review memory usage trend before morning traffic spike
Post-Handoff Verification
The incoming on-call engineer should:
- Confirm receipt of handoff information
- Review all open incident details
- Test access to all necessary tools and systems
- Acknowledge understanding of current system state
- Confirm contact information for escalations
Incident Escalation Framework
Escalation Triggers
Clear escalation criteria prevent incidents from languishing and ensure appropriate expertise gets involved quickly. Define escalation triggers for each severity level:
P0 (Critical) Escalation Triggers
Immediate Escalation (0-15 minutes):
- Complete service outage affecting all users
- Data loss or corruption detected
- Security incident with active threat
- Any incident the primary on-call cannot immediately understand
Time-Based Escalation (30 minutes):
- No progress toward resolution
- Additional services becoming affected
- Customer escalations or media attention
- Need for additional specialized expertise
P1 (High) Escalation Triggers
Time-Based Escalation (1 hour):
- No clear path to resolution identified
- Issue affecting >50% of users
- Multiple service components involved
- Need for coordination with external teams
Impact-Based Escalation:
- Customer complaints increasing
- Business-critical functionality impacted
- SLA breach imminent or occurred
Escalation Matrix Template
Here’s a complete escalation matrix you can adapt for your organization:
| Severity | Initial Response | 30min Escalation | 1hr Escalation | 2hr Escalation |
|---|---|---|---|---|
| P0 | Primary On-Call | Secondary + Manager | Director + Subject Experts | VP Engineering + CEO |
| P1 | Primary On-Call | Secondary On-Call | Manager + Relevant Team Lead | Director |
| P2 | Primary On-Call | Create ticket for business hours | Team Lead (next business day) | N/A |
| P3 | Create ticket | N/A | N/A | N/A |
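A matrix like this is most useful when your tooling can answer "who should be engaged right now?" automatically. A minimal sketch encoding the table above as data plus a lookup (the role strings are taken verbatim from the matrix):

```python
# (minutes elapsed, who to engage), in ascending order per severity
ESCALATION_MATRIX = {
    "P0": [(0, "Primary On-Call"),
           (30, "Secondary + Manager"),
           (60, "Director + Subject Experts"),
           (120, "VP Engineering + CEO")],
    "P1": [(0, "Primary On-Call"),
           (30, "Secondary On-Call"),
           (60, "Manager + Relevant Team Lead"),
           (120, "Director")],
    "P2": [(0, "Primary On-Call"),
           (30, "Create ticket for business hours"),
           (60, "Team Lead (next business day)")],
    "P3": [(0, "Create ticket")],
}

def current_escalation(severity, minutes_elapsed):
    """Return who should be engaged given incident severity and duration."""
    level = ESCALATION_MATRIX[severity][0][1]
    for threshold, who in ESCALATION_MATRIX[severity]:
        if minutes_elapsed >= threshold:
            level = who
    return level
```

Keeping the matrix as plain data means your paging tool, chat bot, and runbooks can all read the same source of truth instead of drifting apart.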
Escalation Communication Protocol
Initial Incident Declaration
When declaring an incident, the on-call engineer must:
- Create incident channel: #incident-2024-001-api-outage
- Post initial assessment:
🚨 INCIDENT DECLARED 🚨
Severity: P1
Impact: API response times >5s, affecting checkout
Started: 2024-01-15 14:30 UTC
Primary: @alice
Secondary: @bob
Status: Investigating
- Start incident timer: Begin tracking time to resolution
- Notify stakeholders: Page appropriate team members based on severity
Escalation Communication
When escalating, provide a structured update:
🔺 ESCALATING TO [LEVEL] 🔺
**Incident**: #incident-2024-001-api-outage
**Duration**: 45 minutes
**Current Status**: Investigation ongoing
**Actions Taken**:
- Restarted API pods (no improvement)
- Checked database performance (normal)
- Reviewed recent deployments (none in last 4 hours)
**Why Escalating**:
- No clear root cause identified
- Customer complaints increasing
- Need database expertise
**Requested Support**: Database team review + Manager awareness
**ETA for Next Update**: 15 minutes
Subject Matter Expert (SME) Integration
For complex systems, identify and document subject matter experts for each component:
SME Contact Matrix
| System Component | Primary SME | Secondary SME | Escalation Hours |
|---|---|---|---|
| API Gateway | @alice | @bob | 24/7 |
| Database Cluster | @carol | @dave | Business hours |
| Payment Processing | @eve | @frank | 24/7 |
| Authentication | @grace | @henry | Business hours |
| Data Pipeline | @iris | @jack | Business hours |
SME Engagement Protocol
- Clear problem statement: What is broken and how does it manifest?
- Context provision: What investigation has already occurred?
- Specific ask: What expertise or action is needed?
- Time commitment: How long will SME engagement be needed?
Accountability and Responsibility Framework
Role Definitions During Incidents
Clear role definitions prevent confusion and ensure accountability during high-stress situations.
Primary On-Call Responsibilities
Before Incidents:
- Monitor alerting channels and dashboards
- Respond to alerts within 5 minutes
- Perform initial triage and impact assessment
- Maintain situational awareness of system health
During Incidents:
- Serve as Incident Commander for P2/P3 incidents
- Drive initial investigation and containment efforts
- Communicate status updates every 15 minutes for P0/P1
- Document all actions and findings in incident channel
- Escalate according to defined triggers and timelines
After Incidents:
- Complete initial incident report within 24 hours
- Ensure all monitoring and documentation is updated
- Participate in post-incident review process
- Implement immediate preventive measures
Secondary On-Call Responsibilities
Before Incidents:
- Maintain backup readiness with tools and access verified
- Stay aware of any ongoing investigations or system issues
- Be available for escalation within 15 minutes during business hours, 30 minutes outside
During Incidents:
- Support primary on-call with investigation and remediation
- Take over Incident Commander role if primary becomes unavailable
- Coordinate with external teams and subject matter experts
- Handle communication overflow (customer updates, stakeholder briefings)
After Incidents:
- Review incident timeline and provide feedback
- Support post-incident review facilitation
- Help implement preventive measures and process improvements
Engineering Manager Responsibilities
During Incidents:
- Provide air cover and remove obstacles for responding engineers
- Handle escalation communication to executive leadership
- Coordinate with other teams (customer support, sales, marketing)
- Make resource allocation decisions (pulling in additional help)
After Incidents:
- Ensure post-incident review occurs within 72 hours
- Review and approve action items with owners and timelines
- Communicate lessons learned to broader engineering organization
- Track pattern recognition across multiple incidents
Post-Incident Accountability
Immediate Actions (Within 24 Hours)
Primary On-Call Must:
- Complete incident timeline with accurate timestamps
- Document all actions taken and their outcomes
- Identify immediate fixes that prevented user impact
- Note any process gaps or improvement opportunities
- Update monitoring or alerting based on incident learnings
Template for Initial Incident Report:
**Incident Summary**: Brief description of what happened
**Timeline**: Key events with timestamps
**Impact**: User impact, duration, affected services
**Root Cause**: Technical root cause (if known)
**Resolution**: How the incident was resolved
**Immediate Actions**: What was done to prevent recurrence
**Follow-Up Required**: Items needing further investigation
Post-Incident Review Process
Within 72 Hours:
- Schedule blameless post-incident review meeting
- Include all key participants (on-call, SMEs, manager)
- Focus on process and system improvements, not individual performance
- Document agreed-upon action items with owners and deadlines
Review Meeting Agenda:
- Timeline Review (15 min): Walk through incident chronologically
- What Went Well (10 min): Identify effective responses and processes
- What Could Be Better (15 min): Identify improvement opportunities
- Action Items (10 min): Assign specific improvements with deadlines
- Process Feedback (10 min): How can the incident response process improve?