8 Articles

Site Reliability Engineering

Week 5: What is Infrastructure as Code (IaC) and How to Implement It

João Pereira November 14th, 2024 7 min read Premium

Hey there, tech enthusiasts! 👋 Manual infrastructure configuration and management can eat up countless hours of your time. Manually deploying containers, databases, load balancers, and other pieces of infrastructure, fine-tuning them, and managing it all across multiple environments can be a nightmare and clearly does not scale well. That's what Infrastructure as Code (IaC) aims...

52 Weeks of SRE

Week 4: Incident Management: Key Strategies for SRE and DevOps Teams

João Pereira November 5th, 2024 14 min read Premium

Hey there, reliability champions! 👋 Your incident management strategy can make or break your organization's reliability goals. As SRE and DevOps teams manage increasingly complex systems, effective incident management becomes crucial for maintaining service reliability and customer satisfaction. You need a structured approach that combines monitoring, automation, and proven response strategies to handle incidents efficiently. This post builds upon the previous...

52 Weeks of SRE

Week 3: How to Define Effective Service Level Objectives (SLOs) for Your Organization

João Pereira November 1st, 2024 15 min read Premium

Hey there, reliability champions! 👋 This week, continuing from our previous post on Monitoring Fundamentals, I'm going to introduce you to one of the most powerful tools in an SRE's belt: Service Level Objectives. If you haven't read last week's post, be sure to check it out here: Week 2: Monitoring Fundamentals. Learn monitoring fundamentals:...

52 Weeks of SRE

Building and Deploying a Robust Monitoring Solution for your Applications

João Pereira October 31st, 2024 11 min read Premium

In our previous post on Monitoring Fundamentals, we covered the theoretical aspects of monitoring, including the Four Golden Signals and the importance of metrics and alerting in maintaining reliable systems. Now, let's put that knowledge into practice by building and deploying a complete monitoring stack. Introduction: This guide builds upon our existing repository used throughout the "52 Weeks of...

52 Weeks of SRE

Week 2: Monitoring Fundamentals

João Pereira October 31st, 2024 18 min read Premium

Howdy, SRE enthusiasts! 👋 Monitoring is a fundamental aspect of Site Reliability Engineering that involves collecting, processing, aggregating, and displaying real-time data from one or more systems. It's this data that helps engineers understand their system's behavior, health, and performance. In modern distributed systems, effective monitoring is not just beneficial—it's essential for maintaining reliable services. If...
