Hi, I’m João 👋

SRE & Software Engineer with 6+ years of experience building, securing, and scaling production services. Currently on the Architecture & SRE team at Zwift. I write about reliability engineering, distributed systems, and lessons learned keeping things running at scale.

Welcome to My Blog

After years of meaning to start one, here it is. I’m João, a Senior Software Engineer on the Architecture & SRE team at Zwift, based in Rio de Janeiro. My day-to-day is a mix of platform reliability, incident response, observability, and the kind of slow, unglamorous work that keeps production from catching fire.

Why a blog? The honest answer: I learn best by writing. There’s something about forcing an idea into sentences that reveals whether you actually understand it or were just pattern-matching on vibes. I’ve been keeping private notes and post-mortems for years; this is me making some of that public. ...

March 30, 2026 · 2 min · João Pereira

KEDA Autoscaling Best Practices: Mastering Kafka and REST API Workload Scaling

Modern cloud-native applications need scaling that goes beyond simple CPU and memory metrics. KEDA (Kubernetes Event-Driven Autoscaling) scales workloads on external, event-driven metrics such as message queue depth, API response times, and custom application metrics. This guide explores production-ready KEDA implementations for two use cases: Kafka consumer lag scaling and REST API workload scaling.

Prerequisites

Before implementing KEDA autoscaling, ensure you have: ...

January 22, 2024 · 6 min · SRE Team

SRE Alerting and On-Call: A Comprehensive Framework for Sustainable Operations

Introduction

Alert fatigue is killing our industry. We’ve all been there: woken up at 3 AM by a false positive, spending precious sleep hours investigating a “critical” alert that turns out to be a minor blip. Meanwhile, actual production issues slip through because we’ve learned to ignore the noise. The fundamental challenge in Site Reliability Engineering isn’t just keeping systems running; it’s building alerting and on-call practices that are both effective and sustainable. Too many organizations treat on-call duty as a necessary evil, implementing ad-hoc processes that burn out engineers and create more problems than they solve. ...

January 15, 2024 · 10 min · SRE Team

Advanced Canary Deployments: Orchestrating Istio, Flagger, and KEDA for Production-Ready Progressive Delivery

Learn how to implement sophisticated canary release strategies by integrating Istio service mesh, Flagger progressive delivery controller, and KEDA event-driven autoscaling for reliable, automated deployments at scale.

January 15, 2024 · 7 min · SRE Team

Implementing SLOs for Reliability: A Practical Framework for Service Level Objectives in Production

Learn how to design, implement, and operationalize Service Level Objectives (SLOs) with practical frameworks, real-world examples, and monitoring configurations that drive reliable service delivery.

January 15, 2024 · 7 min · SRE Team

Implementing the Golden Four Signals: A Practical Guide to SRE Monitoring

Site Reliability Engineers face a fundamental challenge: monitoring complex distributed systems without drowning in metrics noise. Google’s four golden signals (latency, traffic, errors, and saturation) provide a battle-tested framework for focusing on what matters most for service reliability. In this guide, we’ll walk through practical implementations using Prometheus, Grafana, and Datadog, complete with production-ready configurations and real-world examples.

Prerequisites

Before diving into implementations, ensure you have:

- Basic understanding of Kubernetes and container orchestration
- Familiarity with Prometheus metrics and PromQL queries
- Access to a Kubernetes cluster (kind, minikube, or cloud-managed)
- Basic knowledge of HTTP status codes and API design principles

Estimated implementation time: 2-4 hours, depending on your existing monitoring setup. ...

January 15, 2024 · 6 min · SRE Team