Hi, I’m João 👋

SRE & Software Engineer with 6+ years of experience building, securing, and scaling production services. Currently on the Architecture & SRE team at Zwift. I write about reliability engineering, distributed systems, and lessons learned keeping things running at scale.

Welcome to My Blog

After years of meaning to start one, here it is. I’m João, a Senior Software Engineer on the Architecture & SRE team at Zwift, based in Rio de Janeiro. My day-to-day is a mix of platform reliability, incident response, observability, and the kind of slow, unglamorous work that keeps production from catching fire.

Why a blog? The honest answer: I learn best by writing. There’s something about forcing an idea into sentences that reveals whether you actually understand it or were just pattern-matching on vibes. I’ve been keeping private notes and post-mortems for years, and this is me making some of that public. ...

March 30, 2026 · 2 min · João Pereira

KEDA Autoscaling Best Practices: Mastering Kafka and REST API Workload Scaling

Modern cloud-native applications need scaling that goes beyond simple CPU and memory metrics. KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes autoscaling by enabling event-driven scaling based on external metrics such as message queue depth, API response times, and custom application metrics. This guide explores production-ready KEDA implementations for two critical use cases: Kafka consumer lag scaling and REST API workload scaling.

Prerequisites

Before implementing KEDA autoscaling, ensure you have: ...

January 22, 2024 · 6 min · SRE Team

Toil Reduction: Strategic Automation for Operational Excellence

Learn systematic approaches to identifying, measuring, and eliminating operational toil through strategic automation that transforms repetitive manual work into scalable engineering solutions.

January 20, 2024 · 12 min · SRE Team

Capacity Planning: Proactive Resource Management for Scalable Systems

Master capacity planning methodologies, resource forecasting techniques, and proactive scaling strategies to ensure your systems can handle growth while optimizing costs and maintaining performance.

January 19, 2024 · 11 min · SRE Team

SRE Organization Design: Building Effective Team Structures and Collaboration Models

Explore proven SRE organizational patterns, team structures, and collaboration models that enable effective reliability engineering at scale while fostering productive relationships with development teams.

January 18, 2024 · 11 min · SRE Team

Reliability Testing: Systematic Validation of System Resilience

Explore comprehensive reliability testing methodologies, automation frameworks, and systematic validation strategies to ensure your systems can withstand real-world failure conditions.

January 17, 2024 · 10 min · SRE Team

Mastering Incident Postmortems: Turning Failures into Learning Opportunities

Learn how to conduct effective incident postmortems that foster blameless culture, drive systematic improvements, and transform failures into organizational learning opportunities.

January 16, 2024 · 10 min · SRE Team

SRE Alerting and On-Call: A Comprehensive Framework for Sustainable Operations

Introduction

Alert fatigue is killing our industry. We’ve all been there: woken up at 3 AM by a false positive, then losing precious hours of sleep investigating a “critical” alert that turns out to be a minor blip. Meanwhile, actual production issues slip through because we’ve learned to ignore the noise.

The fundamental challenge in Site Reliability Engineering isn’t just keeping systems running; it’s building alerting and on-call practices that are both effective and sustainable. Too many organizations treat on-call duty as a necessary evil, implementing ad-hoc processes that burn out engineers and create more problems than they solve. ...

January 15, 2024 · 10 min · SRE Team

Chaos Engineering: Building Resilience Through Controlled Failure

Learn how to implement chaos engineering practices to build more resilient systems through controlled failure experiments and systematic weakness discovery.

January 15, 2024 · 8 min · SRE Team

Advanced Canary Deployments: Orchestrating Istio, Flagger, and KEDA for Production-Ready Progressive Delivery

Learn how to implement sophisticated canary release strategies by integrating Istio service mesh, Flagger progressive delivery controller, and KEDA event-driven autoscaling for reliable, automated deployments at scale.

January 15, 2024 · 7 min · SRE Team