Welcome to My Blog
After years of meaning to start one, here it is. I’m João — a Senior Software Engineer on the Architecture & SRE team at Zwift, based in Rio de Janeiro. My day-to-day is a mix of platform reliability, incident response, observability, and the kind of slow, unglamorous work that keeps production from catching fire. Why a blog? The honest answer: I learn best by writing. There’s something about forcing an idea into sentences that reveals whether you actually understand it or were just pattern-matching on vibes. I’ve been keeping private notes and post-mortems for years — this is me making some of that public. ...
KEDA Autoscaling Best Practices: Mastering Kafka and REST API Workload Scaling
Modern cloud-native applications demand intelligent scaling that goes beyond simple CPU and memory metrics. KEDA (Kubernetes Event-Driven Autoscaling) enables event-driven autoscaling based on external metrics such as message queue depth, API response times, and custom application metrics. This guide explores production-ready KEDA implementations for two use cases: Kafka consumer lag scaling and REST API workload scaling.
Prerequisites
Before implementing KEDA autoscaling, ensure you have: ...
Toil Reduction: Strategic Automation for Operational Excellence
Learn systematic approaches to identifying, measuring, and eliminating operational toil through strategic automation that transforms repetitive manual work into scalable engineering solutions.
Capacity Planning: Proactive Resource Management for Scalable Systems
Master capacity planning methodologies, resource forecasting techniques, and proactive scaling strategies to ensure your systems can handle growth while optimizing costs and maintaining performance.
SRE Organization Design: Building Effective Team Structures and Collaboration Models
Explore proven SRE organizational patterns, team structures, and collaboration models that enable effective reliability engineering at scale while fostering productive relationships with development teams.
Reliability Testing: Systematic Validation of System Resilience
Explore comprehensive reliability testing methodologies, automation frameworks, and systematic validation strategies to ensure your systems can withstand real-world failure conditions.
Mastering Incident Postmortems: Turning Failures into Learning Opportunities
Learn how to conduct effective incident postmortems that foster blameless culture, drive systematic improvements, and transform failures into organizational learning opportunities.
SRE Alerting and On-Call: A Comprehensive Framework for Sustainable Operations
Introduction
Alert fatigue is killing our industry. We’ve all been there: woken up at 3 AM by a false positive, spending precious sleep hours investigating a “critical” alert that turns out to be a minor blip. Meanwhile, actual production issues slip through because we’ve learned to ignore the noise. The fundamental challenge in Site Reliability Engineering isn’t just keeping systems running; it’s building alerting and on-call practices that are both effective and sustainable. Too many organizations treat on-call duty as a necessary evil, implementing ad-hoc processes that burn out engineers and create more problems than they solve. ...
Chaos Engineering: Building Resilience Through Controlled Failure
Learn how to implement chaos engineering practices to build more resilient systems through controlled failure experiments and systematic weakness discovery.
Advanced Canary Deployments: Orchestrating Istio, Flagger, and KEDA for Production-Ready Progressive Delivery
Learn how to implement sophisticated canary release strategies by integrating Istio service mesh, Flagger progressive delivery controller, and KEDA event-driven autoscaling for reliable, automated deployments at scale.