In our previous post on Infrastructure as Code, we covered the theory behind IaC, including its benefits, common tooling, and best practices. Now, let's put that knowledge into practice by building and deploying our application's infrastructure, along with a managed database and a complete monitoring stack, all hosted on DigitalOcean.

Week 5: What is Infrastructure as Code (IaC) and How to Implement It

Introduction

This guide builds upon our existing repository used throughout the "52 Weeks of SRE" blog series, which is based on the excellent go-base template. We'll work on enhancing this service by defining its required infrastructure as code, using industry-standard tools Terraform and Ansible.

Prerequisites

  • Terraform
  • Ansible
  • Understand the basics of Infrastructure as Code
  • A DigitalOcean account

I'll use DigitalOcean to deploy and manage our applications throughout this guide.

As always, I'll be working on top of the "52 Weeks of SRE" Repository.

GitHub - jpereiramp/52-weeks-of-sre-backend: The back-end project used throughout the “52 Weeks of SRE” blog series

Throughout this article we'll be deploying the following infrastructure:

  • 1 Droplet for hosting our Golang App
  • 1 Droplet for hosting our monitoring stack (Grafana & Prometheus)
  • 1 Managed PostgreSQL Database
  • 1 Load Balancer for handling traffic to the Golang App (created only in the production environment)
  • Firewall rules for proper security

We'll first define these infrastructure pieces with Terraform. Then, we'll configure them automatically with Ansible, so that our apps get deployed, our Grafana dashboards configured, our Prometheus alert rules set up, and so on.

Terraform's Core Concepts

Let's first go through some of Terraform's basics, then we'll move on to building our infrastructure as code. I highly suggest you check out Terraform's Official Documentation to deepen your knowledge of all its features.

1. Configuration Files

  • main.tf: The primary configuration file containing resource definitions
    • Should focus on resource creation and relationships
    • Keep it organized by logical components (networking, compute, storage)
    • Use consistent naming conventions for resources
  • variables.tf: Variable declarations for configuration flexibility
    • Define input variables with clear descriptions
    • Include validation rules for variable values
    • Specify type constraints (string, number, list, map)
    • Use sensitive = true for confidential variables
  • outputs.tf: Defines values to expose after apply
    • Useful for passing information between modules
    • Essential for integration with other tools
    • Can be used for documentation purposes
  • versions.tf: Version constraints for Terraform and providers. Example:
terraform {
  required_version = ">= 1.0.0"
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}
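
To make the variables.tf and outputs.tf bullets above a bit more concrete, here is a minimal sketch of a typed variable with a validation rule, a map variable, and an output that exposes a value after apply. The names allowed_ssh_cidrs and extra_tags are hypothetical and are not part of this project's configuration.

```hcl
# Hypothetical variable, for illustration only: a typed list with validation.
variable "allowed_ssh_cidrs" {
  description = "CIDR blocks allowed to reach SSH"
  type        = list(string)
  default     = ["0.0.0.0/0"]

  validation {
    # cidrhost() fails on a malformed CIDR, and can() turns that failure into false.
    condition     = alltrue([for cidr in var.allowed_ssh_cidrs : can(cidrhost(cidr, 0))])
    error_message = "Every entry must be a valid CIDR block."
  }
}

# Hypothetical variable, for illustration only: a map type constraint.
variable "extra_tags" {
  description = "Additional key/value labels"
  type        = map(string)
  default     = {}
}

# Outputs are printed after apply and can be read by other modules or tools.
output "allowed_ssh_cidrs" {
  description = "CIDR blocks currently allowed to reach SSH"
  value       = var.allowed_ssh_cidrs
}
```

Keeping the validation next to the declaration means a bad value is rejected at plan time, before any API call is made.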

2. Understanding Terraform Providers

Terraform providers are plugins that enable Terraform to interact with cloud providers, SaaS providers, and other APIs. They serve as the bridge between Terraform and external services, translating Terraform configurations into API calls to create and manage resources.

Key Concepts

  • Provider Configuration: Each provider needs to be configured with the necessary credentials and connection details
  • Provider Resources: Providers expose specific resources that can be created and managed
  • Provider Data Sources: Read-only data that can be queried from the provider

Example provider configuration:

# DigitalOcean provider
provider "digitalocean" {
  token = var.do_token
}

# AWS Provider
provider "aws" {
  region     = "us-west-2"
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
}

# Azure Provider
provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
  tenant_id       = var.tenant_id
}
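
Data sources tend to be the least obvious of the three concepts above, so here is a small sketch of one, assuming an SSH key named "my-laptop" (a hypothetical name) has already been uploaded to the DigitalOcean account. The Droplet is only there to show how the looked-up value gets consumed.

```hcl
# Read-only lookup of an SSH key that already exists in the account.
# "my-laptop" is a hypothetical key name.
data "digitalocean_ssh_key" "default" {
  name = "my-laptop"
}

# A resource can reference the data source instead of a hard-coded fingerprint.
resource "digitalocean_droplet" "example" {
  name     = "data-source-example"
  image    = "ubuntu-22-04-x64"
  region   = "nyc1"
  size     = "s-1vcpu-1gb"
  ssh_keys = [data.digitalocean_ssh_key.default.id]
}
```

Terraform never creates or destroys what a data block points at; it only reads it during plan and apply.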

Provider Registry

Terraform providers are distributed via the Terraform Registry, which serves as the main directory for publicly available providers. The registry includes:

  • Official providers maintained by HashiCorp
  • Partner providers maintained by technology companies
  • Community providers maintained by individual contributors

Provider Documentation

For detailed information about a specific provider, consult its documentation page in the Terraform Registry.

In our example implementation using DigitalOcean, you can find the complete provider documentation at DigitalOcean Provider.

3. State Management Best Practices

  • Use remote state storage (e.g. HCP Terraform, AWS S3, Azure Storage)
  • Enable state locking to prevent concurrent modifications
  • Implement state backup strategies

Example backend configuration using S3:

terraform {
  backend "s3" {
    bucket         = "terraform-state-bucket"
    key            = "project/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock-table"
    encrypt        = true
  }
}

4. Module Organization

Modules should be organized by functionality:

modules/
├── networking/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
├── compute/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── database/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf
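
As a rough sketch of how a root configuration might consume the layout above, the module blocks below wire the networking module's output into the compute module. The input names (environment, region, vpc_id) and the vpc_id output are assumptions for illustration; they would need to be declared in each module's variables.tf and outputs.tf.

```hcl
module "networking" {
  source      = "./modules/networking"
  environment = var.environment
  region      = var.region
}

module "compute" {
  source      = "./modules/compute"
  environment = var.environment
  region      = var.region

  # Assumes modules/networking/outputs.tf declares an output named "vpc_id".
  vpc_id = module.networking.vpc_id
}
```

Passing values through outputs like this also tells Terraform the ordering: the VPC is created before anything in the compute module that depends on it.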

Terraform Implementation

The following Terraform configuration provisions the infrastructure in DigitalOcean: a VPC, Droplets for the App and Monitoring services, a managed PostgreSQL database, a load balancer (production only), and firewall rules.

We'll work inside a new folder, terraform, at the root of our "52 Weeks of SRE" repository. Inside this folder, let's start implementing our infrastructure with Terraform.

  1. terraform/main.tf
terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

provider "digitalocean" {
  token = var.do_token
}

# VPC for network isolation
resource "digitalocean_vpc" "app_network" {
  name   = "app-network-${var.environment}"
  region = var.region
}

# Application Droplet
resource "digitalocean_droplet" "app_host" {
  name     = "app-host-${var.environment}"
  size     = var.app_droplet_size
  image    = "ubuntu-22-04-x64"
  region   = var.region
  vpc_uuid = digitalocean_vpc.app_network.id
  ssh_keys = [var.ssh_key_fingerprint]

  tags = ["app-host", var.environment]

  lifecycle {
    create_before_destroy = true
    prevent_destroy       = false
  }
}

# Monitoring Droplet
resource "digitalocean_droplet" "monitoring_host" {
  name     = "monitoring-host-${var.environment}"
  size     = var.monitoring_droplet_size
  image    = "ubuntu-22-04-x64"
  region   = var.region
  vpc_uuid = digitalocean_vpc.app_network.id
  ssh_keys = [var.ssh_key_fingerprint]

  tags = ["monitoring-host", var.environment]

  lifecycle {
    create_before_destroy = true
    prevent_destroy       = false
  }
}

# Managed PostgreSQL Database
resource "digitalocean_database_cluster" "app_db" {
  name       = "app-db-${var.environment}"
  engine     = "pg"
  version    = "14"
  size       = var.db_size
  region     = var.region
  node_count = var.environment == "production" ? 3 : 1

  maintenance_window {
    day  = "sunday"
    hour = "02:00:00"
  }
}

# Database Firewall Rules
resource "digitalocean_database_firewall" "app_db_fw" {
  cluster_id = digitalocean_database_cluster.app_db.id

  # Allow access from App host
  rule {
    type  = "droplet"
    value = digitalocean_droplet.app_host.id
  }
}

# Load Balancer for production environment
resource "digitalocean_loadbalancer" "app_lb" {
  count  = var.environment == "production" ? 1 : 0
  name   = "app-lb-${var.environment}"
  region = var.region

  vpc_uuid = digitalocean_vpc.app_network.id

  forwarding_rule {
    entry_port      = 80
    entry_protocol  = "http"
    target_port     = 8080
    target_protocol = "http"
  }

  healthcheck {
    port     = 8080
    protocol = "http"
    path     = "/health"
  }

  droplet_ids = [digitalocean_droplet.app_host.id]
}

# Firewall rules for App host
resource "digitalocean_firewall" "app_firewall" {
  name = "app-firewall-${var.environment}"

  droplet_ids = [digitalocean_droplet.app_host.id]

  inbound_rule {
    protocol   = "tcp"
    port_range = "22"
    source_addresses = ["0.0.0.0/0", "::/0"]
  }

  inbound_rule {
    protocol   = "tcp"
    port_range = "8080"
    source_addresses = ["0.0.0.0/0", "::/0"]
  }

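  outbound_rule {
    protocol              = "tcp"
    port_range            = "1-65535"
    destination_addresses = ["0.0.0.0/0", "::/0"] # outbound rules take destination_addresses
  }

  # Allow outbound UDP as well (needed for DNS lookups)
  outbound_rule {
    protocol              = "udp"
    port_range            = "1-65535"
    destination_addresses = ["0.0.0.0/0", "::/0"]
  }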
}

# Firewall rules for Monitoring host
resource "digitalocean_firewall" "monitoring_firewall" {
  name = "monitoring-firewall-${var.environment}"

  droplet_ids = [digitalocean_droplet.monitoring_host.id]

  inbound_rule {
    protocol   = "tcp"
    port_range = "22"
    source_addresses = ["0.0.0.0/0", "::/0"]
  }

  inbound_rule {
    protocol = "tcp"
    port_range = "3000"  # Grafana
    source_addresses = ["0.0.0.0/0", "::/0"] # Allow connections from anywhere
  }

  inbound_rule {
    protocol = "tcp"
    port_range = "9090"  # Prometheus
    source_addresses = [digitalocean_vpc.app_network.ip_range] # Only allow connections from VPC
  }

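  outbound_rule {
    protocol              = "tcp"
    port_range            = "1-65535"
    destination_addresses = ["0.0.0.0/0", "::/0"] # outbound rules take destination_addresses
  }

  # Allow outbound UDP as well (needed for DNS lookups)
  outbound_rule {
    protocol              = "udp"
    port_range            = "1-65535"
    destination_addresses = ["0.0.0.0/0", "::/0"]
  }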
}
  2. terraform/variables.tf
variable "do_token" {
  description = "DigitalOcean API Token with write access"
  type        = string
  sensitive   = true
}

variable "region" {
  description = "DigitalOcean region for resource deployment"
  type        = string
  default     = "nyc1"

  validation {
    condition     = can(regex("^[a-z]{3}[1-3]$", var.region))
    error_message = "Region must be a valid DigitalOcean region code (e.g., nyc1, sfo2)."
  }
}

variable "environment" {
  description = "Environment name (staging/production)"
  type        = string
  default     = "staging"

  validation {
    condition     = contains(["staging", "production"], var.environment)
    error_message = "Environment must be either 'staging' or 'production'."
  }
}

variable "ssh_key_fingerprint" {
  description = "SSH key fingerprint for Droplet access"
  type        = string
}

variable "app_droplet_size" {
  description = "Size of the App host droplet"
  type        = string
  default     = "s-1vcpu-1gb"
}

variable "monitoring_droplet_size" {
  description = "Size of the Monitoring host droplet"
  type        = string
  default     = "s-1vcpu-1gb"
}

variable "db_size" {
  description = "Size of the database cluster"
  type        = string
  default     = "db-s-1vcpu-1gb"
}
  3. terraform/outputs.tf
output "app_host_ip" {
  description = "Public IP address of the App host"
  value       = digitalocean_droplet.app_host.ipv4_address
}

output "monitoring_host_ip" {
  description = "Public IP address of the Monitoring host"
  value       = digitalocean_droplet.monitoring_host.ipv4_address
}

output "load_balancer_ip" {
  description = "Public IP address of the Load Balancer (production only)"
  value       = var.environment == "production" ? digitalocean_loadbalancer.app_lb[0].ip : null
}

output "database_host" {
  description = "Database connection host"
  value       = digitalocean_database_cluster.app_db.host
  sensitive   = true
}

output "database_port" {
  description = "Database connection port"
  value       = digitalocean_database_cluster.app_db.port
}

output "vpc_id" {
  description = "ID of the created VPC"
  value       = digitalocean_vpc.app_network.id
}

With these three files, we have everything we need to deploy our app's infrastructure to DigitalOcean! Furthermore, we can easily reuse them for different environments (for example, a hosted staging environment alongside production).
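
One way to reuse the same configuration per environment is a separate variable file for each. Here's a sketch of what a production file could look like; the file name production.tfvars and the size values are assumptions to adapt to your own needs, and the fingerprint is a placeholder.

```hcl
# production.tfvars (hypothetical file name)
# The API token is deliberately left out; it can be supplied through the
# TF_VAR_do_token environment variable instead of being written to disk.
environment             = "production"
region                  = "nyc1"
app_droplet_size        = "s-2vcpu-4gb"
monitoring_droplet_size = "s-2vcpu-4gb"
db_size                 = "db-s-2vcpu-4gb"
ssh_key_fingerprint     = "<your-ssh-key-fingerprint>" # placeholder
```

You would then pass it explicitly with terraform apply -var-file=production.tfvars. Keep in mind that each environment also needs its own state (for example, a separate workspace created with terraform workspace new production, or a separate backend key). Otherwise, applying the production values would modify the staging resources tracked in the same state.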

Now let's look at some best practices when working with Terraform, and then we'll move on to configuring this infrastructure: installing dependencies, configuring our services, and more.

Terraform Best Practices

  1. Resource Naming
    • Use consistent naming conventions
    • Include environment in resource names
    • Use tags for better organization
  2. Security
    • Store sensitive values in secret management systems
    • Use least privilege access for service accounts
    • Implement network security groups and firewall rules
  3. State Management
    • Use remote state storage
    • Enable state locking
    • Implement state backup strategy
  4. Module Development
    • Create reusable modules for common patterns
    • Version your modules
    • Document module inputs and outputs
  5. Configuration
    • Use data sources when possible
    • Implement proper error handling and field validation
    • Use count and for_each for resource iteration
  6. Lifecycle Management
    • Implement proper destroy prevention
    • Use create_before_destroy when appropriate
    • Plan for zero-downtime updates
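
A couple of the practices above (resource iteration and destroy prevention) are easier to picture in code. The sketch below is not part of this project's configuration: the workers variable and both resources are hypothetical, used only to show for_each and the lifecycle block.

```hcl
# Hypothetical map of worker name => Droplet size slug.
variable "workers" {
  type = map(string)
  default = {
    "worker-a" = "s-1vcpu-1gb"
    "worker-b" = "s-1vcpu-2gb"
  }
}

# for_each creates one Droplet per map entry, addressed as
# digitalocean_droplet.worker["worker-a"], and so on.
resource "digitalocean_droplet" "worker" {
  for_each = var.workers

  name   = each.key
  size   = each.value
  image  = "ubuntu-22-04-x64"
  region = "nyc1"

  lifecycle {
    # Build the replacement before removing the old Droplet.
    create_before_destroy = true
  }
}

# prevent_destroy makes Terraform reject any plan that would destroy this resource.
resource "digitalocean_database_cluster" "protected" {
  name       = "protected-db"
  engine     = "pg"
  version    = "14"
  size       = "db-s-1vcpu-1gb"
  region     = "nyc1"
  node_count = 1

  lifecycle {
    prevent_destroy = true
  }
}
```

Compared with count, for_each keys each instance by name, so removing one entry from the map doesn't shift the others. Stateful resources such as databases are usually the ones worth guarding with prevent_destroy.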

Working with Terraform

Common commands:

# Initialize working directory
terraform init

# Plan changes
terraform plan -out=tfplan

# Apply changes
terraform apply tfplan

# Destroy infrastructure
terraform destroy

# Format configuration
terraform fmt

# Validate configuration
terraform validate

# Show current state
terraform show

Ansible's Core Concepts

Now let's look at Ansible, a powerful tool for configuring and automating your services and workflows!

Inventory

The inventory is Ansible's way of defining and organizing the hosts (servers) that it manages. It can be a simple static file listing IP addresses or hostnames, or a dynamic inventory script that pulls host information from cloud providers or other infrastructure systems. Inventories can group hosts together (e.g., "web_servers", "databases") and assign variables to specific hosts or groups, making it easier to manage different types of servers with different configurations.

Roles

Roles are reusable units of automation in Ansible that contain all the necessary tasks, variables, handlers, and files needed to configure a specific aspect of a system. They provide a way to organize complex automation tasks into modular components that can be easily shared and reused across different projects. For example, you might have a "docker" role that handles Docker installation and configuration, or an "nginx" role that sets up and configures the Nginx web server.

Role Organization

Each role follows a standardized structure:

roles/role_name/
├── defaults/     # Default variables (lowest precedence)
│   └── main.yml
├── vars/         # Role variables (higher precedence)
│   └── main.yml
├── tasks/        # Task definitions
│   └── main.yml
├── handlers/     # Event handlers
│   └── main.yml
├── templates/    # Jinja2 templates
│   └── config.j2
└── meta/         # Role metadata and dependencies
    └── main.yml

Group Variables

Group variables (stored in the group_vars directory) provide a way to set variables that apply to specific groups of hosts defined in your inventory. These variables can include configuration settings, credentials, or any other data needed for automation tasks. For example, you might have different database credentials for production and staging environments, stored in their respective group_vars files.

Playbooks

Playbooks are Ansible's configuration, deployment, and orchestration language. They are YAML files that describe the desired state of your systems and the steps needed to get there. A playbook contains one or more "plays," each defining a set of hosts to configure and the tasks to run on those hosts.

Tasks can be executed directly or organized into roles, and playbooks can include variables, handlers, and other Ansible features to create sophisticated automation workflows.

Ansible Config (ansible.cfg)

The ansible.cfg file is Ansible's configuration file that sets various default behaviors and settings for Ansible operations. It can specify default locations for roles, inventory files, and SSH keys, set privilege escalation settings, configure connection types and timeouts, and modify various other behavioral aspects of Ansible.

While Ansible will work without this file using built-in defaults, customizing ansible.cfg allows you to tailor Ansible's behavior to your specific needs and environment.

Ansible Implementation

Let's start to build our Ansible configuration! We'll look at each file individually, and then at how we can apply these configurations to our DigitalOcean droplets.

1. Application Role (roles/app)

  • roles/app/handlers/main.yml
- name: restart goapp
  systemd:
    name: goapp
    state: restarted
    daemon_reload: yes
  • roles/app/tasks/main.yml
- name: Install Go
  block:
    - name: Download Go
      get_url:
        url: "https://go.dev/dl/go{{ go_version }}.linux-amd64.tar.gz"
        dest: /tmp/go.tar.gz

    - name: Extract Go
      unarchive:
        src: /tmp/go.tar.gz
        dest: /usr/local
        remote_src: yes

    - name: Add Go to PATH
      copy:
        dest: /etc/profile.d/go.sh
        content: |
          export PATH=$PATH:/usr/local/go/bin
        mode: '0644'

- name: Install git
  apt:
    name: git
    state: present

- name: Create app group
  group:
    name: "{{ app_group }}"
    state: present
    system: yes

- name: Create app user
  user:
    name: "{{ app_user }}"
    group: "{{ app_group }}"
    system: yes
    create_home: yes
    shell: /bin/bash

- name: Create app directory
  file:
    path: "{{ app_dir }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_group }}"
    mode: '0755'

- name: Clone/update application repository
  git:
    repo: "https://github.com/jpereiramp/52-weeks-of-sre-backend.git"
    dest: "{{ app_dir }}"
    force: yes
  become: yes
  become_user: "{{ app_user }}"

- name: Install application dependencies
  shell: |
    . /etc/profile.d/go.sh
    go mod download
  args:
    chdir: "{{ app_dir }}"
  become: yes
  become_user: "{{ app_user }}"

- name: Run database migrations
  shell: |
    . /etc/profile.d/go.sh
    go run main.go migrate
  args:
    chdir: "{{ app_dir }}"
  become: yes
  become_user: "{{ app_user }}"
  environment: "{{ app_env }}"
  register: migration_output

- name: Create systemd service
  template:
    src: app.service.j2
    dest: /etc/systemd/system/goapp.service
    mode: '0644'
  notify: restart goapp

- name: Enable and start goapp service
  systemd:
    name: goapp
    state: started
    enabled: yes
    daemon_reload: yes
  • roles/app/templates/app.service.j2
[Unit]
Description=Go Application
After=network.target

[Service]
Type=simple
User={{ app_user }}
Group={{ app_group }}
WorkingDirectory={{ app_dir }}
Environment=PATH=/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Environment=GOPATH=/home/{{ app_user }}/go
{% for key, value in app_env.items() %}
Environment={{ key }}={{ value }}
{% endfor %}

ExecStart=/usr/local/go/bin/go run main.go serve
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

2. Monitoring Role (roles/monitoring)

  • roles/monitoring/handlers/main.yml
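# "restart docker" is notified by the Install Docker block in tasks/main.yml
- name: restart docker
  service:
    name: docker
    state: restarted

- name: restart monitoring stack
  command: docker compose restart
  args:
    chdir: /opt/monitoring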
  • roles/monitoring/tasks/main.yml
---
- name: Install Python dependencies
  apt:
    name:
      - python3-pip
      - python3-setuptools
    state: present

- name: Install Docker
  block:
    - name: Add Docker GPG key
      apt_key:
        url: https://download.docker.com/linux/ubuntu/gpg
        state: present

    - name: Add Docker repository
      apt_repository:
        repo: deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable
        state: present

    - name: Install Docker packages
      apt:
        name:
          - docker-ce
          - docker-ce-cli
          - containerd.io
          - docker-compose-plugin  # provides the `docker compose` command used by later tasks
        state: present

    - name: Install Docker Python package
      pip:
        name:
          - docker==6.1.3
          - urllib3<2.0
        state: present

    - name: Ensure Docker service is running
      service:
        name: docker
        state: started
        enabled: yes

    - name: Add users to docker group
      user:
        name: "{{ item }}"
        groups: docker
        append: yes
      loop: "{{ docker_users }}"
  notify: restart docker

- name: Create monitoring directories
  file:
    path: "{{ item }}"
    state: directory
    mode: '0755'
  loop:
    - /opt/monitoring
    - /opt/monitoring/prometheus
    - /opt/monitoring/prometheus/rules
    - /opt/monitoring/grafana
    - /opt/monitoring/grafana/dashboards
    - /opt/monitoring/grafana/alerts
    - /opt/monitoring/grafana/provisioning/dashboards
    - /opt/monitoring/grafana/provisioning/datasources
    - /opt/monitoring/grafana/provisioning/alerting

- name: Copy Prometheus rules
  copy:
    src: "{{ playbook_dir }}/../config/prometheus/login_slo_rules.yaml"
    dest: "/opt/monitoring/prometheus/rules/login_slo_rules.yaml"
    mode: '0644'
  notify: restart monitoring stack

- name: Configure Prometheus
  template:
    src: prometheus.yml.j2
    dest: /opt/monitoring/prometheus/prometheus.yml
    mode: '0644'
  notify: restart monitoring stack

- name: Copy Grafana dashboards
  copy:
    src: "{{ playbook_dir }}/../config/grafana/dashboards/"
    dest: "/opt/monitoring/grafana/dashboards/"
    mode: '0644'

- name: Copy Grafana alert rules
  copy:
    src: "{{ playbook_dir }}/../config/grafana/alerts/login_errors_alert.json"
    dest: "/opt/monitoring/grafana/provisioning/alerting/rules.json"
    mode: '0644'

- name: Configure Grafana dashboard provisioning
  copy:
    content: |
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          updateIntervalSeconds: 10
          allowUiUpdates: true
          options:
            path: /var/lib/grafana/dashboards
    dest: /opt/monitoring/grafana/provisioning/dashboards/default.yaml
    mode: '0644'

- name: Configure Grafana datasource
  template:
    src: datasource.yml.j2
    dest: /opt/monitoring/grafana/provisioning/datasources/datasource.yml
    mode: '0644'

- name: Deploy Docker Compose file
  template:
    src: docker-compose.yml.j2
    dest: /opt/monitoring/docker-compose.yml
    mode: '0644'

- name: Ensure Docker socket has correct permissions
  file:
    path: /var/run/docker.sock
    mode: '0666'

- name: Clean existing containers
  shell: |
    if [ -f docker-compose.yml ]; then
      docker compose down -v
    fi
  args:
    chdir: /opt/monitoring

- name: Start monitoring stack
  command: docker compose up -d
  args:
    chdir: /opt/monitoring
  register: compose_output
  changed_when: compose_output.stdout != ""
  • roles/monitoring/templates/datasource.yml.j2
apiVersion: 1

datasources:
  - name: Prometheus
    uid: prometheus-default
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  • roles/monitoring/templates/docker-compose.yml.j2
version: '3.7'

services:
    prometheus:
      image: prom/prometheus:latest
      volumes:
        - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
        - ./prometheus/rules:/etc/prometheus/rules:ro
        - prometheus_data:/prometheus
      command:
        - '--config.file=/etc/prometheus/prometheus.yml'
        - '--storage.tsdb.retention.time={{ prometheus_retention_time }}'
        - '--web.enable-lifecycle'
      ports:
        - "9090:9090"
      restart: unless-stopped
      user: "65534:65534"  # nobody:nogroup

    grafana:
      image: grafana/grafana:latest
      volumes:
        - ./grafana/provisioning:/etc/grafana/provisioning:ro
        - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
        - grafana_data:/var/lib/grafana
      environment:
        - GF_SECURITY_ADMIN_PASSWORD={{ grafana_admin_password }}
        - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/golden_signals_dashboard.json
      ports:
        - "3000:3000"
      depends_on:
        - prometheus
      restart: unless-stopped
      user: "472:472"  # grafana:grafana

volumes:
    prometheus_data:
        driver: local
    grafana_data:
        driver: local
  • roles/monitoring/templates/prometheus.yml.j2
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yaml"

scrape_configs:
  - job_name: 'go-service'
    static_configs:
      - targets: ['{{ hostvars["app-server"]["ansible_host"] }}:8080']
    metrics_path: '/metrics'

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

3. Common Role (roles/common)

  • roles/common/tasks/main.yml
---
- name: Update apt cache
  apt:
    update_cache: yes
    cache_valid_time: 3600

- name: Install common packages
  apt:
    name:
      - apt-transport-https
      - ca-certificates
      - curl
      - software-properties-common
      - python3-pip
      - python3-setuptools
      - python3-wheel
    state: present

4. Group Vars

  • group_vars/all/vars.yml
# System Configuration
timezone: UTC

# SSH Configuration
ssh_port: 22
  • group_vars/all/vault.yml (Encrypted)
# SSL Certificates
ssl_private_key: |
  -----BEGIN OPENSSH PRIVATE KEY-----
  YOUR-KEY-HERE
  -----END OPENSSH PRIVATE KEY-----
  • group_vars/app/vars.yml
go_version: "1.23.3"
app_user: goapp
app_group: goapp
app_dir: /opt/goapp
app_env:
  GIN_MODE: "release"
  PORT: "8080"
  DB_DSN: "postgres://{{ db_user }}:{{ db_password }}@{{ db_host }}:{{ db_port }}/{{ db_name }}"
  • group_vars/app/vault.yml
# Database Credentials
db_user: doadmin
db_password: supersecret123
  • group_vars/monitoring/vars.yml
prometheus_retention_time: "15d"
docker_users:
  - root
  • group_vars/monitoring/vault.yml
grafana_admin_password: changeme

5. Inventory

  • inventory/staging/hosts.yml
---
all:
  children:
    app:
      hosts:
        app-server:
          ansible_host: "{{ app_server_ip }}"
          ansible_user: root  # Explicitly set the user
          ansible_ssh_private_key_file: "~/.ssh/id_ed25519"
    monitoring:
      hosts:
        monitoring-server:
          ansible_host: "{{ monitoring_server_ip }}"
          ansible_user: root  # Explicitly set the user
          ansible_ssh_private_key_file: "~/.ssh/id_ed25519"

6. Core Files

  • site.yml
- name: Configure common settings for all servers
  hosts: all
  become: true
  roles:
    - common

- name: Configure Go application server
  hosts: app
  become: true
  roles:
    - app

- name: Configure monitoring server
  hosts: monitoring
  become: true
  roles:
    - monitoring
  • requirements.yml (External Dependencies)
---
collections:
  - name: community.docker
    version: "3.4.0"
  - name: community.general
    version: "7.0.0"
  • ansible.cfg
[defaults]
inventory = inventory/staging/hosts.yml
roles_path = roles
host_key_checking = False
remote_user = root
private_key_file = ~/.ssh/id_ed25519

# Performance tuning
forks = 20
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache
fact_caching_timeout = 7200

# Logging
log_path = logs/ansible.log
# callbacks_enabled replaces the older callback_whitelist setting
callbacks_enabled = profile_tasks

[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = False

[ssh_connection]
pipelining = True
control_path = /tmp/ansible-ssh-%%h-%%p-%%r

Usage Examples

# Install dependencies
ansible-galaxy install -r requirements.yml

# Run playbook (staging)
ansible-playbook -i inventory/staging site.yml

# Run specific roles (requires matching tags on the roles or tasks; the roles above define none)
ansible-playbook -i inventory/staging site.yml --tags "docker,app"

# Validate syntax
ansible-playbook -i inventory/staging site.yml --syntax-check

# Dry-run
ansible-playbook -i inventory/staging site.yml --check

Best Practices

  1. Inventory Management
    • Use separate inventory files for different environments
    • Group hosts logically
    • Use group_vars for environment-specific variables
  2. Role Development
    • Keep roles focused and single-purpose
    • Use meaningful tags for selective execution
    • Implement proper handlers for service management
    • Use defaults/main.yml for configurable variables
    • Document role variables and dependencies
  3. Security
    • Use ansible-vault for encrypting sensitive data
    • Implement proper file permissions
    • Use least privilege principle
    • Regularly update dependencies
  4. Task Design
    • Make tasks idempotent
    • Use meaningful names for tasks
    • Include proper error handling
    • Use handlers for service management
    • Implement proper checks and validations
  5. Variable Management
    • Use meaningful variable names
    • Define default values
    • Document variable purposes
    • Use proper variable precedence
  6. Configuration Templates
    • Use Jinja2 templates for configuration files
    • Include proper error handling
    • Document template variables
    • Use consistent formatting

Additional Considerations

  1. Version Control
    • Keep all Ansible files in version control
    • Use .gitignore for sensitive files
    • Tag releases for different versions
  2. Secret Management
    • Use ansible-vault for sensitive data
    • Consider external secret management systems
    • Rotate secrets regularly
  3. Documentation
    • Document role requirements
    • Keep README files updated
    • Document variable precedence
  4. Testing
    • Use Molecule for role testing
    • Implement CI/CD pipelines
    • Test in staging before production
  5. Backup Strategy
    • Backup Ansible configuration
    • Backup inventory data
    • Document recovery procedures

Deploying our Infrastructure

Now that we have all of our infrastructure defined as code, and its initial configuration all set up, let's work on our final step: deploying it all to DigitalOcean. We'll work on a staging environment, and in upcoming posts we'll create a new, production-ready environment.

As always, the full source code we worked on throughout this article is available at the "52 Weeks of SRE" Repository.

Add project files for Week 5 - Infrastructure as Code by jpereiramp · Pull Request #3 · jpereiramp/52-weeks-of-sre-backend

Before you can deploy your infrastructure, you'll need to create a DigitalOcean account, and then set up an SSH key as well as a Personal Access Token (PAT). DigitalOcean's documentation has how-to guides for both.

  1. Configure required Variables
    • Create a terraform/terraform.tfvars file with your DigitalOcean credentials
environment         = "staging"
do_token            = "dop_v1_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" # Your DigitalOcean PAT
ssh_key_fingerprint = "<your-ssh-key-fingerprint>"                # Placeholder: this variable has no default
    • Create an ansible/.vault_pass file with a secret password for encrypting Ansible's sensitive data
    • Update the Terraform variables with your desired values (e.g. region, Droplet size)
  2. Install Ansible Dependencies
# Run from the repository root
ansible-galaxy install -r ansible/requirements.yml
  3. Initialize Terraform
# terraform.tfvars is picked up automatically by plan and apply, so init needs no -var-file flag
cd terraform
terraform init
  4. Deploy Infrastructure
# Deploy with Terraform
terraform apply

# Export variables for Ansible
export APP_HOST=$(terraform output -raw app_host_ip)
export MONITORING_HOST=$(terraform output -raw monitoring_host_ip)
export DB_HOST=$(terraform output -raw database_host)
export DB_PORT=$(terraform output -raw database_port)
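  5. Run Ansible Playbook
cd ../ansible

# Run Playbook (the vault password file decrypts the encrypted group_vars files)
ansible-playbook site.yml \
  --vault-password-file .vault_pass \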
  -e "app_server_ip=$APP_HOST" \
  -e "monitoring_server_ip=$MONITORING_HOST" \
  -e "db_host=$DB_HOST" \
  -e "db_port=$DB_PORT" \
  -e "db_name=defaultdb"

Testing your Setup

  1. Access Grafana
    • Navigate to http://<monitoring-droplet-ip>:3000
    • Log in with admin as the username and your configured password
  2. Access Prometheus
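    • The firewall only allows port 9090 from inside the VPC, so open an SSH tunnel first (ssh -L 9090:localhost:9090 root@<monitoring-droplet-ip>) and navigate to http://localhost:9090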
    • Verify that targets are being scraped successfully
  3. Verify Application Metrics
    • Check that the http://<app-droplet-ip>:8080/metrics endpoint is exposed by your Go application
    • Verify Prometheus is successfully scraping the endpoint
    • Import recommended Go application dashboards in Grafana

Maintenance and Updates

  1. Monitoring Updates
    • Regular updates to Prometheus and Grafana configurations
    • Dashboard improvements based on needs
    • Alert rule refinements

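  2. Application Updates

# Update application configuration
# (assumes an "update" tag is defined on the relevant tasks; the roles above define no tags)
cd ansible
ansible-playbook -i inventory/staging/hosts.yml site.yml --tags update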

  3. Infrastructure Updates

# Update infrastructure
cd terraform
terraform plan  # Review changes
terraform apply # Apply changes

Best Practices

  1. Security
    • Use environment variables for sensitive data
    • Implement proper firewall rules
    • Regular security updates
    • Restrict access to monitoring endpoints
  2. Backup
    • Regular database backups (handled by DigitalOcean)
    • Backup Grafana dashboards
    • Export and version control monitoring configurations
  3. Monitoring
    • Set up alerting rules
    • Regular review of metrics
    • Dashboard optimization
    • Capacity planning based on metrics
  4. Scaling
    • Monitor resource usage
    • Plan for horizontal scaling
    • Regular performance optimization

Subscribe to ensure you don't miss next week's deep dive into the basics of Linux for SRE. Your journey to mastering SRE continues! 🎯

