07 The Evolution of Automation at Google

Recap

The key themes are the importance of automation in SRE, the principles behind effective automation, the different classes of automation, and real-world examples like the Ads Database and Borg. The chapter also discusses the challenges and lessons learned from automation, such as the need for idempotency and the risks of automation failing at scale.

The chapter also touches on the balance between automation and human judgment, emphasizing that automation isn't a silver bullet. The Diskerase incident is a critical lesson in how automation can fail at scale when safeguards are missing.

The chapter starts with the value of automation, then moves through use cases, the hierarchy of automation classes, case studies, challenges, and recommendations. The key examples all reinforce the overarching message: automation requires thoughtful implementation and continuous improvement.

Definitions

  1. The Role of Automation in SRE: Automation is a force multiplier for SRE, enhancing consistency, scalability, and efficiency while reducing human error and toil. It is not a panacea but a tool requiring thoughtful application to avoid introducing new risks.

  2. Core Value Propositions of Automation:

    • Consistency: Eliminates variability in repetitive tasks (e.g., user account creation, backups).
    • Platformization: Centralizes processes, enabling extensibility, metrics, and reusable solutions.
    • Faster Repairs (MTTR): Automated fault resolution reduces downtime for common issues.
    • Speed: Machines outperform humans in executing tasks like failovers or deployments.
    • Time Savings: Decouples operations from specific engineers, democratizing task execution.
  3. Hierarchy of Automation Classes: Automation evolves through stages:

    1. Manual processes (e.g., manual database failover).
    2. Externally maintained scripts (e.g., SRE-owned failover scripts).
    3. Generic automation tools (e.g., shared failover frameworks; a stage-3 sketch follows this list).
    4. Internally maintained automation (e.g., self-contained system-specific tools).
    5. Autonomous systems (e.g., self-healing databases requiring no human intervention).
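
To make stage 3 concrete, the sketch below shows how a generic, team-agnostic tool might be structured: the shared driver owns the failover logic, and each system supplies only its own hooks. The `FailoverTarget` and `run_failover` names are illustrative assumptions, not actual Google interfaces.

```python
from abc import ABC, abstractmethod


class FailoverTarget(ABC):
    """Hooks a team implements so the shared tool can drive its system."""

    @abstractmethod
    def is_healthy(self) -> bool: ...

    @abstractmethod
    def promote_replica(self) -> None: ...

    @abstractmethod
    def redirect_traffic(self) -> None: ...


def run_failover(target: FailoverTarget) -> bool:
    """Generic failover driver shared across teams (stage 3).

    The logic lives in one place; each system only supplies the hooks.
    Returns True if a failover was actually performed.
    """
    if target.is_healthy():
        return False  # Nothing to do; safe to call repeatedly.
    target.promote_replica()
    target.redirect_traffic()
    return True
```

The payoff of this stage is that any improvement to the shared driver reaches every team at once, which is exactly the platformization benefit listed above.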
  4. Case Studies:

    • Ads Database (MySQL on Borg): Automated failover via Decider reduced failover time from 30–90 minutes to under 30 seconds and cut operational toil by 95%. This freed SREs to focus on higher-value tasks and hardware optimization (a failover-loop sketch follows this list).
    • Cluster Turnup Automation: Early brittle scripts evolved into Prodtest (a testing framework) and later Admin Servers with RPC-based APIs. This shifted cluster turnup from a manual, error-prone process to a service-oriented, idempotent workflow.
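
Decider itself is internal to Google, so the following is only a hedged sketch of what an automated failover loop of this kind might look like. Every function name, threshold, and interval here is an assumption for illustration, not Decider's real design.

```python
import time


def master_is_healthy(shard: str) -> bool:
    """Probe the shard's current master (placeholder for a real check)."""
    # A real probe would connect to the master and verify reads/writes.
    return True


def healthiest_replica(shard: str) -> str:
    """Pick the most caught-up replica to promote (placeholder)."""
    return f"{shard}-replica-0"


def promote(replica: str) -> None:
    """Repoint the shard's clients at the new master (placeholder)."""
    print(f"promoting {replica}")


def watch_shard(shard: str, probe_interval_s: float = 5.0) -> None:
    """Automated failover loop: detect a dead master, promote a replica.

    The point is to replace a 30-90 minute manual runbook with a
    response measured in seconds.
    """
    failed_probes = 0
    while True:
        if master_is_healthy(shard):
            failed_probes = 0
        else:
            failed_probes += 1
            # Require consecutive failures so a single flaky probe
            # does not trigger an unnecessary failover.
            if failed_probes >= 3:
                promote(healthiest_replica(shard))
                failed_probes = 0
        time.sleep(probe_interval_s)
```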
  5. Challenges and Lessons:

    • Brittle Automation: Early scripts (e.g., shell scripts) became technical debt.
    • Idempotency: Fixes must be safe to run repeatedly (e.g., Prodtest’s test-fix-retest loops; a minimal sketch follows this list).
    • Security and Ownership: Transitioning to ACL-driven RPCs (Admin Servers) reduced risks and enforced accountability.
    • Failure at Scale: The Diskerase incident highlighted the need for sanity checks and rate limiting in automation (sketched after the recommendations below).
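
The idempotency point is worth spelling out as a minimal sketch of the test-fix-retest pattern. The `reconcile` helper and the DNS example below are assumptions for illustration, not Prodtest's actual API.

```python
from typing import Callable


def reconcile(check: Callable[[], bool], fix: Callable[[], None]) -> None:
    """One idempotent test-fix-retest step, in the spirit of Prodtest.

    The fix runs only when the check fails, and the check runs again
    afterwards, so the step is safe to execute any number of times.
    """
    if check():
        return  # Already in the desired state; skip the fix entirely.
    fix()
    if not check():
        # A fix that does not converge should stop the automation
        # and page a human rather than retry in a loop.
        raise RuntimeError("fix applied but check still fails")


# Hypothetical step: ensure a new cluster has its DNS entry.
cluster_dns: dict[str, str] = {}


def dns_entry_exists() -> bool:
    return "cluster-xy" in cluster_dns


def create_dns_entry() -> None:
    cluster_dns["cluster-xy"] = "10.0.0.1"


reconcile(dns_entry_exists, create_dns_entry)  # First run applies the fix.
reconcile(dns_entry_exists, create_dns_entry)  # Second run is a no-op.
```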
  6. Reliability and Autonomous Systems: Automation must prioritize reliability. Autonomous systems (e.g., Borg) treat clusters as a "warehouse-scale computer," abstracting physical machines into resource pools. APIs and distributed system principles enable self-healing and resilience.

  7. Recommendations:

    • Design systems for automation from the start (APIs, decoupled components).
    • Balance automation with human oversight to avoid complacency.
    • Regularly audit and refine automation to handle edge cases.
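
A sketch of the safeguards the Diskerase incident argued for (a sanity check on blast radius plus a rate limit on destructive actions) might look like the following. The names, thresholds, and fleet size are all illustrative assumptions.

```python
import time
from typing import Callable, Sequence


def guarded_run(action: Callable[[str], None], targets: Sequence[str],
                fleet_size: int, max_fraction: float = 0.05,
                delay_s: float = 60.0) -> None:
    """Run a destructive action with sanity checks and rate limiting.

    Two safeguards of the kind the Diskerase incident showed to be
    essential:
      * refuse an implausibly large target list (an empty filter that
        matches the whole fleet should fail loudly, not run), and
      * pace the action so damage from a bad run stays small and visible.
    """
    if not targets:
        raise ValueError("empty target list; refusing to guess intent")
    if len(targets) > fleet_size * max_fraction:
        raise ValueError(
            f"refusing to touch {len(targets)} machines "
            f"(over {max_fraction:.0%} of the fleet); get human sign-off")
    for machine in targets:
        action(machine)
        time.sleep(delay_s)  # Rate limit: one machine per interval.
```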

Conclusion: Automation is essential for managing large-scale systems but requires careful design, ownership, and iteration. Google’s journey—from manual processes to autonomous systems—demonstrates that automation’s true value lies in enabling reliability, scalability, and engineer productivity. However, it must be paired with rigorous testing, observability, and a culture of continuous improvement to avoid pitfalls.

Extended Definitions

Continuous Deployment and Continuous Delivery

1. Continuous Delivery (CD)

  • Definition: Ensures code changes are always in a deployable state and can be released to production at any time, but requires manual approval for the final deployment step.
  • Key Features:
    • Automated testing, integration, and staging.
    • Code is production-ready but deployed manually (e.g., via a button click).
    • Focuses on risk mitigation by allowing human oversight for business or regulatory reasons.
  • Example from SRE Book (Chapter 7):
    • A canary deployment where changes are rolled out to 5% of users, monitored via SLOs, and manually approved for full rollout.

2. Continuous Deployment (CD)

  • Definition: Automates the entire pipeline, including the final deployment to production, without human intervention.
  • Key Features:
    • Every code change that passes automated tests is immediately deployed to production.
    • Requires rigorous automation, observability, and SLO-driven safeguards (e.g., auto-rollback if SLOs are violated; a sketch follows this list).
  • Example from SRE Book (Chapter 7):
    • Automated failover systems like Decider (for MySQL on Borg) or self-healing clusters in Borg, where no human action is needed for deployment or recovery.
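
One way to picture the "no human action" property is a post-deployment watchdog that rolls back automatically when the error budget runs out. This is a hedged sketch: `error_budget_remaining` and the polling numbers are assumptions, not a real monitoring API.

```python
import time
from typing import Callable


def error_budget_remaining() -> float:
    """Fraction of the error budget left (placeholder).

    A real implementation would query the monitoring system for SLI
    data over a recent window and compare it against the SLO.
    """
    return 0.8


def post_deploy_watch(rollback: Callable[[], None],
                      poll_s: float = 30.0, checks: int = 20) -> bool:
    """Watch a fresh deployment; roll back automatically on SLO risk.

    Returns True if the deployment survived the watch window.
    """
    for _ in range(checks):
        if error_budget_remaining() <= 0.0:
            rollback()  # No human in the loop: the defining CD property.
            return False
        time.sleep(poll_s)
    return True
```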

Key Differences

  • Final Deployment Step:
    • Continuous Delivery: Manual approval required.
    • Continuous Deployment: Fully automated (no human intervention).
  • Risk Tolerance:
    • Continuous Delivery: Lower risk (humans validate critical changes).
    • Continuous Deployment: Higher trust in automation and testing.
  • Use Case:
    • Continuous Delivery: Regulated industries, complex systems.
    • Continuous Deployment: High-velocity environments (e.g., SaaS apps).
  • SRE Alignment:
    • Continuous Delivery: Aligns with error budget policies (Chapter 3); deployments pause if SLOs are breached.
    • Continuous Deployment: Requires autonomous systems (Chapter 7) with self-healing and auto-rollback.

Why It Matters in SRE

  • Continuous Delivery balances automation with human judgment, aligning with SRE principles like error budgets (Chapter 3). Teams can halt deployments if reliability targets (SLOs) are at risk.
  • Continuous Deployment reflects Google’s "warehouse-scale computer" philosophy (Chapter 7), where systems like Borg automate infrastructure management, treating clusters as APIs and enabling autonomous operations.

Example Workflow (SRE Book):

  1. Code commit → Automated tests (CI).
  2. Artifact built → Canary deployment (5% traffic).
  3. Delivery: Manual approval for full rollout.
  4. Deployment: Auto-deploy to 100% if SLOs are met.
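
The fork between steps 3 and 4 can be captured in a few lines. The sketch below is an illustration built on assumptions: the `CanaryResult` type and the 5% canary convention come from the workflow above, not from any real pipeline API.

```python
from dataclasses import dataclass


@dataclass
class CanaryResult:
    error_rate: float      # Observed on the 5% canary slice.
    slo_error_rate: float  # Maximum the SLO allows.


def release_decision(result: CanaryResult, continuous_deployment: bool,
                     human_approved: bool = False) -> str:
    """Decide a release's fate after the canary stage.

    The only difference between the two models is who flips the final
    switch: a human (delivery) or the pipeline itself (deployment).
    """
    if result.error_rate > result.slo_error_rate:
        return "rollback"  # Both models stop on an SLO breach.
    if continuous_deployment:
        return "deploy to 100%"  # Step 4: no human intervention.
    return ("deploy to 100%" if human_approved
            else "await manual approval")  # Step 3: the production gate.
```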

In summary, Delivery stops at the production gate, while Deployment pushes through it automatically. Both rely on automation and observability but differ in their tolerance for human intervention.