AKS Cluster Upgrade - From Manual to Automation

Sahana JC
Posted on March 20, 2023

Problem Statement

Azure Kubernetes Service (AKS) is a managed Kubernetes offering, but keeping clusters on a current Kubernetes version remains a shared responsibility. Our end-to-end DevOps process requires staying on supported versions, applying the latest security updates, and moving off deprecated versions on time.

Upgrading an AKS cluster manually from the Azure portal carries the risk of serving downtime, involves slow traffic shifts, and puts pressure on the team to finish the process on time.

The auto-upgrade feature can keep clusters up to date so that they do not miss the latest AKS features or patches from AKS and upstream Kubernetes, but it requires a maintenance window. Its cordon, drain, and restart strategy does not match our use case, since most of our online serving traffic is handled by spot node pools.


The Solution

As a middle ground, we automate the manual AKS cluster upgrade with Terraform.


The Earlier Approach

Our online traffic in each Azure region is handled by 3 AKS clusters, each of which takes about 33% of the traffic. During an upgrade cycle, we shifted traffic from the target cluster to the remaining 2 clusters and brought the target cluster down. We then recreated the target cluster with reimaged nodes and, after validation, slowly moved traffic back to it. The critical step was the movement of traffic.
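To make the cost of that traffic movement concrete, here is a minimal sketch of the arithmetic, assuming the drained cluster's share is split evenly across the other clusters. The cluster names and the `drain_cluster` helper are hypothetical and only illustrate the calculation.

```python
from fractions import Fraction


def drain_cluster(weights, target):
    """Return new weights with `target` at 0 and its share split evenly among the rest."""
    others = [name for name in weights if name != target]
    if not others:
        raise ValueError("cannot drain the only cluster")
    extra = weights[target] / len(others)
    return {
        name: Fraction(0) if name == target else share + extra
        for name, share in weights.items()
    }


# Three clusters, each serving one third of the regional traffic (names are made up).
weights = {name: Fraction(1, 3) for name in ("aks-a", "aks-b", "aks-c")}
drained = drain_cluster(weights, "aks-a")

for name, share in drained.items():
    print(f"{name}: {float(share):.0%}")
```

Under this assumption, each remaining cluster goes from about 33% to 50% of the traffic for the duration of the upgrade, and the same shift has to be reversed afterwards.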


Our New Approach

We spin up a new cluster on the latest AKS version in the required region using automation. This also involves deploying the other required components, such as Prometheus. We then shift 1% of the traffic to the new cluster, monitor and validate it, and finally move the entire traffic over.
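The shift-then-validate sequence can be sketched as a small gate loop. This is an illustration only: `set_weight` and `is_healthy` are hypothetical callbacks standing in for whatever traffic manager and monitoring checks are actually used, and the rollback to 0% on a failed check is an assumption, not something the process above prescribes.

```python
def shift_traffic(steps, set_weight, is_healthy):
    """Apply each traffic percentage in order; roll back to 0 if validation fails."""
    for percent in steps:
        set_weight(percent)
        if not is_healthy():
            set_weight(0)  # assumed rollback: send no traffic to the new cluster
            return False
    return True


# Healthy run: 1% canary, then the entire traffic.
history = []
ok = shift_traffic([1, 100], set_weight=history.append, is_healthy=lambda: True)
print(ok, history)  # True [1, 100]

# Failed validation at the canary step: traffic is pulled back.
failed_history = []
failed = shift_traffic([1, 100], set_weight=failed_history.append, is_healthy=lambda: False)
print(failed, failed_history)  # False [1, 0]
```

Because the old clusters keep serving until the final step, a failed canary only ever affects the small first slice of traffic.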


Templatize Infra as Code (IaC)

The AKS cluster is built and managed via Terraform.

The upgrade strategy is to create a new cluster with as much automation as possible.

A template Terraform script is created for each region.

infrastructure/terraform/microservices/-template/
- terraform.tfvars
- .terraform-version
- iac.json
- main.tf
- _.tf
- variables.tf

An interactive shell script prompts the user for input.

It performs a find-and-replace on the region-specific template and then creates a pull request that adds the new AKS cluster.
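The find-and-replace step could look roughly like the sketch below. It is not the actual script: the `__REGION__` placeholder token, the directory names, and the `render_template` helper are all hypothetical, and the pull request step is left out. The demo runs in a temporary directory so it does not touch a real repository.

```python
import shutil
import tempfile
from pathlib import Path


def render_template(template_dir: Path, target_dir: Path, replacements: dict[str, str]) -> list[Path]:
    """Copy the template folder and substitute placeholder tokens in every file."""
    shutil.copytree(template_dir, target_dir)
    changed = []
    for path in sorted(target_dir.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text()  # template files are assumed to be plain text
        new_text = text
        for old, new in replacements.items():
            new_text = new_text.replace(old, new)
        if new_text != text:
            path.write_text(new_text)
            changed.append(path)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    template = root / "example-template"  # hypothetical template folder
    template.mkdir()
    (template / "terraform.tfvars").write_text(
        'region = "__REGION__"\ncluster_name = "aks-__REGION__-new"\n'
    )

    target = root / "aks-region2-new"  # hypothetical folder for the new cluster
    changed = render_template(template, target, {"__REGION__": "region2"})
    rendered = (target / "terraform.tfvars").read_text()
    print(rendered)
```

Keeping the substitutions in a single dictionary makes it easy to see which values come from the user prompt and which stay fixed in the template.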

$ sh ./create-new-aks-cluster.bash
Enter the target region among region1, region2, region3:
region2
Login to https://github.com// and create pull request named <$new_branch>.
Navigate to infrastructure/terraform/microservices/${new_AKS_cluster}/ folder, Add a commit to update subnet, dns_service_ip and service_cidr

In that folder, we update the subnet, service_cidr, DNS service IP, and other variables, and review the resource definitions.

Our GitHub and Terraform integration then lets us provision the resources.
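Since these values are edited by hand, a quick sanity check before opening the pull request can catch common mistakes. The sketch below encodes rules as we understand them from Azure's networking guidance: the service CIDR should not overlap the cluster subnet, and the DNS service IP should sit inside the service CIDR but not on its first address. The `check_network_settings` helper and the sample addresses are illustrative assumptions, not values from our clusters.

```python
import ipaddress


def check_network_settings(subnet, service_cidr, dns_service_ip):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    subnet_net = ipaddress.ip_network(subnet)
    service_net = ipaddress.ip_network(service_cidr)
    dns_ip = ipaddress.ip_address(dns_service_ip)

    if subnet_net.overlaps(service_net):
        problems.append(f"service_cidr {service_net} overlaps subnet {subnet_net}")
    if dns_ip not in service_net:
        problems.append(f"dns_service_ip {dns_ip} is outside service_cidr {service_net}")
    elif dns_ip == service_net.network_address + 1:
        problems.append(f"dns_service_ip {dns_ip} is the first address of service_cidr")
    return problems


# Illustrative values only.
good = check_network_settings("10.240.0.0/16", "10.0.0.0/16", "10.0.0.10")
bad = check_network_settings("10.240.0.0/16", "10.240.0.0/24", "10.0.0.10")

print(good)
for problem in bad:
    print(problem)
```

The first call returns no problems, while the second reports both the overlap and the out-of-range DNS service IP.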


Conclusion

The new approach is beneficial in the following ways:

  • No pressure on the duration of the activity and no need for a maintenance window

  • Less effort in shifting traffic, since we shift it only once instead of out and back

  • Minimal or zero downtime

  • Upgrade time reduced from days to hours

The only downside is minor: it requires some additional code cleanup.


Key Takeaways

Every component of the cloud infrastructure should be provisioned by code (IaC). This enables automation wherever possible and keeps management of the infrastructure in one place. Big win!