With Azure Kubernetes Service (AKS) as a managed Kubernetes offering, periodic upgrades to the latest Kubernetes version are a shared responsibility. The end-to-end DevOps process requires staying on supported versions, applying the latest security updates, and moving away from deprecated versions on time.
Upgrading an AKS cluster manually from the Azure portal carries the risk of serving downtime, slow traffic shifts, and pressure to complete the process on time.
Although the auto-upgrade feature can ensure that clusters always stay updated and do not miss the latest features or patches from AKS and upstream Kubernetes, it requires a maintenance window. Cordon, drain, and restart strategies do not match our use case, since most of our online serving traffic is handled by spot node pools.
As a middle ground, we automate the manual AKS cluster upgrades with Terraform.
Our online traffic in each Azure region is handled by three AKS clusters, each of which serves about 33% of the traffic. During a cluster upgrade cycle, we shift traffic from the target cluster to the remaining two AKS clusters and bring down the target cluster. We recreate the target cluster with reimaged nodes and, after validation, slowly move the traffic back to it. The critical step here is the movement of traffic.
We spin up a new cluster with the latest AKS version in the required region using automation. This also involves deploying the other required components, such as Prometheus. We then shift 1% of the traffic to the new cluster, monitor and validate, and finally move the entire traffic.
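The shift sequence described above (1% first, validate, then everything) can be sketched as a small gated loop. This is a minimal illustration, not our production tooling: `set_new_cluster_weight` and `validate_new_cluster` are hypothetical hooks standing in for whatever traffic manager and monitoring checks are in use, and the rollback to 0% on failure is an assumption.

```shell
#!/usr/bin/env bash
# Sketch of a gated traffic shift to a newly created cluster.
# Both hooks below are hypothetical placeholders, not real APIs.

set_new_cluster_weight() {
  # Placeholder: a real version would update load balancer or DNS weights.
  echo "new cluster weight set to $1%"
}

validate_new_cluster() {
  # Placeholder: a real version would check dashboards, alerts, or probes.
  return 0
}

staged_shift() {
  local pct
  # Stages taken from the text: 1% first, then the entire traffic.
  for pct in 1 100; do
    set_new_cluster_weight "$pct"
    if ! validate_new_cluster; then
      # Assumed rollback behaviour: send everything back to the old cluster.
      set_new_cluster_weight 0
      return 1
    fi
  done
}
```

Calling `staged_shift` runs both stages in order and returns a non-zero status if validation fails at any stage, so a wrapper script can stop before the old cluster is removed.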
The AKS cluster is built and managed via Terraform.
The upgrade strategy involves creating a new cluster with as much automation as possible.
A Terraform template is created for each region.
infrastructure/terraform/microservices/
- terraform.tfvars
- .terraform-version
- iac.json
- main.tf
-
- variables.tf
An interactive shell script prompts the user for input.
It performs a find-and-replace on the region-specific template and then creates a pull request to create the new AKS cluster.
$ sh ./create-new-aks-cluster.bash
Enter the target region among region1, region2, region3:
region2
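The find-and-replace step of such a script could look roughly like the sketch below. It is an illustration under assumptions, not the contents of create-new-aks-cluster.bash: the function names and the placeholder tokens `__REGION__` and `__CLUSTER_NAME__` are hypothetical, and the pull request step is left out.

```shell
#!/usr/bin/env bash
# Sketch: validate the region and render a new cluster folder from a
# region template. Function names and placeholder tokens are hypothetical.

is_valid_region() {
  # Accept only the regions offered by the prompt.
  case "$1" in
    region1|region2|region3) return 0 ;;
    *) return 1 ;;
  esac
}

render_cluster_from_template() {
  local template_dir=$1 target_dir=$2 region=$3 cluster_name=$4
  local file

  is_valid_region "$region" || return 1

  # Copy the template, including dotfiles such as .terraform-version.
  mkdir -p "$target_dir"
  cp -R "$template_dir"/. "$target_dir"/

  # Substitute placeholders in the Terraform files of the new folder.
  # Values are assumed not to contain "/" or other sed metacharacters.
  for file in "$target_dir"/*.tf "$target_dir"/*.tfvars; do
    [ -f "$file" ] || continue
    sed -e "s/__REGION__/${region}/g" \
        -e "s/__CLUSTER_NAME__/${cluster_name}/g" \
        "$file" > "${file}.tmp" && mv "${file}.tmp" "$file"
  done
}
```

Writing to a temporary file and moving it into place avoids `sed -i`, whose syntax differs between GNU and BSD sed. The template directory itself is never modified, so it can be reused for the next cluster.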
Log in to https://github.com/
Navigate to the infrastructure/terraform/microservices/${new_AKS_cluster}/ folder and add a commit to update subnet, dns_service_ip, and service_cidr.
We can update the subnet, service_cidr, DNS, and other variables, and review the resource definitions.
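One check worth running before committing these values: AKS expects dns_service_ip to fall inside service_cidr. The sketch below is a hypothetical pre-commit helper, not part of our script, and the addresses in the usage note are made-up examples. It assumes bash and plain dotted-decimal IPv4 octets without leading zeros.

```shell
#!/usr/bin/env bash
# Sketch: check that an IPv4 address lies inside a CIDR range.
# Helper names are hypothetical; requires bash for <<< and (( )).

ip_to_int() {
  # Convert dotted-decimal IPv4 to a single integer.
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo "$(( (a << 24) | (b << 16) | (c << 8) | d ))"
}

ip_in_cidr() {
  local ip=$1 cidr=$2
  local network=${cidr%/*} prefix=${cidr#*/}
  # Build the netmask for the prefix length, limited to 32 bits.
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  # Same network bits means the address is inside the range.
  (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$network") & mask ) ))
}
```

For example, `ip_in_cidr 10.0.0.10 10.0.0.0/16` succeeds, while `ip_in_cidr 10.1.0.10 10.0.0.0/16` returns a non-zero status, which a script can use to stop before the pull request is opened.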
Our GitHub integration with Terraform then provisions the resources.
The new approach has the following benefits:
No pressure on the duration of the activity and no need for a maintenance window
Less effort in shifting traffic, since we shift it only once instead of in a two-way shift
Minimal or zero downtime
Faster upgrades, taking hours instead of days
The minor downside is that it requires additional code cleanup.
Every single component in the cloud infra should be provisioned by code (IaC). This enables automation wherever possible and keeps the management of the infra in one place. Big win!