
Triggering alerts when the maximum node count of an autoscaling-enabled node pool has been reached



Introduction

When running an AKS cluster with autoscaling-enabled node pools, you probably want to get alerted when the autoscaler starts hitting its configured maximum node count. This helps you tweak your node pool setup to better match the workload it is carrying.

It seems there is no predefined signal for this kind of scenario, so I had to do a little research on my own. In the following steps, I describe how to generate alerts and send notification e-mails whenever an AKS node pool reaches its configured maximum node count. Here is how I solved it.

Step by step

Enable Resource Logging

First, you need to enable diagnostic settings for the resource log type Kubernetes Cluster Autoscaler and send it to an existing Log Analytics Workspace. So navigate to your AKS cluster’s “Monitoring > Diagnostic Settings > Add diagnostic setting”.

[Screenshot: diagnostic setting with the Kubernetes Cluster Autoscaler log category enabled]
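If you prefer to script this, the Azure CLI should achieve the same thing. The resource group, cluster, and workspace names below are placeholders of mine, so adjust them to your environment:

# Assumption: resource group "my-rg", cluster "aks-azureblue" and an existing
# Log Analytics workspace "my-workspace" - replace with your own names.
AKS_ID=$(az aks show -g my-rg -n aks-azureblue --query id -o tsv)
WS_ID=$(az monitor log-analytics workspace show -g my-rg -n my-workspace --query id -o tsv)

# Send only the cluster-autoscaler resource log category to the workspace
az monitor diagnostic-settings create \
  --name cas-logs \
  --resource "$AKS_ID" \
  --workspace "$WS_ID" \
  --logs '[{"category": "cluster-autoscaler", "enabled": true}]'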

Create an alert rule

Next, we need to create an alert rule. So navigate to “Monitoring > Alerts” and create a new Alert Rule.

Condition section

In the Condition section, choose Custom log search and paste the following KQL query. Make sure you replace the Resource value ('aks-azureblue') with the name of your AKS cluster.

AzureDiagnostics
| where Category == 'cluster-autoscaler'
    and Resource =~ 'aks-azureblue'
    and log_s has 'exceeds node group set capacity, capping to'
    and TimeGenerated >= ago(5m)
| order by TimeGenerated
| project TimeGenerated, log_s

The query searches the cluster-autoscaler category within the AzureDiagnostics table for log entries containing the string exceeds node group set capacity, capping to. The cluster autoscaler (CAS) emits this log line whenever a scale-up would exceed the node pool’s configured maximum, i.e. whenever it can’t add any more nodes.
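Before wiring the query into an alert, you can also run it ad hoc against the workspace. One way to do that from the command line is az monitor log-analytics query; the resource group and workspace names below are again placeholders, and note that the command expects the workspace’s customer ID rather than its resource ID:

# Assumption: same placeholder resource group and workspace names as before
WS_GUID=$(az monitor log-analytics workspace show -g my-rg -n my-workspace \
  --query customerId -o tsv)

# Run the alert query ad hoc; widen the time range to catch older entries
az monitor log-analytics query -w "$WS_GUID" -o table --analytics-query "
AzureDiagnostics
| where Category == 'cluster-autoscaler'
    and Resource =~ 'aks-azureblue'
    and log_s has 'exceeds node group set capacity, capping to'
    and TimeGenerated >= ago(1h)
| project TimeGenerated, log_s"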

💡 If you have just enabled feeding resource logs to your Log Analytics workspace, you need to be patient. It can take up to 10 minutes until the first logs show up and become queryable!

From within the Alert logic subsection, choose the following settings and head over to Actions.

[Screenshot: Alert logic settings]

Actions section

In the Actions section, create a new action group, give it a name, and add a notification type of your choice. In my case, I am going to send an e-mail to myself.

[Screenshot: action group with an e-mail notification]
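The same action group can also be created with the Azure CLI; the resource group, group name, short name, and e-mail address below are made up, so replace them with your own:

# Assumption: resource group "my-rg"; group name, short name and address are placeholders
az monitor action-group create \
  --resource-group my-rg \
  --name ag-aks-scaling \
  --short-name aksscaling \
  --action email admin someone@example.com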

Details section

Back in the Details section, give the rule a name that identifies it and add a meaningful description. Finally, click Review + Create.

[Screenshot: alert rule details]

Take it for a test drive

Now that everything is in place, it’s time to take the setup for a test drive. We are going to test the rule by forcing the node pool to scale over the limits.

Assuming you have a user node pool capped at 3 nodes, using the Standard_B4ms VM SKU (4 vCPUs and 16 GB of memory per node), the following deployment will occupy one node per pod and trigger the alert configured earlier.
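In case you don’t have such a node pool yet, something along the following lines should create one that matches these assumptions (resource group, cluster, and pool names are placeholders):

# Assumption: resource group "my-rg", cluster "aks-azureblue", pool name "userpool"
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name aks-azureblue \
  --name userpool \
  --mode User \
  --node-vm-size Standard_B4ms \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3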

So go ahead and create a file called e.g. scaling-demo.yaml and apply it with kubectl apply -f scaling-demo.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
  labels:
    app: scaling-demo
spec:
  # Start with zero replicas; we scale up later to push the node pool over its limit
  replicas: 0
  selector:
    matchLabels:
      app: scaling-demo
  template:
    metadata:
      labels:
        app: scaling-demo
    spec:
      containers:
      - name: aks-auto-scaler-demo
        image: mcr.microsoft.com/oss/nginx/nginx:1.15.5-alpine
        resources:
          limits:
            cpu: 200m
            memory: 10Gi
          requests:
            # 10Gi per pod means only one pod fits on a 16 GB Standard_B4ms node
            memory: 10Gi
scaling-demo.yaml

Now that the deployment has been created, let’s scale it beyond the available resources by issuing kubectl scale --replicas=4 deployment/demo.
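While the autoscaler is working, you can watch it hit the ceiling from the command line; once the three nodes are up, at least one pod should remain Pending:

# One pod per node fits, so with a maximum of 3 nodes the 4th pod stays Pending
kubectl get pods -l app=scaling-demo -o wide
kubectl get nodes

# The pod's events explain why it can't be scheduled or trigger a further scale-up
kubectl describe pods -l app=scaling-demo | grep -A 8 "Events:"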

After a couple of minutes, check your inbox, which should contain a message with the following content.

[Screenshot: the alert notification e-mail]

Conclusion

I have demonstrated how an alert can be triggered when an autoscaling-enabled node pool reaches its configured maximum node count.

The proposed solution is limited in that it doesn’t state which node pool exactly is affected. That’s because the log string doesn’t contain that information and some KQL magic would be required to parse and match earlier messages emitted by the cluster autoscaler.

Still, the proposed solution provides added value and notifies you when your workload hits the ceiling. I hope you enjoyed reading my article and appreciate your feedback!

Happy hacking! 😎

Further reading

Query logs from Container insights – Azure Monitor

Container insights collects metrics and log data, and this article describes the records and includes sample queries.


Use the cluster autoscaler in Azure Kubernetes Service (AKS) – Azure Kubernetes Service

Learn how to use the cluster autoscaler to automatically scale your Azure Kubernetes Service (AKS) clusters to meet application demands.


autoscaler/cluster-autoscaler/README.md at master · kubernetes/autoscaler

Autoscaling components for Kubernetes. Contribute to kubernetes/autoscaler development by creating an account on GitHub.


Cheat Sheet – Azure Kubernetes Services

Node pool management commands and other day-to-day AKS operations.


Cheat Sheet – KQL

Handy KQL snippets, for example for searching through exceptions in Application Insights.


