
In the previous post on AKS Identity and Access Control, we covered authentication and authorisation, Workload Identity, secrets management, and Zero Trust principles.
Your cluster is now secured! But a cluster you cannot see into is a cluster you cannot operate. In production, pods crash, nodes exhaust resources, latency spikes, and deployments fail silently. Without observability, you are reacting to outages instead of preventing them.
This post covers the full observability stack for AKS: the layers you need to monitor, the Log Analytics tables and tiers to use, the new OpenTelemetry-native ingestion path, and how AKS Automatic changes the defaults.
Observability Layers in AKS
I’ve used my “Onions have Layers, Kubernetes has Layers” meme previously, but the concept of layers in AKS, and in Kubernetes in general, becomes more visible when it comes to monitoring, because there is no “single pane of glass”, one-size-fits-all solution. AKS monitoring operates across multiple distinct layers, and each layer requires a different set of tools.

Each layer feeds into the others. A node running out of memory (Infrastructure layer) causes pod evictions (Workloads layer), which increase error rates (Applications layer). Full-stack observability means you can trace a user-facing incident from symptom to root cause across all layers.

Control Plane Logs
AKS is a managed service, so you do not have direct access to control plane nodes. Control plane activity is exposed as resource logs in Azure Monitor, and enabling them is one of the first things you should do with any production cluster.
They are not collected by default. You must create a Diagnostic Setting on the cluster, and you should use resource-specific mode when you do. This routes logs to dedicated tables (AKSAudit, AKSAuditAdmin, AKSControlPlane) instead of the generic AzureDiagnostics table. Only resource-specific mode supports the Basic logs tier, which matters for cost control.
| Category | What It Contains | When to Enable |
| kube-apiserver | All API server requests and responses | When troubleshooting API-level issues |
| kube-audit | Full audit log: all API calls including GET and LIST | When you need a complete interaction trail (high volume, high cost) |
| kube-audit-admin | Audit log scoped to write operations only (create, update, delete) | Recommended for most production clusters — lower cost than kube-audit |
| kube-controller-manager | Reconciliation loops and controller activity | Troubleshooting deployment and resource issues |
| kube-scheduler | Pod scheduling decisions | Diagnosing pending pods and scheduling failures |
| cluster-autoscaler | Scale-out and scale-in events | Always recommended on clusters using autoscaling |
| guard | Entra ID and Azure RBAC authentication audit events | Always recommended when using Entra ID integration |
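As a rough illustration of the resource-specific mode and category choices above, here is a minimal Python sketch that builds a Diagnostic Setting request body. It assumes the shape used by the Azure Monitor diagnostic settings API, where `logAnalyticsDestinationType` set to `Dedicated` selects resource-specific tables. The workspace resource ID and the choice of three categories are placeholders for illustration, not values from a real environment.

```python
import json

# Categories the table above marks as recommended for most production clusters.
RECOMMENDED_CATEGORIES = ["kube-audit-admin", "guard", "cluster-autoscaler"]


def build_diagnostic_setting(workspace_id: str, categories: list) -> dict:
    """Build a diagnostic setting body that uses resource-specific mode."""
    return {
        "properties": {
            "workspaceId": workspace_id,
            # "Dedicated" routes logs to AKSAudit, AKSAuditAdmin and
            # AKSControlPlane instead of the generic AzureDiagnostics table.
            "logAnalyticsDestinationType": "Dedicated",
            "logs": [{"category": c, "enabled": True} for c in categories],
        }
    }


# Hypothetical workspace resource ID, for illustration only.
WORKSPACE_ID = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights"
    "/workspaces/law-aks"
)

setting = build_diagnostic_setting(WORKSPACE_ID, RECOMMENDED_CATEGORIES)
print(json.dumps(setting, indent=2))
```

Leaving kube-audit out of the list keeps the highest-volume category off until you actually need the full read trail.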
Infrastructure and Workload Metrics: Managed Prometheus and Grafana
Platform metrics (basic CPU, memory, and pod counts surfaced in the Azure portal) give you a starting point, but they are not enough for production operations. For real observability at the infrastructure and workload level, you need Azure Monitor Managed Service for Prometheus paired with Azure Managed Grafana.
Azure Monitor Managed Service for Prometheus
Managed Prometheus is a fully managed, Prometheus-compatible metrics service backed by an Azure Monitor workspace. It scrapes metrics from your AKS cluster using a containerized Azure Monitor agent deployed as a DaemonSet. There is no Prometheus server to deploy, scale, or maintain.
Key capabilities include:
- Write your own queries or use community dashboards
- Pre-configured recording rules and alert rules for Kubernetes deployed automatically
- Metrics retention for up to 18 months
- Native integration with Azure Managed Grafana for visualisation
- Enabled with --enable-azure-monitor-metrics at cluster creation or update

Azure Managed Grafana
Azure Managed Grafana is a fully managed Grafana instance that connects directly to your Azure Monitor workspace as a data source. It comes pre-loaded with community Kubernetes dashboards covering node health, pod resource consumption, API server performance, and more.
You can link a Grafana workspace to your cluster at the same time you enable Prometheus metrics. A single Azure Managed Grafana instance can serve as a single pane of glass across multiple AKS clusters, all pointing at the same Azure Monitor workspace.
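To make the enablement step concrete, the sketch below assembles an `az aks update` command that turns on Managed Prometheus and, optionally, links an Azure Monitor workspace and a Grafana instance in the same call. The cluster name, resource group, and resource IDs are hypothetical placeholders. The `--azure-monitor-workspace-resource-id` and `--grafana-resource-id` flags are used here as I understand the Azure CLI, so check them against your CLI version before relying on this.

```python
import shlex
from typing import List, Optional


def build_enable_metrics_command(
    cluster_name: str,
    resource_group: str,
    workspace_id: Optional[str] = None,
    grafana_id: Optional[str] = None,
) -> List[str]:
    """Return the argv for enabling Managed Prometheus on an existing cluster."""
    args = [
        "az", "aks", "update",
        "--name", cluster_name,
        "--resource-group", resource_group,
        "--enable-azure-monitor-metrics",
    ]
    # Optional: point at an existing Azure Monitor workspace.
    if workspace_id:
        args += ["--azure-monitor-workspace-resource-id", workspace_id]
    # Optional: link a Managed Grafana instance at the same time.
    if grafana_id:
        args += ["--grafana-resource-id", grafana_id]
    return args


# Hypothetical names and IDs, for illustration only.
command = build_enable_metrics_command(
    "aks-prod",
    "rg-aks",
    workspace_id="/subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.Monitor/accounts/amw-shared",
    grafana_id="/subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.Dashboard/grafana/grafana-shared",
)
print(shlex.join(command))
```

Reusing the same workspace and Grafana IDs across clusters is what gives you the shared view described above.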

Container Insights: Logs, Events, and Workload Visibility
Container Insights is a feature of Azure Monitor that collects container logs, Kubernetes events, and workload inventory from your AKS cluster and stores them in a Log Analytics workspace. It is the primary tool for understanding what is happening inside your pods and namespaces.
Container Insights and Managed Prometheus work together using the same containerized Azure Monitor agent. Prometheus handles metrics; Container Insights handles logs and events.
What Container Insights Collects
- Container logs: stdout and stderr from all containers, stored in ContainerLogV2 (the recommended schema)
- Kubernetes events: pod restarts, scheduling failures, image pull errors, OOM kills
- Pod and node inventory: workload state, resource requests and limits, namespace breakdown
- Performance data: CPU and memory utilisation at node and container level
Data collection can be customised using Azure Monitor Data Collection Rules (DCRs) to control costs. You can configure collection intervals, exclude namespaces, and select specific tables to reduce ingestion volume.
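The three cost levers just mentioned (interval, namespace filtering, table selection) can be sketched as a small settings builder. The key names below follow the Container Insights data collection settings as I understand them (`interval`, `namespaceFilteringMode`, `namespaces`, `streams`, `enableContainerLogV2`). Treat the exact schema, and the example namespaces and stream list, as assumptions to verify against the current documentation.

```python
import json
from typing import Iterable, Optional


def build_data_collection_settings(
    interval: str = "1m",
    excluded_namespaces: Optional[Iterable[str]] = None,
    streams: Optional[Iterable[str]] = None,
) -> dict:
    """Build Container Insights collection settings aimed at lower ingestion."""
    settings = {"interval": interval, "enableContainerLogV2": True}
    if excluded_namespaces:
        # Skip noisy namespaces entirely.
        settings["namespaceFilteringMode"] = "Exclude"
        settings["namespaces"] = list(excluded_namespaces)
    else:
        settings["namespaceFilteringMode"] = "Off"
    if streams:
        # Collect only the tables you actually query.
        settings["streams"] = list(streams)
    return settings


# Illustrative choices: a longer interval, two excluded namespaces, three streams.
dcr_settings = build_data_collection_settings(
    interval="5m",
    excluded_namespaces=["kube-system", "gatekeeper-system"],
    streams=[
        "Microsoft-ContainerLogV2",
        "Microsoft-KubeEvents",
        "Microsoft-KubePodInventory",
    ],
)
print(json.dumps(dcr_settings, indent=2))
```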
Important: ContainerLogV2 is the recommended log schema for new clusters. It provides structured fields including pod name, namespace, and container name, making queries significantly easier than the legacy ContainerLog schema.
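To show what those structured fields buy you, here is a small sketch that builds a KQL query for recent stderr output in one namespace. The column names (`PodNamespace`, `PodName`, `ContainerName`, `LogMessage`, `LogSource`) are the ContainerLogV2 columns as I know them. The function name, defaults, and the `payments` namespace are made up for the example.

```python
import re

# Kubernetes namespaces are DNS labels; reject anything else before it
# reaches the query text.
_NAMESPACE_PATTERN = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def build_stderr_query(namespace: str, hours: int = 1, limit: int = 50) -> str:
    """Return a KQL query for recent stderr lines from one namespace."""
    if not _NAMESPACE_PATTERN.fullmatch(namespace):
        raise ValueError(f"not a valid namespace name: {namespace!r}")
    return "\n".join([
        "ContainerLogV2",
        f"| where TimeGenerated > ago({hours}h)",
        f"| where PodNamespace == '{namespace}'",
        "| where LogSource == 'stderr'",
        "| project TimeGenerated, PodName, ContainerName, LogMessage",
        "| order by TimeGenerated desc",
        f"| take {limit}",
    ])


query = build_stderr_query("payments", hours=2)
print(query)
```

You can paste the printed query into the Logs blade of the workspace, or run it through whichever query client you already use.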
Application-Level Observability: Application Insights
Infrastructure and workload observability tells you that a pod is crashing or a node is under pressure. Application Insights tells you why users are seeing errors — which requests are failing, where latency is concentrated, and how services are calling each other.
Application Insights is an application performance monitoring (APM) feature of Azure Monitor. For AKS workloads, there are three instrumentation approaches:
Code-Based Instrumentation with OpenTelemetry
The standard approach is to add the Azure Monitor OpenTelemetry Distro to your application code. This collects requests, dependencies, exceptions, traces, and custom metrics, sending them to an Application Insights resource.
This gives you the Application Map along with Live Metrics for real-time visibility into production traffic.
Automatic Instrumentation (Preview)
When automatic instrumentation is enabled, the Azure Monitor OpenTelemetry Distro is injected into application pods automatically with no code changes required. Instrumentation can be applied on all namespaces or per-deployment.
Native OTLP Ingestion into Azure Monitor (Preview)
This is a recent announcement, and it is a significant shift. Azure Monitor now supports native ingestion of OpenTelemetry Protocol (OTLP) signals directly.
This announcement is meaningful for a number of reasons, but the main one is that it's vendor-neutral: applications can use the standard open-source OpenTelemetry SDK and OTLP exporter with no Azure-specific code changes or configuration required.
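To illustrate what vendor-neutral means in practice, the sketch below generates the standard OpenTelemetry environment variables you would place on a container spec. The variable names come from the OpenTelemetry specification and contain nothing Azure-specific. The endpoint URL, service name, and namespace are hypothetical, and how the OTLP endpoint is provisioned in the Azure Monitor preview is not covered here.

```python
from typing import Dict, List


def build_otlp_env(service_name: str, endpoint: str, namespace: str) -> Dict[str, str]:
    """Standard OpenTelemetry SDK settings, using only spec-defined names."""
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        # Assumption: the receiving endpoint accepts OTLP over HTTP.
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
        "OTEL_RESOURCE_ATTRIBUTES": f"k8s.namespace.name={namespace}",
    }


def to_container_env(env: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert to the name/value list used in a Kubernetes container spec."""
    return [{"name": key, "value": value} for key, value in sorted(env.items())]


# Hypothetical endpoint, service name and namespace, for illustration only.
otlp_env = build_otlp_env(
    "checkout-api",
    "http://otel-endpoint.monitoring:4318",
    "payments",
)
for entry in to_container_env(otlp_env):
    print(f"{entry['name']}={entry['value']}")
```

Because the application only sees these generic settings, changing where the telemetry goes is a deployment change, not a code change.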
Network Observability
Networking is often the last layer to get proper observability, yet it is frequently the source of hard-to-diagnose issues.
When Managed Prometheus is enabled on Kubernetes 1.29 or later, basic node-level network metrics are collected by default via the Retina-based scraper, covering traffic volume and error rates.
For deeper visibility, including pod-level metrics, DNS tracking, and full flow logs, Container Network Observability (part of Advanced Container Networking Services) provides eBPF-based telemetry and writes results to ContainerNetworkLogs and RetinaNetworkFlowLogs. ACNS is a paid add-on.
Telemetry Data Flow
And breathe! With so many tools collecting from so many sources, it helps to see the full picture:

Log Analytics Tables and Tiers
Anyone who follows me on LinkedIn (sneaky link for those who don’t!) knows that I talk a lot about FinOps and that Log Analytics is the target of a lot of my angst when it comes to Cost Management. At the risk of repeating myself, Log Analytics offers three table tiers, and the right choice for each AKS table can reduce your monitoring bill significantly.
| Tier | Ingestion Cost | Best For |
| Analytics | Standard | Frequently queried data, alerting, dashboards |
| Basic | Significant discount | Verbose logs accessed occasionally |
| Auxiliary | Lowest cost | Long-term retention, rarely queried |
When you send data to Log Analytics from any Azure resource, all tables default to the “Analytics” tier. AKS is a busy, multi-layered system that can generate a high volume of logs, so you need to think about how these will be stored in Log Analytics. Below is a sample of how this should look:
| Table | Source | Tier | Why |
| AKSAudit | kube-audit | ✅ Basic | Very high volume. Compliance and investigation, not real-time alerting |
| AKSAuditAdmin | kube-audit-admin | Analytics | Write operations only. Often used for alerting and security |
| AKSControlPlane | Other control plane logs | Analytics | Operational data used for troubleshooting and alerting |
| ContainerLogV2 | Container Insights | ✅ Basic | Container stdout/stderr. Very high volume. Microsoft recommends Basic |
| KubeEvents | Container Insights | Analytics | Pod restarts, OOM kills, scheduling failures. Critical for alerting |
| KubePodInventory | Container Insights | Analytics | Powers Container Insights UI. Must be Analytics |
| RetinaNetworkFlowLogs | Container Network Observability | ✅ Basic | Switch from Analytics default for cost savings |
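One way to apply the tier choices above is to generate the table update commands from a single mapping. The sketch below assumes the `az monitor log-analytics workspace table update` command with its `--plan` option. The resource group and workspace names are hypothetical. Tables that stay on Analytics are skipped, since that is already the default.

```python
import shlex
from typing import Dict, List

# Tier per table, taken from the table above.
TABLE_PLANS = {
    "AKSAudit": "Basic",
    "AKSAuditAdmin": "Analytics",
    "AKSControlPlane": "Analytics",
    "ContainerLogV2": "Basic",
    "KubeEvents": "Analytics",
    "KubePodInventory": "Analytics",
    "RetinaNetworkFlowLogs": "Basic",
}


def build_plan_commands(resource_group: str, workspace: str, plans: Dict[str, str]) -> List[str]:
    """Return one CLI command per table that must leave the Analytics default."""
    commands = []
    for table, plan in plans.items():
        if plan == "Analytics":
            continue  # already the default tier, nothing to change
        commands.append(shlex.join([
            "az", "monitor", "log-analytics", "workspace", "table", "update",
            "--resource-group", resource_group,
            "--workspace-name", workspace,
            "--name", table,
            "--plan", plan,
        ]))
    return commands


# Hypothetical resource group and workspace names.
plan_commands = build_plan_commands("rg-monitoring", "law-aks", TABLE_PLANS)
for line in plan_commands:
    print(line)
```

Keeping the mapping in source control gives you a reviewable record of which tables were deliberately moved off the default tier.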
Alerting and Recommended Alert Rules
Collecting data is only useful if you act on it. Azure Monitor provides a set of recommended Prometheus-based alert rules for AKS that you can enable with a single action in the portal. These cover the most important cluster health signals:
- Node CPU and memory pressure
- Pod restart rates and CrashLoopBackOff detection
- Pending pods – pods that cannot be scheduled
- Job failures
- Container OOM (out of memory) kills
- PersistentVolume capacity

These rules are backed by Prometheus metrics and stored in your Azure Monitor workspace.
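For a sense of what sits behind signals like these, here is a sketch of PromQL expressions built on common kube-state-metrics and kubelet metric names. These are illustrative only: they are not the rule definitions Azure ships, the thresholds are arbitrary, and whether each metric is in your cluster's scrape configuration is something to check.

```python
from typing import Dict


def build_alert_expressions(restart_threshold: int = 3, pv_used_ratio: float = 0.8) -> Dict[str, str]:
    """Return illustrative PromQL expressions keyed by alert name."""
    return {
        # Containers restarting repeatedly within 15 minutes.
        "PodRestarts": (
            "increase(kube_pod_container_status_restarts_total[15m]) > "
            f"{restart_threshold}"
        ),
        # Containers stuck waiting in CrashLoopBackOff.
        "CrashLoopBackOff": (
            'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1'
        ),
        # Pods that cannot be scheduled, grouped by namespace.
        "PendingPods": 'sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0',
        # Containers whose last termination was an OOM kill.
        "OOMKilled": (
            'kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1'
        ),
        # PersistentVolumes filling up.
        "PVCapacity": (
            "kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > "
            f"{pv_used_ratio}"
        ),
    }


expressions = build_alert_expressions()
for name, expr in expressions.items():
    print(f"{name}: {expr}")
```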
AKS Automatic: How Observability Defaults Change
Everything covered so far assumes an AKS Standard cluster, where observability is opt-in. On a fresh Standard cluster, nothing is enabled by default. AKS Automatic is different. It is a more opinionated, fully managed cluster experience where observability comes preconfigured.
| Component | AKS Standard | AKS Automatic |
| Managed Prometheus | ❌ Optional | ✅ Default at creation |
| Container Insights | ❌ Optional | ✅ Default at creation |
| ACNS Container Network Observability | ❌ Optional (paid) | ✅ Default (portal creation) |
| Managed Grafana workspace | ❌ Optional | ❌ Optional |
| Diagnostic Settings (control plane) | ❌ Optional | ❌ Optional |
| Recommended Prometheus alert rules | ❌ Optional | ❌ Optional |
The TLDR: AKS Automatic gives you a much stronger observability baseline from minute one. But control plane Diagnostic Settings (kube-audit-admin, guard) are still not on by default, and the Log Analytics tier configuration is still your responsibility.
Aligning with the Azure Well-Architected Framework
- Operational Excellence: Full-stack observability means faster detection and faster resolution. Prebuilt dashboards (remember, someone needs to be looking at them!) and alert rules (remember, someone needs to act on them and not just have an Outlook rule that puts them in a folder where they are ignored) reduce the time to configure baseline monitoring.
- Reliability: Alerting on node pressure, pending pods, and OOM events allows teams to respond before workloads are disrupted. Kubernetes event collection surfaces early warning signals.
- Security: kube-audit-admin and guard logs provide an audit trail for all API write operations and authentication events, supporting compliance and incident investigation.
- Cost Optimisation: Data Collection Rules allow you to control ingestion volume. Using kube-audit-admin instead of kube-audit, configuring collection intervals, and filtering namespaces can significantly reduce Log Analytics and Prometheus costs.
Conclusion
At this stage in our AKS journey we have designed the AKS architecture, networking, control plane connectivity, traffic flow, identity and access control, and now observability. The cluster is secure, well-networked, and visible.
In the next post we turn to scaling and node management — how AKS handles demand changes, how to design node pools for production workloads, and how the Cluster Autoscaler and KEDA work together to keep costs under control while maintaining availability.
See you in the next post. While you’re waiting, you can check out the rest of the posts in the series here.