AKS Networking – Ingress and Egress Traffic Flow

In the previous post on AKS Networking, we explored the different networking models available in AKS and how IP strategy, node pool scaling, and control plane connectivity shape a production-ready cluster. Now we move from how the cluster is networked to how traffic actually flows through it.

If networking defines the roads, this post is about traffic patterns, checkpoints, and border control. Understanding traffic flow is essential for reliability, security, performance, and compliance. In this post we’ll explore:

  • north–south vs east–west traffic patterns
  • ingress options and when to use each
  • internal-only exposure patterns
  • outbound (egress) control and compliance design
  • how to design predictable and secure traffic flow

Understanding Traffic Patterns in Kubernetes

Before we talk about tools, we need to talk about traffic patterns.

As with most networking in a traditional hub-and-spoke architecture, Kubernetes networking is often described using two directional models.

North–South Traffic

North–south traffic refers to traffic entering or leaving the cluster, so it can be either ingress (incoming) or egress (outgoing) traffic. Examples include:

Incoming

✔ Users accessing a web app
✔ Mobile apps calling APIs
✔ Partner integrations
✔ External services sending webhooks

Outgoing

✔ Calling SaaS APIs
✔ Accessing external databases
✔ Software updates & dependencies
✔ Payment gateways & third-party services

This traffic crosses trust boundaries and is typically subject to security inspection, routing, and policy enforcement.

East–West Traffic

East–west traffic refers to traffic flowing within the cluster.

Examples include:

  • microservices communicating with each other
  • internal APIs
  • background processing services
  • service mesh traffic

This traffic remains inside the cluster boundary but still requires control and segmentation in production environments.


Ingress: Getting Traffic Into the Cluster

Ingress defines how external clients reach services running inside AKS.

Image Credit: Microsoft

At its simplest, Kubernetes can expose services using a LoadBalancer service type. In production environments, however, ingress controllers provide richer routing, security, and observability capabilities.
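The simplest form of exposure looks like this – a minimal sketch, where the app name and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-frontend          # hypothetical service name
spec:
  type: LoadBalancer          # AKS provisions an Azure public load balancer
  selector:
    app: web-frontend         # pods carrying this label receive the traffic
  ports:
    - port: 80                # port exposed on the load balancer
      targetPort: 8080        # port the container listens on
```

This gives you a public IP per service, but no Layer 7 routing, TLS handling, or WAF – which is where ingress controllers come in.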

Choosing the right ingress approach is one of the most important architectural decisions for external traffic.


Azure Application Gateway + AGIC

Azure Application Gateway with the Application Gateway Ingress Controller (AGIC) provides a native Azure Layer 7 ingress solution.

Image Credit: Microsoft

Application Gateway sits outside the cluster and acts as the HTTP/S entry point. AGIC runs inside AKS and dynamically configures routing based on Kubernetes ingress resources.

Why teams choose it

This approach integrates tightly with Azure networking and security capabilities. It enables Web Application Firewall (WAF) protection, TLS termination, path-based routing, and autoscaling.

Because Application Gateway lives in the VNet, it aligns naturally with enterprise security architectures and centralised inspection requirements.
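With AGIC in place, a standard Kubernetes ingress resource drives the Application Gateway configuration. A minimal sketch (the hostname, service name, and annotation shown are illustrative; `appgw.ingress.kubernetes.io/*` annotations are AGIC-specific):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    appgw.ingress.kubernetes.io/ssl-redirect: "true"   # redirect HTTP to HTTPS at the gateway
spec:
  ingressClassName: azure-application-gateway          # routes this resource to AGIC
  rules:
    - host: app.example.com                            # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-frontend                     # placeholder backend service
                port:
                  number: 80
```

AGIC watches resources like this and translates them into Application Gateway listeners, rules, and backend pools.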

Trade-offs

Application Gateway introduces an additional Azure resource to manage and incurs additional cost. It is also primarily designed for HTTP/S workloads.

For enterprise, security-sensitive, or internet-facing workloads, it is often the preferred choice.


Application Gateway for Containers

Application Gateway for Containers is a newer Azure-native ingress option designed specifically for Kubernetes environments. It's the natural successor to the traditional Application Gateway + AGIC model.

Image Credit: Microsoft

It integrates directly with Azure networking constructs while remaining highly performant and scalable for container-based workloads.

In practical terms, this approach allows Kubernetes resources to directly define how Application Gateway for Containers routes traffic, while Azure manages the underlying infrastructure and scaling behaviour.

Why teams choose it

Application Gateway for Containers is chosen when teams want the security and enterprise integration of Azure Application Gateway but with tighter alignment to Kubernetes-native APIs.

Because it uses the Gateway API instead of traditional ingress resources, it offers a more expressive and modern way to define traffic routing policies. This is particularly attractive for platform teams building shared Kubernetes environments where traffic routing policies need to be consistent and reusable.
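A Gateway API route for this model might look like the following sketch – the gateway and service names are hypothetical, and in practice the referenced Gateway must be backed by an Application Gateway for Containers deployment:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-route
spec:
  parentRefs:
    - name: agc-gateway            # hypothetical Gateway backed by App Gateway for Containers
  rules:
    - matches:
        - path:
            type: PathPrefix       # match all requests under /api
            value: /api
      backendRefs:
        - name: api-service        # placeholder backend service
          port: 80
```

Compared with classic ingress resources, routes like this are first-class objects that platform teams can template and reuse across namespaces.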

Application Gateway for Containers also provides strong integration with Azure networking, private connectivity, and Web Application Firewall capabilities while improving performance compared to earlier ingress-controller models.

Trade-offs

As a newer offering, Application Gateway for Containers may require teams to become familiar with the Kubernetes Gateway API and its resource model.

There is also an additional Azure-managed infrastructure layer involved, which introduces cost considerations similar to the traditional Application Gateway approach.

However, for organisations building modern AKS platforms, Application Gateway for Containers represents a forward-looking ingress architecture that aligns closely with Kubernetes networking standards.

Jack Stromberg has written an extensive post on the functionality of AGC and the migration paths from AGIC and Ingress – check it out here.


NGINX Ingress Controller

The NGINX Ingress Controller is one of the most widely used ingress solutions in Kubernetes. It runs as pods inside the cluster and provides highly flexible routing, TLS handling, and traffic management capabilities.

Image Credit: Microsoft

And it's retiring… well, at least the managed version is.

Microsoft is retiring the managed NGINX Ingress that ships with the Application Routing add-on; support ends in November 2026. The upstream Ingress-NGINX project is being deprecated, so the managed offering is being retired with it.

However, you still have the option to run your own NGINX Ingress inside the cluster – it just comes with more management overhead.

Why teams choose it

NGINX provides fine-grained routing control and is cloud-agnostic. Teams with existing Kubernetes experience often prefer its flexibility and maturity.

It supports advanced routing patterns, rate limiting, and traffic shaping, making it suitable for complex application architectures.
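As an illustration of that fine-grained control, NGINX exposes behaviour like rate limiting through annotations – a sketch with placeholder names:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    # NGINX-specific annotation: limit each client IP to 10 requests per second
    nginx.ingress.kubernetes.io/limit-rps: "10"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com              # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service        # placeholder backend service
                port:
                  number: 80
```

Annotations like this are applied per ingress resource, so different applications can carry different traffic policies on the same controller.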

Trade-offs

Because NGINX runs inside the cluster, you are responsible for scaling, availability, and lifecycle management. Security features such as WAF capabilities require additional configuration or integrations.

NGINX is ideal when flexibility and portability outweigh tight platform integration.


Istio Ingress Gateway

The final ingress approach to cover is the Istio Ingress Gateway, typically deployed as part of a broader service mesh architecture.

When using Istio on AKS, the ingress gateway acts as the entry point for traffic entering the service mesh. It is built on the Envoy proxy and integrates tightly with Istio’s traffic management, security, and observability features.

Rather than acting purely as a simple edge router, the Istio ingress gateway becomes part of the overall service mesh control model. This means that external traffic entering the cluster can be governed by the same policies that control internal service-to-service communication.

Why teams choose it

Teams typically adopt the Istio ingress gateway when they are already using — or planning to use — a service mesh.

One of the main advantages is advanced traffic management. Istio enables sophisticated routing capabilities such as weighted routing, canary deployments, A/B testing, and header-based routing. These patterns are extremely useful in microservice architectures where controlled rollout strategies are required.
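A weighted canary rollout through the ingress gateway can be sketched like this – the hostnames, gateway, and subsets are illustrative, and the subsets would be defined in a corresponding DestinationRule:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-canary
spec:
  hosts:
    - app.example.com            # placeholder hostname
  gateways:
    - web-gateway                # hypothetical Istio Gateway on the ingress gateway
  http:
    - route:
        - destination:
            host: web-frontend
            subset: v1           # stable version keeps 90% of traffic
          weight: 90
        - destination:
            host: web-frontend
            subset: v2           # canary version receives 10% of traffic
          weight: 10
```

Shifting the weights over time promotes the canary gradually, without touching DNS or the applications themselves.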

Another major benefit is built-in security capabilities. Istio can enforce mutual TLS (mTLS) between services, allowing ingress traffic to integrate directly into a zero-trust communication model across the cluster.

Istio also provides strong observability through integrated telemetry, tracing, and metrics. Because Envoy proxies sit on the traffic path, detailed insight into request flows becomes available without modifying application code.

For platform teams building large-scale internal platforms, these capabilities allow ingress traffic to participate fully in the platform’s traffic policy, security posture, and monitoring framework.

Trade-offs

Istio comes with additional operational complexity. Running a service mesh introduces additional control plane components and sidecar proxies that consume compute and memory resources.

Clusters using Istio typically require careful node pool sizing and resource planning to ensure the mesh infrastructure itself does not compete with application workloads.

Operationally, teams must also understand additional concepts such as virtual services, destination rules, gateways, and mesh policies.

I’ll dive into more detail on the concept of Service Mesh in a future post.


Internal Ingress Patterns

Many production clusters expose workloads internally using private load balancers and internal ingress controllers.

This pattern is common when:

  • services are consumed only within the VNet
  • private APIs support internal platforms
  • regulatory or security controls restrict public exposure

Internal ingress allows organisations to treat AKS as a private application platform rather than a public web hosting surface.
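In AKS, keeping a load balancer private is a single annotation on the service – a minimal sketch with placeholder names and ports:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: internal-api
  annotations:
    # Provisions an internal (private) Azure load balancer instead of a public one
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: internal-api            # pods carrying this label receive the traffic
  ports:
    - port: 443
      targetPort: 8443
```

The resulting frontend IP comes from the VNet's address space, so the service is reachable only from peered networks and connected on-premises ranges.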


Designing for Ingress Resilience

Ingress controllers are part of the application data path. If ingress fails, applications become unreachable. Production considerations include:

  • running multiple replicas
  • placing ingress pods across availability zones
  • ensuring node pool capacity for scaling
  • monitoring latency and saturation
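The first three of those considerations can be expressed directly in the controller's deployment spec. A sketch – the names and image are placeholders, not a specific controller's manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-controller                 # illustrative name
spec:
  replicas: 3                              # multiple replicas for availability
  selector:
    matchLabels:
      app: ingress-controller
  template:
    metadata:
      labels:
        app: ingress-controller
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # spread pods across availability zones
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: ingress-controller
      containers:
        - name: controller
          image: example/ingress-controller:v1       # placeholder image
          ports:
            - containerPort: 443
```

The topology spread constraint ensures a zone outage takes out at most one replica more than any other zone, keeping the data path available.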

East–West Traffic and Microservice Communication

Within the cluster, services communicate using Kubernetes Services and DNS.

This abstraction allows pods to scale, restart, and move without breaking connectivity. In production environments, unrestricted east–west traffic can create security and operational risk.

Network Policies allow you to restrict communication between workloads, enabling microsegmentation inside the cluster. This is a foundational step toward zero-trust networking principles.
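A simple microsegmentation policy might look like this sketch – the namespaces, labels, and port are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: api                           # policy applies to pods in this namespace
spec:
  podSelector:
    matchLabels:
      app: api                             # pods this policy protects
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend   # only the frontend namespace may connect
      ports:
        - protocol: TCP
          port: 8080
```

Because a policy selecting a pod denies everything not explicitly allowed, rules like this shrink the blast radius of a compromised workload.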

Some organisations also introduce service meshes to provide:

  • mutual TLS between services
  • traffic observability
  • policy enforcement

While not always necessary, these capabilities become valuable in larger or security-sensitive environments.


Egress: Controlling Outbound Traffic

Outbound traffic is often overlooked during early deployments. However, in production environments, controlling egress is critical for security, compliance, and auditability. Workloads frequently need outbound access for:

  • external APIs
  • package repositories
  • identity providers
  • logging and monitoring services

NAT Gateway and Predictable Outbound IP

With the retirement of Default Outbound Access fast approaching, Microsoft's general recommendation is to use Azure NAT Gateway to provide a consistent outbound IP address for cluster traffic.

Image Credit: Microsoft

This is essential when external systems require IP allow-listing. It also improves scalability compared to default outbound methods.


Azure Firewall and Centralised Egress Control

Many enterprise environments route outbound traffic through Azure Firewall or network virtual appliances. This enables:

  • traffic inspection
  • policy enforcement
  • logging and auditing
  • domain-based filtering

Image Credit: Microsoft

This pattern supports regulatory and compliance requirements while maintaining central control over external connectivity.


Private Endpoints and Service Access

Whenever possible, Azure PaaS services should be accessed via Private Endpoints. This keeps traffic on the Azure backbone network and prevents exposure to the public internet.

Combining private endpoints with controlled egress significantly reduces the attack surface.


Designing Predictable Traffic Flow

Production AKS platforms favour predictability over convenience.

That means:

  • clearly defined ingress entry points
  • controlled internal service communication
  • centralised outbound routing
  • minimal public exposure

This design improves observability, simplifies troubleshooting, and strengthens security posture.


Aligning Traffic Design with the Azure Well-Architected Framework

Operational Excellence improves when traffic flows are observable and predictable.

Reliability depends on resilient ingress and controlled outbound connectivity.

Security is strengthened through restricted exposure, network policies, and controlled egress.

Cost Optimisation improves when traffic routing avoids unnecessary hops and oversized ingress capacity.


What Comes Next

At this point in the series, we have designed:

  • the AKS architecture
  • networking and IP strategy
  • control plane connectivity
  • ingress, egress, and service traffic flow

In the next post, we turn to identity and access control. Because

  • Networking defines connectivity.
  • Traffic design defines flow.
  • Identity defines trust.

See you in the next post!

Microsoft’s Sovereign Cloud Strategy: is it really “Disconnected”?

Image Credit: Microsoft

Microsoft have just announced the General Availability of Disconnected Operations for Azure Local, M365 Local and Foundry Local.

Reading between the lines, the announcement seems to be less about “offline cloud” and more about Microsoft defining a clearer Sovereign Cloud architecture. While the headline is Azure Local disconnected operations + Microsoft 365 Local + Foundry Local, the bigger story is this:

Microsoft is trying to give customers a sovereign stack that spans infrastructure, productivity, and AI — and lets them choose where on the connectivity spectrum each workload sits.

Let’s dig a bit deeper into this.


This is a Sovereign Private Cloud story

The Microsoft Learn page for Sovereign Private Cloud makes the architectural intent explicit, positioning it front and centre as supporting locally hosted, hybrid, and fully disconnected environments:

  • Azure Local for infrastructure
  • Microsoft 365 Local for productivity
  • Unified control and lifecycle management
  • Workload mobility between Azure and on-premises
  • Support for hybrid and disconnected deployment models

Announcing Azure Local, M365 Local, and Foundry Local together isn’t just a bundling of product releases; it’s a shift to a full-stack sovereign operating model:

  • Infrastructure stays local,
  • Productivity stays local,
  • AI inferencing can stay local,
  • Control Plane can be cloud-hosted or on-premises depending on mode.

Azure Local is the foundation — but M365 Local and Foundry Local are the interesting parts

Most people immediately focus on Azure Local (understandably). We now get a local control plane (managed by the management appliance) that provides a portal and ARM experience similar to the public Azure portal.

Flipping to disconnected mode means you do lose Azure Virtual Desktop (understandably), but still get options such as VMs, Kubernetes, Container Registry, Key Vault and Azure Policy.

Image Credit: Microsoft/Douglas Phillips

But the more important signal is that Microsoft is extending the same sovereign/disconnected pattern up the stack:

  • Microsoft 365 Local = productivity continuity inside the sovereign boundary
  • Foundry Local = AI inferencing/model capability inside the sovereign boundary

That matters because “sovereign” projects usually fail at one of these layers:

  • Infra is fine, but productivity still leaks to cloud services
  • Infra and productivity are local, but AI requires cloud inference
  • Everything is local, but operations become unmanageable

Microsoft is clearly trying to close those gaps. But are they?


Microsoft 365 Local: what it is, and what cloud use cases it’s trying to replace

What Microsoft 365 Local actually is

Image Credit: Microsoft

The Microsoft Learn page is direct in the positioning of M365 Local:

Microsoft 365 Local runs Exchange Server, SharePoint Server, and Skype for Business Server on customer-owned Azure Local infrastructure with Azure-consistent management, and it supports hybrid and fully disconnected deployments.

It also emphasizes:

  • customer-owned and customer-managed environment
  • data residency / access / compliance control
  • validated reference architecture and hardened baseline
  • certified hardware / partner-led deployment paths

What cloud use cases it’s potentially replacing

Microsoft 365 Local is not a like-for-like replacement for the entire modern Microsoft 365 cloud suite. The cloud use cases it appears to target (email, document collaboration, unified comms) are the ones where organizations would otherwise be pushed toward:

  • Exchange Online (email/calendar)
  • SharePoint Online (document collaboration/intranet)
  • Teams Online (cloud-first collaboration and video/audio conferencing)

Microsoft 365 Local does this using their stack of traditional on-premises server products:

  • Exchange Server
  • SharePoint Server
  • Skype for Business Server

Foundry Local: what it is, and what cloud AI use cases it’s trying to replace

In the sovereign announcement, Microsoft says Foundry Local now supports bringing large multimodal models into fully disconnected sovereign environments, using partner infrastructure (for example NVIDIA-based platforms) so customers can run local AI inferencing within their own boundaries.

What cloud use cases Foundry Local is trying to replace

Microsoft Foundry (cloud) is positioned as the place to design and operate AI apps/agents at scale, with:

  • a large model catalog
  • managed compute deployments
  • serverless API deployments
  • prompt flow orchestration
  • Azure-hosted endpoints/APIs

That means Foundry Local is potentially replacing cloud-hosted AI patterns like:

  • Serverless model APIs hosted by Microsoft
  • Managed compute model hosting in Azure
  • cloud-based prompt/app development pipelines for the inference/runtime side when data cannot leave the operational boundary

Foundry Local is effectively Microsoft’s answer for customers who need:

“Foundry-style AI capabilities, but the model runtime and data path must stay on-prem / inside the sovereign boundary.”

That’s a big gap in the market, and Microsoft is trying to close it.


What this stack is replacing, architecturally

If you zoom out, the trio maps cleanly to three cloud categories:

1) Azure Local disconnected

Potentially replacing: cloud-managed hybrid infrastructure patterns where WAN dependency is still too high
With: local control plane + Azure-consistent management for infra and some Arc-enabled services.

2) Microsoft 365 Local

Potentially replacing: reliance on Microsoft 365 SaaS for core productivity in environments that can’t support that connectivity/risk model
With: on-prem productivity server workloads on Azure Local under customer control.

3) Foundry Local

Potentially replacing: Azure-hosted model inference/serverless AI endpoints for sensitive AI use cases
With: local inferencing and APIs inside the same sovereign boundary.

That’s why this announcement matters: it’s not just infra resilience. It’s a stack-level sovereignty story.


The hard question: what if you need truly fully disconnected?

Microsoft’s wording now includes “fully disconnected” for Sovereign Private Cloud, and the Azure Local disconnected docs are a genuine step forward. But many organizations still need to define “fully disconnected” much more strictly than a marketing phrase ever can.

In practice, “fully disconnected” usually means:

  • no internet path
  • no cloud control plane dependency
  • no cloud identity dependency
  • no cloud telemetry path
  • updates and artifacts moved through approved transfer processes

If that’s your requirement, you need to compare options honestly and look at some alternatives that already fit that narrative.


Fully disconnected alternative 1: Azure Stack Hub

If you want the most explicit Microsoft-native air-gapped model, Azure Stack Hub is still the reference point.

Diagram showing Azure Stack Hub job roles
Image Credit: Microsoft

Microsoft’s Azure Stack Hub docs are very clear:

  • you can deploy and use it without internet connectivity
  • disconnected mode requires AD FS
  • multitenancy is not supported in disconnected mode because it would require Microsoft Entra ID
  • Microsoft describes this as a scenario for use in “factory floors, cruise ships, and mine shafts”

Looked at closely, that is about as explicit as Microsoft gets in stating that something is “fully disconnected”.

Why it’s still relevant

Azure Stack Hub is often the better fit when the requirement is:

  • pure private cloud
  • internal-only identity
  • no usage data sent to Azure
  • no hybrid dependency as a baseline design

The trade-off

Microsoft also documents the operational compromises in disconnected mode:

  • impaired marketplace flow (manual syndication)
  • no telemetry
  • some extension/tooling limitations
  • constraints around service principals/identity workflows, etc.

That’s the normal cost of real isolation.


Fully disconnected alternative 2: Red Hat OpenShift

If the center of gravity is containers/Kubernetes, OpenShift is one of the strongest mature options for disconnected environments.

Image Credit: Red Hat

Red Hat’s docs are excellent here because they define terms clearly:

  • Disconnected environment = no full internet access
  • Air-gapped network = completely isolated external network
  • Restricted network = limited connectivity (proxies/firewalls, etc.)

That taxonomy is exactly what more cloud vendors should be using.

Red Hat also documents:

  • extra setup is required because OpenShift automates many internet-dependent functions by default
  • preferred disconnected practices (image mirroring, local update service, etc.)
  • a wide range of disconnected install patterns, including on-prem and vSphere-based deployments

What OpenShift is replacing

OpenShift disconnected is often replacing:

  • managed Kubernetes services in public cloud
  • cloud-native CI/CD and image delivery assumptions
  • internet-dependent operator/update workflows

It’s a great fit if your target state is platform engineering and Kubernetes-first operations — but it absolutely requires discipline around mirroring, registries, and update lifecycle.


So where do M365 Local and Foundry Local fit versus these alternatives?

This is the key architecture question that can be answered in a number of ways.

If you want a Microsoft-centric sovereign stack (infra + productivity + AI)

Azure Local + M365 Local + Foundry Local is very compelling, because Microsoft is finally addressing all three layers together:

  • infra continuity
  • productivity continuity
  • AI continuity

all inside one sovereign/private-cloud framing.

That’s the strongest part of the announcement.

If you need “hard air gap” with minimal cloud relationship

Azure Stack Hub is still the clearer Microsoft answer, because the disconnected mode assumptions are explicitly documented (AD FS, no multitenancy, no Azure dependency during operation).

If you need broad private-cloud or Kubernetes-first flexibility

Red Hat OpenShift is a serious alternative, especially when:

  • you already run those stacks
  • your ops teams are built around them
  • your security model is based on internal depots, mirrors, and transfer controls rather than cloud-integrated management

Conclusion

Microsoft is not just shipping “offline features”; they’re building a Sovereign Private Cloud narrative where:

  • Azure Local covers infrastructure,
  • Microsoft 365 Local covers productivity,
  • Foundry Local covers AI,
  • and customers can choose connected, hybrid, or fully disconnected modes based on mission and risk.

But “fully disconnected” still needs precise architecture definitions in every real project. Because in practice, the right question is never:

“Can this run disconnected?”

It’s:

“Which layer is disconnected, which layer isn’t, and who owns the operational overhead?”

Hope you found this post useful – see you next time

From Containers to Kubernetes Architecture

In the previous post, What Is Azure Kubernetes Service (AKS) and Why Should You Care?, we got an intro to AKS, compared it to Azure PaaS services to ask when each is the right choice, and finally spun up an AKS cluster to demonstrate exactly which responsibilities Microsoft exposes to you.

In this post, we’ll take a step back to first principles and understand why containers and microservices emerged, how Docker changed application delivery, and how those pressures ultimately led to Kubernetes.

Only then does the architecture of Kubernetes – and by extension AKS – fully make sense.


From Monoliths to Microservices

If you rewind to the 1990s and early 2000s, most enterprise systems followed a fairly predictable pattern: client/server.

You either had thick desktop clients connecting to a central database server, or you had early web applications running on a handful of physical servers in a data centre. Access was often via terminal services, remote desktop, or tightly controlled internal networks.

Applications were typically deployed as monoliths. One codebase. One deployment artifact. One server—or maybe two, if you were lucky enough to have a test environment.

Infrastructure and application were deeply intertwined. If you needed more capacity, you bought another server. If you needed to update the application, you scheduled downtime. And this wasn’t like the downtime we know today – it could run into days, usually over public holiday weekends where you had an extra day. Think you’re going to be having Christmas dinner or opening Easter eggs? Nope – there’s an upgrade on those weekends!

This model worked in a world where:

  • Release cycles were measured in months
  • Scale was predictable
  • Users were primarily internal or regionally constrained

But as the web matured in the mid-2000s, and SaaS became mainstream, expectations changed.


Virtualisation and Early Cloud

Virtual machines were the first major shift.

Instead of deploying directly to physical hardware, we began deploying to hypervisors. Infrastructure became more flexible. Provisioning times dropped from weeks to hours, and rollback of changes became easier too, which de-risked the deployment process.

Then around 2008–2012, public cloud platforms began gaining serious enterprise traction. Infrastructure became API-driven. You could provision compute with a script instead of a purchase order.

Despite these changes, the application model was largely the same. We were still deploying monoliths—just onto virtual machines instead of physical servers.

The client/server model had evolved into a browser/server model, but the deployment unit was still large, tightly coupled, and difficult to scale independently.


The Shift to Microservices

Around the early 2010s, as organisations like Netflix, Amazon, and Google shared their scaling stories, the industry began embracing microservices more seriously.

Instead of a single large deployment, applications were broken into smaller services. Each service had:

  • A well-defined API boundary
  • Its own lifecycle
  • Independent scaling characteristics

This made sense in a world of global users and continuous delivery.

However, it introduced new complexity. You were no longer deploying one application to one server. You might be deploying 50 services across 20 machines. Suddenly, your infrastructure wasn’t just hosting an app—it was hosting an ecosystem.

And this is where the packaging problem became painfully obvious.


Docker and the Rise of Containers

Docker answered the packaging problem.

Containers weren’t new. Linux containers had existed in various forms for years. But Docker made them usable, portable, and developer-friendly.

Instead of saying “it works on my machine,” developers could now package:

  • Their application code
  • The runtime
  • All dependencies
  • Configuration

Into a single container image. That image could run on a laptop, in a data centre, or in the cloud—consistently. This was a major shift in the developer-to-operations contract.

The old model:

  • Developers handed over code
  • Operations teams configured servers
  • Problems emerged somewhere in between

The container model:

  • Developers handed over a runnable artifact
  • Operations teams provided a runtime environment

But Docker alone wasn’t enough.

Running a handful of containers on a single VM was manageable. Running hundreds across dozens of machines? That required coordination.

We had solved packaging. We had not solved orchestration. As container adoption increased, a new challenge emerged:

Containers are easy. Running containers at scale is not.


Why Kubernetes Emerged

Kubernetes emerged to solve the orchestration problem.

Instead of manually deciding where containers should run, Kubernetes introduced a declarative model. You define the desired state of your system—how many replicas, what resources, what networking—and Kubernetes continuously works to make reality match that description.

This was a profound architectural shift.

It moved us from:

  • Logging into servers via SSH
  • Manually restarting services
  • Writing custom scaling scripts

To:

  • Describing infrastructure and workloads declaratively
  • Letting control loops reconcile state
  • Treating servers as replaceable capacity

The access model changed as well. Instead of remote desktop or SSH being the primary control mechanism, the Kubernetes API became the centre of gravity. Everything talks to the API server.

This shift—from imperative scripts to declarative configuration—is one of the most important architectural changes Kubernetes introduced.


Core Kubernetes Architecture

To understand AKS, you first need to understand core Kubernetes components.

At its heart, Kubernetes is split into two logical areas: the control plane and the worker nodes.

The Control Plane – The Brain of the Cluster

The control plane is the brain of the cluster. It makes decisions, enforces state, and exposes the Kubernetes API.

Key components include:

API Server

The API server is the front door. Whether you use kubectl, a CI/CD pipeline, or a GitOps tool, every request flows through the API server. It validates requests and persists changes.

  • Entry point for all Kubernetes operations
  • Validates and processes requests
  • Exposes the Kubernetes API

Everything—kubectl, CI/CD pipelines, controllers—talks to the API server.

etcd

Behind the scenes sits etcd, a distributed key-value store that acts as the source of truth. It stores the desired and current state of the cluster. If etcd becomes unavailable, the cluster effectively loses its memory.

  • Distributed key-value store
  • Holds the desired and current state of the cluster
  • Source of truth for Kubernetes

If etcd is unhealthy, the cluster cannot function correctly.

Scheduler

The scheduler is responsible for deciding where workloads run. When you create a pod, the scheduler evaluates resource availability and constraints before assigning it to a node.

  • Decides which node a pod should run on
  • Considers resource availability, constraints, and policies

Controller Manager

The controller manager runs continuous reconciliation loops. It constantly compares the desired state (for example, “I want three replicas”) with the current state. If a pod crashes, the controller ensures another is created.

  • Runs control loops
  • Continuously checks actual state vs desired state
  • Takes action to reconcile differences

This combination is what makes Kubernetes self-healing and declarative.
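You can watch a reconciliation loop in action. A sketch, assuming a Deployment labelled `app=web` with three replicas already exists in the cluster (the pod name below is a placeholder):

```shell
# Three pods are running for the hypothetical "web" Deployment
kubectl get pods -l app=web

# Simulate a crash by deleting one of them
kubectl delete pod web-7c9d8f6b5-abcde

# The controller notices actual state (2) no longer matches desired
# state (3) and schedules a replacement
kubectl get pods -l app=web
```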


Worker Nodes – Where Work Actually Happens

Worker nodes are where your workloads actually run.

Each node contains:

kubelet

Each node runs a kubelet, which acts as the local agent communicating with the control plane. It ensures that the containers defined in pod specifications are actually running.

  • Agent running on each node
  • Ensures containers described in pod specs are running
  • Reports node and pod status back to the control plane

Container Runtime

Underneath that sits the container runtime—most commonly containerd today. This is what actually starts and stops containers.

  • Responsible for running containers
  • Historically Docker, now containerd in most environments

kube-proxy

Networking between services is handled through Kubernetes networking constructs and components such as kube-proxy, which manages traffic rules.

  • Handles networking rules
  • Enables service-to-service communication

Pods, Services, and Deployments

Above this infrastructure layer, Kubernetes introduces abstractions like pods, deployments, and services. These abstractions allow you to reason about applications instead of machines.

Pods

  • Smallest deployable unit in Kubernetes
  • One or more containers sharing networking and storage

Deployments

  • Define how pods are created and updated
  • Enable rolling updates and rollback
  • Maintain desired replica counts

Services

  • Provide stable networking endpoints
  • Abstract away individual pod lifecycles

You don’t deploy to a server. You declare a deployment. You don’t track IP addresses. You define a service.
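To make those abstractions concrete, here is a minimal sketch of a Deployment and a Service. The names and the nginx image are illustrative, not from any particular workload:

```yaml
# A Deployment that maintains three replicas of a container,
# and a Service giving them a stable network endpoint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
```

Apply this with `kubectl apply -f web.yaml` and Kubernetes maintains three pods behind a stable `web` endpoint, regardless of which nodes they land on or how often they are replaced.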

How This Maps to Azure Kubernetes Service (AKS)

AKS does not change Kubernetes—it operationalises it. The Kubernetes architecture remains the same, but the responsibility model changes.

In a self-managed cluster, you are responsible for the control plane. You deploy and maintain the API server. You protect and back up etcd. You manage upgrades.

In AKS, Azure operates the control plane for you.

Microsoft manages the API server, etcd, and control plane upgrades. You still interact with Kubernetes in exactly the same way—through the API—but you are no longer responsible for maintaining its most fragile components.

You retain responsibility for worker nodes, node pools, scaling, and workload configuration. That boundary is deliberate.

It aligns directly with the Azure Well-Architected Framework:

  • Operational Excellence through managed control plane abstraction
  • Reduced operational risk and complexity
  • Clear separation between platform and workload responsibility

AKS is Kubernetes—operationalised.


Why This Matters for Production AKS

Every production AKS decision maps back to Kubernetes architecture:

  • Networking choices affect kube-proxy and service routing
  • Node pool design affects scheduling and isolation
  • Scaling decisions interact with controllers and the scheduler

Without understanding the underlying architecture, AKS can feel opaque.

With that understanding, it becomes predictable.


What Comes Next

Now that we understand:

  • Why containers emerged
  • Why Kubernetes exists
  • How Kubernetes is architected
  • How AKS maps to that architecture

We’re ready to start making design decisions.

In the next post, we’ll move into AKS architecture fundamentals, including:

  • Control plane and data plane separation
  • System vs user node pools
  • Regional design and availability considerations

See you in the next post!

What Is Azure Kubernetes Service (AKS) and Why Should You Care?

In every cloud native architecture discussion you have had over the last few years, or will have in the years to come, you can guarantee that someone has introduced (or will introduce) Kubernetes as a hosting option for your solution.

There are also different ways to run Kubernetes once it enters the conversation.

Kubernetes promises portability, scalability, and resilience. In reality, operating Kubernetes yourself is anything but simple.

Have you ever wondered whether Kubernetes is worth the complexity—or how to move from experimentation to something you can confidently run in production?

Me too – so let’s try and answer that question. Anyone who knows me, or has followed me for a few years, knows that I like to get down to the basics and “start at the start”.

This is the first post of a blog series focusing on Azure Kubernetes Service (AKS), while also covering the core Kubernetes concepts that underpin it. The goal of this series is:

By the end (whenever that is – there is no set time or number of posts), we will have designed and built a production‑ready AKS cluster, aligned with the Azure Well‑Architected Framework, and suitable for real‑world enterprise workloads.

With the goal clearly defined, let’s start at the beginning—not by deploying workloads or tuning YAML, but by understanding:

  • Why AKS exists
  • What problems it solves
  • When it’s the right abstraction

What Is Azure Kubernetes Service (AKS)?

Azure Kubernetes Service (AKS) is a managed Kubernetes platform provided by Microsoft Azure. It delivers a fully supported Kubernetes control plane while abstracting away much of the operational complexity traditionally associated with running Kubernetes yourself.

At a high level:

  • Azure manages the Kubernetes control plane (API server, scheduler, etcd)
  • You manage the worker nodes (VM size, scaling rules, node pools)
  • Kubernetes manages your containers and workloads

This division of responsibility is deliberate. It allows teams to focus on applications and platforms rather than infrastructure mechanics.

You still get:

  • Native Kubernetes APIs
  • Open‑source tooling (kubectl, Helm, GitOps)
  • Portability across environments

But without needing to design, secure, patch, and operate Kubernetes from scratch.

Why Should You Care About AKS?

The short answer:

AKS enables teams to build scalable platforms without becoming Kubernetes operators.

The longer answer depends on the problems you’re solving.

AKS becomes compelling when:

  • You’re building microservices‑based or distributed applications
  • You need horizontal scaling driven by demand
  • You want rolling updates and self‑healing workloads
  • You’re standardising on containers across teams
  • You need deep integration with Azure networking, identity, and security

Compared to running containers directly on virtual machines, AKS introduces:

  • Declarative configuration
  • Built‑in orchestration
  • Fine‑grained resource management
  • A mature ecosystem of tools and patterns

However, this series is not about adopting AKS blindly. Understanding why AKS exists—and when it’s appropriate—is essential before we design anything production‑ready.


AKS vs Azure PaaS Services: Choosing the Right Abstraction

Another common—and more nuanced—question is:

“Why use AKS at all when Azure already has PaaS services like App Service or Azure Container Apps?”

This is an important decision point, and one that shows up frequently in the Azure Architecture Center.

Azure PaaS Services

Azure PaaS offerings such as App Service, Azure Functions, and Azure Container Apps work well when:

  • You want minimal infrastructure management responsibility
  • Your application fits well within opinionated hosting models
  • Scaling and availability can be largely abstracted away
  • You’re optimising for developer velocity over platform control

They provide:

  • Very low operational overhead – the service is an “out of the box” offering where developers can get started immediately.
  • Built-in scaling and availability – scaling comes as part of the service based on demand, and can be configured based on predicted loads.
  • Tight integration with Azure services – integration with tools such as Azure Monitor and Application Insights for monitoring, Defender for Security monitoring and alerting, and Entra for Identity.

For many workloads, this is exactly the right choice.

AKS

AKS becomes the right abstraction when:

  • You need deep control over networking, runtime, and scheduling
  • You’re running complex, multi-service architectures
  • You require custom security, compliance, or isolation models
  • You’re building a shared internal platform rather than a single application

AKS sits between IaaS and fully managed PaaS:

Azure PaaS abstracts the platform for you. AKS lets you build the platform yourself—safely.

This balance of control and abstraction is what makes AKS suitable for production platforms at scale.


Exploring AKS in the Azure Portal

Before designing anything that could be considered “production‑ready”, it’s important to understand what Azure exposes out of the box – so let’s spin up an AKS instance using the Azure Portal.

Step 1: Create an AKS Cluster

  • Sign in to the Azure Portal
  • In the search bar at the top, search for Kubernetes Service
  • When you get to the “Kubernetes center page”, click on “Clusters” on the left menu (it should bring you here automatically). Select Create, and select “Kubernetes cluster”. Note that there are also options for “Automatic Kubernetes cluster” and “Deploy application” – we’ll address those in a later post.
  • Choose your Subscription and Resource Group
  • Choose a Cluster preset configuration, enter a Cluster name, and select a Region. You can choose from four different preset configurations, each with clear explanations based on your requirements
  • I’ve gone for Dev/Test for the purposes of spinning up this demo cluster.
  • Leave all other options as default for now and click “Next” – we’ll revisit these in detail in later posts.

Step 2: Configure the Node Pool

  • Under Node pools, there is an agentpool automatically added for us. You can change this if needed to select a different VM size, and set a low min/max node count

    This is your first exposure to separating capacity management from application deployment.

    Step 3: Networking

    Under Networking, you will see options for Private/Public Access, and also for Container Networking. This is an important choice as there are two clear options:

    • Azure CNI Overlay – Pods get IPs from a private CIDR address space that is separate from the node VNet.
    • Azure CNI Node Subnet – Pods get IPs directly from the same VNet subnet as the nodes.

    You also have the option to integrate this into your own VNet which you can specify during the cluster creation process.

    Again, we’ll talk more about these options in a later post, but it’s important to understand the distinction between the two.

    Step 4: Review and Create

    Select Review + Create – note that at this point I have not selected any monitoring, security, or integration with an Azure Container Registry, and am just taking the defaults. Again (you’re probably bored of reading this….), we’ll deal with each of these in a dedicated later post.
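The same cluster can be created from the Azure CLI. A rough equivalent of the portal steps above – resource group, cluster name, region, and VM size are all placeholders to adjust:

```shell
az group create --name rg-aks-demo --location westeurope

# Dev/test-sized cluster using Azure CNI Overlay networking
az aks create \
  --resource-group rg-aks-demo \
  --name aks-demo \
  --node-count 2 \
  --node-vm-size Standard_D2s_v3 \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --generate-ssh-keys

# Fetch credentials so kubectl targets the new cluster
az aks get-credentials --resource-group rg-aks-demo --name aks-demo
kubectl get nodes
```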

    Once deployed, explore:

    • Node pools
    • Workloads
    • Services and ingresses
    • Cluster configuration

    Notice how much complexity is hidden – if you scroll back up to the “Azure-managed vs Customer-managed” diagram, you have responsibility for managing:

    • Cluster nodes
    • Networking
    • Workloads
    • Storage

    Even though Azure abstracts away responsibility for the key-value store, scheduler, controllers, and management of the cluster API, a large amount of responsibility still remains.


    What Comes Next in the Series

    This post sets the foundation for what AKS is and how it looks out of the box using a standard deployment with the “defaults”.

    Over the course of the series, we’ll move through the various concepts which will help to inform us as we move towards making design decisions for production workloads:

    • Kubernetes Architecture Fundamentals (control plane, node pools, and cluster design), and how they look in AKS
    • Networking for Production AKS (VNets, CNI, ingress, and traffic flow)
    • Identity, Security, and Access Control
    • Scaling, Reliability, and Resilience
    • Cost Optimisation and Governance
    • Monitoring, Alerting and Visualizations
    • Alignment with the Azure Well Architected Framework
    • And lots more ……

    See you in the next post!

    Azure Lab Services Is Retiring: What to Use Instead (and How to Plan Your Migration)

    Microsoft has announced that Azure Lab Services will be retired on June 28, 2027. New customer sign-ups have already been disabled as of July 2025, which means the clock is officially ticking for anyone using the service today.

    You can read the official announcement on Microsoft Learn here: https://learn.microsoft.com/en-us/azure/lab-services/retirement-guide

    While 2027 may feel a long way off, now is the time to take action!

    For those of you who have never heard of Azure Lab Services, let’s take a look at what it was and how you would have interacted with it (even if you didn’t know you were!).

    What is/was Azure Lab Services?

    Image: Microsoft Learn

    Azure Lab Services allowed you to create labs with infrastructure managed by Azure. The service handled all the infrastructure management, from spinning up virtual machines (VMs) to handling errors and scaling the infrastructure.

    If you’ve ever been on a Microsoft course, participated in a Virtual Training Days course, or attended a course run by a Microsoft MCT, Azure Lab Services is what the trainer would have used to facilitate:

    • Classrooms and training environments
    • Hands-on labs for workshops or certifications
    • Short-lived dev/test environments

    Azure Lab Services was popular because it abstracted away a lot of complexity around building lab or classroom environments. Its retirement doesn’t mean Microsoft is stepping away from virtual labs—it means the responsibility shifts back to architecture choices based on the requirements you have.

    If you or your company is using Azure Lab Services, the transition to a new service is one of those changes where early planning pays off—especially if your labs are tied to academic calendars, training programmes, or fixed budgets.

    So what are the alternatives?

    Microsoft has outlined several supported paths forward. None are a 1:1 replacement, so the “right” option depends on who your users are and how they work. While these solutions aren’t necessarily education-specific, they support a wide range of education and training scenarios.

    Azure Virtual Desktop (AVD)

    Image: Microsoft Learn

    🔗 https://learn.microsoft.com/azure/virtual-desktop/

    AVD is the most flexible option and the closest match for large-scale, shared lab environments. AVD is ideal for providing full desktop and app delivery scenarios and provides the following benefits:

    • Multi-session Windows 10/11, with either Full Desktop or Single App Delivery options
    • Full control over networking, identity, and images. One of the great new features of AVD (still in preview) is that you can now use Guest Identities in your AVD environments, which can be really useful for training environments and takes the overhead of user management away.
    • Ideal for training labs with many concurrent users
    • Supports scaling plans to reduce costs outside working hours (check out my blog post on using Scaling Plans in your AVD Environments)

    I also wrote a set of blog posts about setting up your AVD environments from scratch which you can find here and here.

    Windows 365

    🔗 https://learn.microsoft.com/windows-365/

    Windows 365 offers a Cloud PC per user, abstracting away most infrastructure concerns. Cloud PC virtual machines are Microsoft Entra ID joined and support centralized end-to-end management using Microsoft Intune. You assign Cloud PCs by assigning a license to a user, in the same way you would assign Microsoft 365 licences. The benefits of Windows 365 are:

    • Simple to deploy and manage
    • Predictable per-user pricing
    • Well-suited to classrooms or longer-lived learning environments

    The trade-off is less flexibility and a typically higher cost per user than shared AVD environments, as Cloud PCs are dedicated to their users and cannot be shared.

    Azure DevTest Labs

    Image: Microsoft Learn

    🔗 https://learn.microsoft.com/azure/devtest-labs/

    A strong option for developer-focused labs, Azure DevTest Labs is targeted at enterprise customers. It also has a key difference from the other alternatives: it’s the only one that offers access to Linux VMs as well as Windows VMs.

    • Supports Windows and Linux
    • Built-in auto-shutdown and cost controls
    • Works well for dev/test and experimentation scenarios

    Microsoft Dev Box

    🔗 https://learn.microsoft.com/dev-box/

    Dev Box is aimed squarely at professional developers. It’s ideal for facilitating hands-on learning where training leaders can use Dev Box supported images to create identical virtual machines for trainees. Dev Box virtual machines are Microsoft Entra ID joined and support centralized end-to-end management with Microsoft Intune.

    • High-performance, secure workstations
    • Integrated with developer tools and workflows
    • Excellent for enterprise engineering teams

    However, it’s important to note that as of November 2025, Dev Box is being integrated into Windows 365. The service is built on top of Windows 365, so Microsoft has decided to unify the offerings. You can read more about this announcement here, but as of November 2025, Microsoft is no longer accepting new Dev Box customers – https://learn.microsoft.com/en-us/azure/dev-box/dev-box-windows-365-announcement?wt.mc_id=AZ-MVP-5005255

    When First-Party Options Aren’t Enough

    If you relied heavily on the lab orchestration features of Azure Lab Services (user lifecycle, lab resets, guided experiences), you may want to evaluate partner platforms that build on Azure.

    These solutions provide:

    • Purpose-built virtual lab platforms
    • User management and lab automation
    • Training and certification-oriented workflows

    They add cost, but also significantly reduce operational complexity.

    Comparison: Azure Lab Services Alternatives

    Let’s take a look at a comparison of each service showing cost, use cases, and strengths:

| Service | Typical Cost Model | Best Use Cases | Key Strength | When 3rd Party Tools Are Needed |
|---|---|---|---|---|
| Azure Virtual Desktop | Pay-per-use (compute + storage + licensing) | Large classrooms, shared labs, training environments | Maximum flexibility and scalability | For lab orchestration, user lifecycle, guided labs |
| Windows 365 | Per-user, per-month | Classrooms, longer-lived learning PCs | Simplicity and predictability | Rarely needed |
| Azure DevTest Labs | Pay-per-use with cost controls | Dev/test, experimentation, mixed OS labs | Cost governance | For classroom-style delivery |
| Microsoft Dev Box | Per-user, per-month | Enterprise developers | Performance and security | Not typical |
| Partner Platforms | Subscription + Azure consumption | Training providers, certification labs | Turnkey lab experiences | Core dependency |

    Don’t Forget Hybrid Scenarios

    If some labs or dependencies must remain on-premises, you can still modernise your management approach by deploying Azure Virtual Desktop locally and managing it with Azure Arc, which will allow you to:

    • Apply Azure governance and policies
    • Centralise monitoring and management
    • Transition gradually toward cloud-native designs

    Start Planning Now

    With several budget cycles between now and June 2027, the smartest move is to:

    1. Inventory existing labs and usage patterns
    2. Map them to the closest-fit replacement
    3. Pilot early with a small group of users
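For step 1, the Azure CLI can give you a quick starting inventory. A sketch – the exact resource types you see will depend on whether you used the newer lab plans or the older lab accounts:

```shell
# List Lab Services resources across the current subscription
az resource list --resource-type "Microsoft.LabServices/labPlans" --output table
az resource list --resource-type "Microsoft.LabServices/labs" --output table
```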

    Azure Lab Services isn’t disappearing tomorrow—but waiting until the last minute will almost certainly increase cost, risk, and disruption.

    If you treat this as an architectural evolution rather than a forced migration, you’ll end up with a platform that’s more scalable, more secure, and better aligned with how people actually learn and work today.

    Every new beginning comes from some other beginning’s end – a quick review of 2023

    Today is a bit of a “dud day” – post Xmas, post birthdays (mine and my son’s), but before the start of a New Year and the inevitable return to work.

    So, it’s a day for planning for 2024. And naturally, any planning requires some reflection and a look back on what I achieved over the last year.

    Highlights from 2023

    If I’m being honest, my head was in a bit of a spin at the start of 2023. I was coming off the high of submitting my first pre-recorded content session to Festive Tech Calendar, but in the back of my mind I knew a change was coming, as I’d made the decision to change jobs.

    I posted the list of goals above on LinkedIn and Twitter (when it was still called that…) on January 2nd, so let’s see how I did:

    • Present at both a Conference and User Group – check!
    • Mentor others, work towards MCT – Mentoring was one of the most fulfilling activities I undertook over the last year: the ability to connect with people in the community who need help, advice, or just an outsider’s view. It’s something I would recommend to anyone. I also learned that mentoring and training are not connected (I may look at the MCT in 2024) – mentoring is more about asking the right questions, being on the same wavelength as your mentees, and understanding their goals to ensure you are aligning and advising them on the correct path.
    • Go deep on Azure Security, DevOps and DevOps Practices – starting a new job this year with a company that is DevSecOps and IAC focused was definitely a massive learning curve and one that I thoroughly enjoyed!
    • AZ-400 and SC-100 Certs – nope! The one certification I passed this year was AZ-500, but to follow on from the previous point, it’s not all about exams and certifications. I’d feel more confident having a go at the AZ-400 exam now that I have nearly a year’s experience in DevOps, and it’s something I’ve been saying for a while now – hiring teams aren’t (well, they shouldn’t be!) interested in tons of certifications, they want to see actual experience in the subject that backs the certification.
    • Create Tech Content – check! I was fortunate to be able to submit sessions to both online events and also present live at Global Azure Dublin and South Coast Summit this year. It was also the year when my first LinkedIn Learning course was published (shameless plug, check it out at this link).
    • Run Half Marathon – Sadly no to this one. I made a few attempts and was a week away from my first half-marathon back in March when my knee decided to give up the ghost. Due to work and family commitments, I never returned to it, but it’s back on the list for 2024.
    • Get back to reading books to relax – This is something we all need to do, turn off that screen at night and find time to relax. I’ve done a mix of Tech and Fiction books and hope to continue this trend for 2024.

    By far though, the biggest thing to happen for me this year was when this email landed in my inbox on April Fools Day …..

    I thought it was an April Fools joke. And if my head was spinning, you can imagine how fast it was spinning now!

    For anyone involved in Microsoft technologies or solutions, being awarded the MVP title is a dream that we all aspire to. It’s recognition from Microsoft that you are not only a subject matter expert in your field, but someone who is looked up to by other community members for content. If we look at the official definition from Microsoft:

    The Microsoft Most Valuable Professionals (MVP) program recognizes exceptional community leaders for their technical expertise, leadership, speaking experience, online influence, and commitment to solving real world problems.

    I’m honoured to be part of this group, getting to know people that I looked up to and still look up to, who push me to be a better person each and every day.

    Onwards to 2024!

    So what are my goals for 2024? Well unlike last year where I explicitly said what I was going to do and declared it, this year is different as I’m not entirely sure. But ultimately, it boils down to 3 main questions:

    • What are my community goals?

    The first goal is to do enough to maintain and renew my MVP status for another year. I hope I’ve done enough and will keep working up to the deadline, but you never really know! I have another blog post in the works where I’ll talk about the MVP award, what it’s meant to me, and some general advice from my experiences of my first year with the award.

    I’ve gotten the bug for public speaking and want to submit some more sessions to conferences and user groups over the next year. So I plan to submit to some CFS, but if anyone wants to have me at a user group, please get in touch!

    I’ve enjoyed mentoring others on their journey, and the fact that they keep coming back means that the mentees have found me useful as well!

    Blogging – this is my 3rd blog post of the year, and my last one was in March! I want to get some consistency back into blogging as it’s something I enjoy doing.

    • What are my learning goals?

    I think like everyone, the last 12 months have been a whirlwind of Copilots and AI. I plan to immerse myself in that over the coming year, while also growing my knowledge of Azure. Another goal is to learn some Power Platform – its a topic I know very little about, but want to know more! After that, the exams and the certs will come!

    • What are my personal goals?

    So unlike last year, I’m not going to declare that I’ll do a half marathon – at least not in public! The plan is to keep reading both tech and fiction books, keep making some time for myself, and to make the most of my time with my family. Because despite how much the job and the community pulls you back in, there is nothing more important and you’ll never have enough family time.

    So that’s all from me for 2023 – you’ll be hearing from me again in 2024! Hope you’ve all had a good holiday, and Happy New Year to all!