In the previous post From Containers to Kubernetes Architecture, we walked through the evolution from client/server to containers, and from Docker to Kubernetes. We looked at how orchestration became necessary once we stopped deploying single applications to single servers.
Now it’s time to move from history to design. In this post we’re going to get practical, focusing on:
How Azure Kubernetes Service (AKS) is actually structured — and what architectural decisions matter from day one.
Control Plane vs Data Plane – The First Architectural Boundary
In line with the core design of a vanilla Kubernetes cluster, every AKS cluster is split into two logical areas:
- The control plane (managed by Azure)
- The data plane (managed by you)

We looked at this in the last post, but let’s remind ourselves of the components that make up each area.
The Control Plane (Azure Managed)
When you create an AKS cluster, you do not deploy your own API server or etcd database. Microsoft runs the Kubernetes control plane for you.
That includes:
- The Kubernetes API server
- etcd (the cluster state store)
- The scheduler
- The controller manager
- Control plane patching and upgrades
This is not just convenience — it is risk reduction. Operating a highly available Kubernetes control plane is non‑trivial. It requires careful configuration, backup strategies, certificate management, and upgrade sequencing.
In AKS, that responsibility shifts to Azure. You interact with the cluster via the Kubernetes API (through kubectl, CI/CD pipelines, GitOps tools, or the Azure Portal), but you are not responsible for keeping the brain of the cluster alive.
That abstraction directly supports:
- Operational Excellence
- Reduced blast radius
- Consistent lifecycle management
It also empowers operations teams to let development teams start their delivery cycles earlier in a project, rather than waiting for the control plane to be stood up and made functionally ready.
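To make that concrete, here is a minimal sketch of what “consuming” the managed control plane looks like from code. It assumes you have already pulled credentials for the cluster (for example with az aks get-credentials) and uses the official Python Kubernetes client; the node-pool label shown is the one AKS typically stamps on nodes, but treat the details as illustrative rather than definitive.

```python
from kubernetes import client, config

def list_cluster_nodes() -> None:
    # Load the same kubeconfig that kubectl uses (for example, the one
    # created by `az aks get-credentials`).
    config.load_kube_config()

    # Every call below goes to the Azure-managed API server:
    # we consume the control plane, we do not operate it.
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        labels = node.metadata.labels or {}
        print(
            node.metadata.name,
            node.status.node_info.kubelet_version,
            # AKS normally labels each node with its pool name (illustrative).
            labels.get("kubernetes.azure.com/agentpool", "unknown"),
        )

if __name__ == "__main__":
    list_cluster_nodes()
```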
The Data Plane (Customer Managed)
The data plane is where your workloads run. This consists of:
- Virtual machine nodes
- Node pools
- Pods and workloads
- Networking configuration
You choose:
- VM SKU
- Scaling behaviour
- Availability zones
- OS configuration (within supported boundaries)
This is intentional. Azure abstracts complexity where it makes sense, but retains control and flexibility where architecture matters.
Node Pools – Designing for Isolation and Scale
One of the most important AKS concepts is the node pool. A node pool is a group of VMs with the same configuration. At first glance, it may look like a scaling convenience feature. In production, it is an isolation and governance boundary.
There are two types of node pool: system and user.
System Node Pool
Every AKS cluster requires at least one system node pool, which is a specialized group of nodes dedicated to hosting critical cluster components. While you can run application pods on them, their primary role is ensuring the stability of core services.
This pool runs:
- Core Kubernetes components
- Critical system pods
In production, this pool should be:
- Small but resilient
- Dedicated to system workloads
- Not used for business application pods
In our first post, we took the default “Node Pool” option – however you do have the option to add a dedicated system node pool:


It is recommended that you create a dedicated system node pool so that critical system pods are isolated from your application pods; this prevents misconfigured or rogue application pods from accidentally deleting system pods.
The system node pool does not need to use the same VM SKU as your user node pools; however, it’s recommended that you make both highly available.
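As a rough sketch of what that looks like outside the portal, the snippet below adds a dedicated system node pool using the azure-mgmt-containerservice Python SDK. The resource names and VM SKU are illustrative, and the field names follow the SDK’s AgentPool model, so double-check them against the SDK version you install; the CriticalAddonsOnly taint is the one commonly used to keep application pods off system nodes.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient
from azure.mgmt.containerservice.models import AgentPool

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "rg-aks-demo"         # illustrative
CLUSTER_NAME = "aks-demo"              # illustrative

client = ContainerServiceClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# mode="System" marks the pool for cluster-critical pods; the taint keeps
# ordinary application pods from being scheduled onto it.
poller = client.agent_pools.begin_create_or_update(
    RESOURCE_GROUP,
    CLUSTER_NAME,
    "systempool",
    AgentPool(
        mode="System",
        count=3,                             # small but resilient
        vm_size="Standard_D4s_v5",           # does not need to match user pools
        availability_zones=["1", "2", "3"],  # spread across zones
        node_taints=["CriticalAddonsOnly=true:NoSchedule"],
    ),
)
print(poller.result().provisioning_state)
```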
User Node Pools
User node pools are where your applications and workloads run. You can create multiple pools for different purposes within the same AKS cluster:
- Compute‑intensive workloads
- GPU workloads
- Batch jobs
- Isolated environments
Traditionally, these workloads would have lived on their own dedicated hardware or processing areas. Running them in separate node pools within one AKS cluster gives you:
- Better scheduling control – taints, tolerations, and node labels let you steer each workload to the right pool (a sketch of this follows below).
- Resource isolation – workloads are confined to their own nodes, so a noisy batch job cannot starve your application pods.
- Cost optimisation – everything still runs within a single cluster, with each pool sized for its workload, so cost stays predictable and stable.
In production, multiple node pools are not optional — they are architectural guardrails.
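Here is the other half of the picture: how a workload actually lands on the right pool. This sketch assumes a user node pool already exists with an illustrative label (workload=gpu) and taint (sku=gpu:NoSchedule), and uses the Python Kubernetes client to create a Deployment that targets it.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="gpu-inference"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "gpu-inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "gpu-inference"}),
            spec=client.V1PodSpec(
                # Only schedule onto nodes carrying the user pool's label...
                node_selector={"workload": "gpu"},
                # ...and tolerate the taint that keeps everything else off them.
                tolerations=[
                    client.V1Toleration(
                        key="sku", operator="Equal", value="gpu", effect="NoSchedule"
                    )
                ],
                containers=[
                    client.V1Container(
                        name="inference",
                        image="myregistry.azurecr.io/inference:latest",  # illustrative
                    )
                ],
            ),
        ),
    ),
)

apps.create_namespaced_deployment(namespace="default", body=deployment)
```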
Regional Design and Availability Zones
Like all Azure resources, when you create an AKS cluster you choose a region. That decision impacts latency, compliance, resilience, and cost.
But production architecture requires a deeper question:
How does this cluster handle failure?
Azure supports availability zones in many regions. A production AKS cluster should at a bare minimum:
- Use zone‑aware node pools
- Distribute nodes across multiple availability zones
This ensures that a single data centre failure does not bring down your workloads. It’s important to understand that:
- The control plane is managed by Azure for high availability
- You are responsible for ensuring node pool zone distribution
Availability is a shared responsibility.
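A quick way to check your side of that responsibility: the sketch below counts nodes per zone using the standard topology.kubernetes.io/zone label, assuming you have a local kubeconfig for the cluster.

```python
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
nodes = client.CoreV1Api().list_node().items

# Count nodes per availability zone using the standard topology label.
zones = Counter(
    (node.metadata.labels or {}).get("topology.kubernetes.io/zone", "no-zone")
    for node in nodes
)
for zone, count in sorted(zones.items()):
    print(f"{zone}: {count} node(s)")
```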
Networking Considerations (At a High Level)
AKS integrates into an Azure Virtual Network. That means your cluster:
- Has IP address planning implications
- Participates in your broader network topology
- Must align with security boundaries
Production mistakes often start here:
- Overlapping address spaces
- Under‑sized subnets
- No separation between environments
Networking is not a post-deployment tweak – it’s a day-one design decision that you make with your wider cloud and architecture teams. We’ll go deep into networking in the next post, but even at the architecture stage, you need to make conscious choices.
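To show why this matters on day one, here is a back-of-the-envelope subnet sizing sketch. It assumes the traditional (non-overlay) Azure CNI model, where each node takes one IP for itself plus one per potential pod, and that Azure reserves five addresses in every subnet; treat the formula as a starting point and confirm it against the current Azure CNI documentation for the networking model you choose.

```python
import math

def required_subnet_prefix(nodes: int, max_pods_per_node: int, surge_nodes: int = 1) -> int:
    """Smallest subnet prefix length that fits the cluster (assumptions above)."""
    ips_needed = (nodes + surge_nodes) * (max_pods_per_node + 1)
    ips_needed += 5  # Azure reserves 5 addresses in every subnet
    return 32 - math.ceil(math.log2(ips_needed))

# Example: 3 system nodes + 6 user nodes, default 30 pods per node.
print(required_subnet_prefix(nodes=9, max_pods_per_node=30))  # prints 23, i.e. a /23
```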
Upgrade Strategy – Often Ignored, Always Critical
Kubernetes evolves quickly. AKS supports multiple Kubernetes versions, but older versions are eventually deprecated. The full list of supported AKS versions can be found on the AKS Release Status page.
A production architecture must consider:
- Version lifecycle
- Upgrade cadence
- Node image updates
In AKS, control plane upgrades are managed by Azure — but you control when upgrades occur and how node pools are rolled. When you create your cluster, you can specify the upgrade option you wish to use:

It’s important to pay attention to this, as it may affect your workloads. For example, starting in Kubernetes v1.35, Ubuntu 24.04 becomes the default OS SKU. This means that if you are upgrading from a lower version of Kubernetes, your node OS will be automatically upgraded from Ubuntu 22.04 to 24.04.
Ignoring upgrade planning is one of the fastest ways to create technical debt in a cluster. This is why testing in lower environments to see how your workloads react to these upgrades is vital.
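As a small example of treating upgrades as routine operations, the sketch below reads a cluster’s current version, its auto-upgrade channel, and the versions it could move to, using the azure-mgmt-containerservice SDK. Names are illustrative and the attribute names should be verified against the SDK version you use.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient

client = ContainerServiceClient(DefaultAzureCredential(), "<subscription-id>")

# Current version and the upgrade channel chosen at creation time.
cluster = client.managed_clusters.get("rg-aks-demo", "aks-demo")
print("Current version:", cluster.kubernetes_version)
if cluster.auto_upgrade_profile:
    print("Auto-upgrade channel:", cluster.auto_upgrade_profile.upgrade_channel)

# Control plane versions this cluster could upgrade to.
profile = client.managed_clusters.get_upgrade_profile("rg-aks-demo", "aks-demo")
for upgrade in profile.control_plane_profile.upgrades or []:
    print("Available upgrade:", upgrade.kubernetes_version)
```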
Mapping This to the Azure Well‑Architected Framework
Let’s anchor this back to the bigger picture and see how AKS maps to the Azure Well-Architected Framework.
Operational Excellence
- AKS’s managed control plane reduces operational complexity.
- Node pools introduce structured isolation.
- Availability zones improve resilience.
Designing with these in mind from the start prevents reactive firefighting later.
Reliability
Zone distribution, multiple node pools, and scaling configuration directly influence workload uptime. Reliability is not added later — it is designed at cluster creation.
Cost Optimisation
Right‑sizing node pools and separating workload types prevents over‑provisioning. Production clusters that mix everything into one large node pool almost always overspend.
Production Guardrails – Early Principles
Before we move into deeper topics in the next posts, let’s establish a few foundational guardrails:
- Separate system and user node pools
- Use availability zones where supported
- Plan IP addressing before deployment (again, we’ll dive into networking and how it affects workloads in more detail in the next post)
- Treat upgrades as part of operations, not emergencies
- Avoid “single giant node pool” design
These are not advanced optimisations. They are baseline expectations for production AKS.
What Comes Next
Now that we understand AKS architecture fundamentals, the next logical step is networking.
In the next post, we’ll go deep into:
- Azure CNI and networking models
- Ingress and traffic flow
- Internal vs external exposure
- Designing secure network boundaries
Because once architecture is clear, networking is what determines whether your cluster is merely functional or truly production ready.
See you on the next post!