Three Azure Networking Assumptions That Will Burn You in Production

Azure networking documentation covers a lot of ground. What it is less good at is surfacing the assumptions embedded in common configurations — the things that appear safe on paper but create real risk in production environments.

This post is about three of those assumptions:

  • NSG service tags and when to use them instead of IP ranges
  • The impact of default routes on Azure service connectivity
  • The behaviour of Private Endpoints in relation to NSG enforcement

These are not edge cases — they appear in standard Azure architectures and are the source of a disproportionate number of production networking incidents.

NSG Service Tags vs IP Ranges and CIDRs — Getting the Choice Right

Network Security Groups are the primary mechanism for controlling traffic in Azure virtual networks. When writing NSG rules, you have two options for specifying the source or destination of traffic: you can use explicit IP addresses and CIDR ranges, or you can use service tags.

Image Credit – Microsoft Learn

Both approaches have a place. The mistake is using one when you should be using the other.

What Service Tags Actually Are

Image Credit – Microsoft Learn

A service tag is a named group of IP address prefixes associated with a specific Azure service. When you reference a service tag in an NSG rule, Azure resolves it to the underlying IP ranges automatically — and manages and updates those ranges on a weekly basis. When you use a service tag, you are delegating IP list management to Microsoft. For Azure-managed services like Storage, Key Vault, and Azure Monitor, that is exactly the right trade-off. You should not be manually maintaining and updating those ranges.

When to Use Service Tags

Service tags belong in NSG rules wherever you are controlling traffic to or from an Azure-managed service whose IP ranges change over time: Azure Monitor, Key Vault, Container Registry, SQL, Service Bus, Event Hub. They are also the right choice for platform-level rules — using the AzureLoadBalancer tag for health probe allowances, for example, is far more reliable than trying to maintain a list of probe IPs.

When IP Ranges and CIDRs Are the Right Choice

Explicit CIDRs belong in NSG rules when you are controlling traffic between resources you own — on-premises ranges, partner network CIDRs, specific application subnets within your own VNet, or third-party services with stable published IP ranges. When your security team needs to audit exactly which addresses a rule permits, a CIDR answers that definitively. A service tag defers the answer to a Microsoft-managed list that changes weekly.
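The auditability point can be sketched with Python's stdlib ipaddress module. The rule sources below are hypothetical CIDRs, but the check itself is exactly what an explicit range gives you that a service tag cannot: a definitive yes/no answer that does not depend on a list Microsoft updates weekly.

```python
import ipaddress

# Hypothetical NSG rule sources: explicit CIDRs we own and can audit directly.
allowed_sources = ["10.10.0.0/16", "192.168.100.0/24"]

def is_permitted(ip: str) -> bool:
    """Return True if the address falls inside any explicitly allowed CIDR."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_sources)

print(is_permitted("10.10.42.7"))    # → True  (inside 10.10.0.0/16)
print(is_permitted("203.0.113.50"))  # → False (not in any allowed range)
```

With a service tag in place of `allowed_sources`, the same question can only be answered by downloading the current week's prefix list first.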

The Service Tag Scope Problem

The most common service tag mistake is using broad global tags when regional variants are available and appropriate.

Consider AzureCloud. Using this tag in an NSG rule opens access to the IP ranges associated with all Azure services globally — and critically, that includes IP addresses used by other Azure customers, not just Microsoft’s own infrastructure. This means AzureCloud is a much broader permission than most engineers assume. Microsoft’s own documentation explicitly warns that in most scenarios, allowing traffic from all Azure IPs via this tag is not recommended. If your workload only needs to communicate with services in West Europe, using AzureCloud.WestEurope instead gives you the same coverage for your actual traffic pattern while dramatically reducing the permitted address space.

Tag | Scope | Recommendation
AzureCloud | All Azure IP ranges globally — very broad | Avoid. Use regional variant or specific service tag.
AzureCloud.WestEurope | Azure IP ranges for West Europe only | Use when regional scoping is sufficient.
Storage | All Azure Storage endpoints globally | Prefer Storage.<region> where possible.
Storage.WestEurope | Azure Storage in West Europe only | Preferred for regionally scoped workloads.
AzureMonitor | Azure Monitor endpoints | Appropriate for monitoring agent outbound rules.
AzureLoadBalancer | Azure Load Balancer probe IPs | Always use for health probe allow rules.

The practical enforcement approach is to use Azure Policy to flag or deny NSG rules that reference broad global tags where regional equivalents exist. This moves the governance left — catching overly permissive rules before they reach production rather than after.

# Verify current service tag ranges for a region
az network list-service-tags \
  --location westeurope \
  --output json \
  --query "values[?name=='AzureCloud.WestEurope']"
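As a rough way to quantify the scope difference, a short Python sketch can total the address space behind each tag. The JSON below is a hand-made sample that only mirrors the shape of the az network list-service-tags output; the real prefix lists are far longer, so the numbers here are illustrative only.

```python
import ipaddress
import json

# Hand-made sample mirroring the shape of `az network list-service-tags`
# output. The real data contains hundreds of prefixes per tag; these two
# entries are illustrative assumptions, not the live ranges.
sample = json.loads("""
{
  "values": [
    {"name": "AzureCloud",
     "properties": {"addressPrefixes": ["13.64.0.0/11", "20.0.0.0/8"]}},
    {"name": "AzureCloud.WestEurope",
     "properties": {"addressPrefixes": ["13.69.0.0/16"]}}
  ]
}
""")

def address_count(tag_name: str) -> int:
    """Total number of addresses covered by a tag's prefixes."""
    for tag in sample["values"]:
        if tag["name"] == tag_name:
            return sum(ipaddress.ip_network(p).num_addresses
                       for p in tag["properties"]["addressPrefixes"])
    raise KeyError(tag_name)

broad = address_count("AzureCloud")
regional = address_count("AzureCloud.WestEurope")
print(f"AzureCloud permits {broad / regional:.0f}x more addresses")
# → AzureCloud permits 288x more addresses
```

Even with this tiny sample, the broad tag permits hundreds of times more address space; against the real prefix lists the gap is wider still.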

Default Routes, UDRs, and What Force Tunnelling Actually Breaks

Routing all outbound traffic through a central firewall via a 0.0.0.0/0 UDR is a standard hub-and-spoke pattern. Security teams require it, and it works — but it consistently catches engineers out in one area the documentation does not make obvious enough.

The problem is not that a default route intercepts too much traffic at the network layer. The problem is that most force-tunnelling configurations are deployed without firewall rules to permit the Azure service traffic that workloads silently depend on, and the symptoms that follow are rarely traced back to the routing change quickly.

168.63.129.16 — The Platform IP You Need to Understand

Before going further, it is worth being precise about 168.63.129.16. Microsoft documents this as a virtual public IP address — not a link-local address, but a special public IP owned by Microsoft and used across all Azure regions and national clouds. It provides DNS name resolution, Load Balancer health probe responses, DHCP, VM Agent communication, and Guest Agent heartbeat for PaaS roles.

The important thing to know about 168.63.129.16 in the context of UDRs is this: Microsoft Learn explicitly states that this address is a virtual IP of the host node and as such is not subject to user defined routes. Azure’s DHCP system injects a specific classless static route for 168.63.129.16/32 via the subnet gateway, ensuring platform traffic bypasses UDRs at the platform level. A 0.0.0.0/0 default route does not intercept traffic to this address.

What a 0.0.0.0/0 UDR does intercept is everything else: general internet-bound traffic, and outbound traffic to Azure service public endpoints — including Azure Monitor, Azure Key Vault, Microsoft Entra ID, Azure Container Registry, and any other PaaS service your workloads communicate with. These services are reachable via public IPs, and those IPs are subject to UDRs.

⚠️ What actually breaks when you deploy a 0.0.0.0/0 UDR without firewall rules Workloads that depend on Azure Monitor for diagnostics stop sending telemetry. Managed identity token acquisition fails if traffic to the Entra ID and IMDS endpoints is not permitted by the firewall. Applications pulling images from Azure Container Registry, secrets from Key Vault, or messages from Service Bus will lose connectivity. The common thread is that your firewall is now in the path of traffic it has not been told to permit. The workloads fail; the routing change is often not the first place anyone looks.

The Right Approach to Force Tunnelling

Deploying a 0.0.0.0/0 UDR and a corresponding firewall must be treated as a single unit of change, not two separate steps. The firewall rules need to be in place before the UDR is applied, not after symptoms appear.

Before any UDR deployment, inventory the Azure service dependencies of every workload on the affected subnet and verify that the firewall policy explicitly permits outbound traffic to the corresponding service tags. AzureMonitor, AzureActiveDirectory, AzureKeyVault, Storage, AzureContainerRegistry — each service your workloads depend on must have a corresponding firewall application or network rule. Then verify Effective Routes on the affected subnets and NICs to confirm what will happen before it happens.

# Review effective routes before any UDR change
az network nic show-effective-route-table \
  --resource-group <rg-name> \
  --name <nic-name> \
  --output table

A route table change that appears clean from a routing perspective can still break applications if the firewall has gaps. Effective Routes verification and firewall rule review should both be mandatory steps in any network change process that involves UDRs.
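One way to make the dependency review concrete is a pre-change script that diffs workload dependencies against what the firewall permits. The workload names and the flat set-of-tags view of the firewall below are simplifying assumptions, not the real Azure Firewall rule schema; the point is the gap calculation you run before the UDR goes in.

```python
# Service tags each workload on the subnet depends on (from your inventory).
# Workload names here are hypothetical.
workload_dependencies = {
    "app-frontend": {"AzureMonitor", "AzureKeyVault"},
    "app-worker":   {"AzureMonitor", "Storage", "AzureContainerRegistry"},
}

# Simplified view of the firewall policy: the service tags its rules
# currently permit outbound. Real Azure Firewall rules are richer than this.
firewall_allowed_tags = {"AzureMonitor", "AzureKeyVault"}

def missing_rules(deps: dict, allowed: set) -> dict:
    """Map each workload to the service tags the firewall does not yet permit."""
    gaps = {name: sorted(tags - allowed) for name, tags in deps.items()}
    return {name: tags for name, tags in gaps.items() if tags}

for workload, tags in missing_rules(workload_dependencies,
                                    firewall_allowed_tags).items():
    print(f"{workload}: add firewall rules for {', '.join(tags)}")
# → app-worker: add firewall rules for AzureContainerRegistry, Storage
```

Run this kind of check as part of the change review; an empty result is the precondition for applying the 0.0.0.0/0 route.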

Private Endpoints and the NSG Enforcement Gap

Private Endpoints are one of the most effective controls for locking down access to Azure PaaS services. When you deploy a Private Endpoint for a storage account, Key Vault, or SQL database, that service gets a private IP on your VNet subnet and traffic travels within the Azure backbone. The assumption that naturally follows — that an NSG on the Private Endpoint’s subnet controls access to it — is incorrect by default.

How NSG Evaluation Works for Private Endpoints

By default, network policies are disabled for a subnet in a virtual network — which means NSG rules are not evaluated for traffic destined for Private Endpoints on that subnet. This is a platform default, not a misconfiguration. A deny rule you expect to block access from a specific source will be silently ignored.

⚠️ Why this matters An application or workload with network connectivity to the Private Endpoint’s subnet can reach the private IP of the endpoint regardless of the NSG rules you have defined. There is no error, no alert, and no indication in the portal that the NSG is not being evaluated. The exposure is silent.

Enabling NSG Enforcement on Private Endpoint Subnets

Microsoft added the ability to restore NSG evaluation for Private Endpoint traffic through a subnet property called privateEndpointNetworkPolicies. Setting this property to Enabled causes NSG rules to be evaluated for traffic destined for Private Endpoints on that subnet, in the same way they would be for any other resource.

Azure CLI Method:

# Enable NSG enforcement for Private Endpoints on a subnet
az network vnet subnet update \
  --resource-group <rg-name> \
  --vnet-name <vnet-name> \
  --name <subnet-name> \
  --disable-private-endpoint-network-policies false

Portal Method:

In the portal, open the virtual network's Subnets blade, select the subnet, set the private endpoint network policy setting to Enabled, and save.

This change should be applied to all subnets where Private Endpoints are deployed, and it should be part of your standard subnet configuration in IaC rather than something applied reactively after deployment. In Terraform, the equivalent property is private_endpoint_network_policies = "Enabled" on the subnet resource.

NSG Rules Are Not Enough on Their Own

Enabling NSG enforcement is necessary, but it is not sufficient as the only access control for sensitive data services. Network controls restrict which sources can reach an endpoint — they cannot govern what those sources do once connected. Managed Identity with scoped RBAC assignments should be the minimum access model for any workload reaching Azure data services through a Private Endpoint.

✅ The defence-in-depth model for Private Endpoints Enable privateEndpointNetworkPolicies on the subnet so that NSG rules are enforced. Write NSG rules that restrict inbound access to the Private Endpoint’s private IP to only the sources that need it. Require Managed Identity authentication with scoped RBAC assignments for all service access. Disable public network access on the backing service entirely. These controls work together — removing any one of them weakens the posture.

Summary: Three Controls, Three Gaps

Area | Common Assumption | The Reality | The Fix
NSG Service Tags | AzureCloud is a safe, conservative choice | AzureCloud is a broad, dynamic tag covering all Azure IPs globally | Use regional tags (AzureCloud.WestEurope). Enforce with Azure Policy.
Default Routes / UDRs | A 0.0.0.0/0 UDR and a firewall are all you need to control outbound traffic | 168.63.129.16 is not subject to UDRs, but all Azure service endpoint traffic is — if the firewall has no rules for it, workloads break silently | Define firewall rules for all Azure service dependencies before applying the UDR. Check Effective Routes first.
Private Endpoints | An NSG on the PE subnet controls access to the endpoint | NSG rules are not evaluated for PE traffic by default | Enable privateEndpointNetworkPolicies on the subnet. Require Managed Identity + RBAC.

Conclusion

The three patterns in this post — service tag scoping, force-tunnelling configuration, and Private Endpoint NSG enforcement — share a common characteristic: they are not wrong configurations. They are default behaviours, or natural-seeming choices, whose consequences the documentation does not surface at the point where those choices are made.

The goal is to move the point of discovery earlier. Understanding that AzureCloud covers other customers’ IP ranges before writing that NSG rule. Knowing that a 0.0.0.0/0 UDR puts your firewall in the path of all Azure service traffic before applying that route table. Checking Private Endpoint network policies before writing the first NSG rule for a PE subnet. These are the things that turn reactive incident investigation into proactive design decisions.

Azure networking is not inherently complex. But it rewards engineers who take the time to understand what the defaults are doing, and why.

Azure Subnet Delegation: The Three Words That Break Deployments

I’ve been working with a customer who wants to migrate from Azure SQL Database to Azure SQL Managed Instance. It was the right choice for them – they want to manage multiple databases, so moving away from the DTU model, combined with the cost of running each database independently, made this a simple choice.

So, let’s go set up the deployment. We’ll just deploy it into the same subnet, as it will make life easier during the migration phase…

And it failed. So like most teams would, everyone went looking in the usual places:

  • Was it the NSG?
  • Was it the route table?
  • Was there an address space overlap somewhere?
  • Had DNS been configured incorrectly?
  • Was there some hidden policy assignment blocking the deployment?

The problem was three words in the Azure documentation that nobody on the team had flagged: requires subnet delegation.

And that was it. A deployment failure caused by something that takes about ninety seconds to fix when you know what you’re looking for.

The frustrating part is not that subnet delegation exists. In fairness, Azure has good reasons for it. The frustrating part is that it often surfaces as a deployment failure that sends you in entirely the wrong direction first.

Terminal Provisioning State. The Azure error for everything that tells you nothing…

This post is about what subnet delegation actually is, why it breaks deployments in ways that are surprisingly difficult to diagnose, and — more importantly — how to make sure it never catches you out again.

What is Subnet Delegation?

At the simplest level, subnet delegation is Azure’s way of saying:

this subnet belongs to this service now.

Not in the sense that you lose visibility of it. Not in the sense that you cannot still apply controls around it. But in the sense that a particular Azure service needs permission to configure aspects of that subnet in order to function properly.

The reason for this is straightforward. Some services need to apply their own network policies, routing rules, and management plane configurations to the subnet. They can’t do that reliably if other resources are competing for the same address space. So Azure introduces the concept of delegation: the subnet is formally assigned to a specific service, and that service becomes the owner.

A delegated subnet belongs to one service. That’s it. No virtual machines, no load balancers, no other PaaS services sharing the space alongside it. The subnet is reserved for the delegated service, not just partially occupied by it.

What Happens When You Get It Wrong

The moment you try to place another resource into a delegated subnet — or deploy a service that requires delegation into a subnet that hasn’t been configured for it — the deployment fails.

And Azure’s error messaging in these situations is not always helpful.

What you typically get is a generic deployment failure (I mean, what the hell does “terminal provisioning state” mean anyway???). The portal or CLI may or may not surface an error that points you at the resource configuration, and the natural instinct is to start checking the things you know: NSG rules, route tables, address space availability. These are the usual suspects in VNet troubleshooting. You work through them methodically and find nothing wrong — because nothing is wrong with them.

What you don’t immediately think to check is the delegation tab on the subnet properties. Why would you? In most VNet troubleshooting scenarios, subnet delegation never comes up. For architects who spend most of their time working with IaaS workloads, it’s simply not part of the mental checklist.

Which Services Require It?

Subnet delegation isn’t a niche requirement for obscure services. It applies to some of the most commonly deployed PaaS workloads in enterprise Azure environments.

At this point, it’s important to distinguish between “dedicated” and “delegated” subnets. Some Azure services, such as Bastion, Firewall and VNet Gateway, have specific naming and sizing requirements for the subnets they live in, which means they are dedicated.

I’ve tried to summarize in the table below the services that need dedicated and delegated subnets. The list may not be exhaustive – I can’t find a single source of reference on Microsoft Learn or GitHub that shows which services require delegation. So buddy Copilot may have helped with compiling this list…

Azure service | Requirement | Notes

Compute & containers
Azure Kubernetes Service (AKS) — kubenet | Dedicated | Node and pod CIDRs consume the entire subnet; mixing breaks routing
AKS — Azure CNI | Dedicated | Each pod gets a VNet IP; subnet exhaustion risk with shared use
Azure Container Instances (ACI) | Delegation | Delegate to Microsoft.ContainerInstance/containerGroups
Azure App Service / Function App (VNet Integration) | Delegation | Delegate to Microsoft.Web/serverFarms; /26 or larger recommended
Azure Batch (simplified node communication) | Delegation | Delegate to Microsoft.Batch/batchAccounts

Networking & gateways
Azure VPN Gateway | Dedicated | Subnet must be named GatewaySubnet
Azure ExpressRoute Gateway | Dedicated | Also uses GatewaySubnet; can co-exist with VPN Gateway in same subnet
Azure Application Gateway v1/v2 | Dedicated | Subnet must contain only Application Gateway instances
Azure Firewall | Dedicated | Subnet must be named AzureFirewallSubnet; /26 minimum
Azure Firewall Management | Dedicated | Requires separate AzureFirewallManagementSubnet; /26 minimum
Azure Bastion | Dedicated | Subnet must be named AzureBastionSubnet; /26 minimum
Azure Route Server | Dedicated | Subnet must be named RouteServerSubnet; /27 minimum
Azure NAT Gateway | Delegation | Associated via subnet property, not a formal delegation; can share subnet
Azure API Management (internal/external VNet mode) | Dedicated | Recommended dedicated; NSG and UDR requirements make sharing impractical

Databases & analytics
Azure SQL Managed Instance | Both | Dedicated subnet + delegate to Microsoft.Sql/managedInstances; /27 minimum
Azure Database for MySQL Flexible Server | Both | Dedicated subnet + delegate to Microsoft.DBforMySQL/flexibleServers
Azure Database for PostgreSQL Flexible Server | Both | Dedicated subnet + delegate to Microsoft.DBforPostgreSQL/flexibleServers
Azure Cosmos DB (managed private endpoint) | Delegation | Delegate to Microsoft.AzureCosmosDB/clusters for dedicated gateway
Azure HDInsight | Dedicated | Complex NSG rules make sharing unsafe; dedicated strongly recommended
Azure Databricks (VNet injection) | Both | Two dedicated subnets (public + private); delegate both to Microsoft.Databricks/workspaces
Azure Synapse Analytics (managed VNet) | Delegation | Delegate to Microsoft.Synapse/workspaces

Integration & security
Azure Logic Apps (Standard, VNet Integration) | Delegation | Delegate to Microsoft.Web/serverFarms; same as App Service
Azure API Management (Premium, VNet injected) | Dedicated | One subnet per deployment region; /29 or larger
Azure NetApp Files | Both | Dedicated subnet + delegate to Microsoft.Netapp/volumes; /28 minimum
Azure Machine Learning compute clusters | Dedicated | Dedicated subnet recommended to isolate training workloads
Azure Spring Apps | Both | Two dedicated subnets (service runtime + apps); delegate to Microsoft.AppPlatform/Spring

If you’re building Landing Zones for enterprise workloads, you will encounter a significant number of these. Quite possibly in the same deployment cycle.

It’s also worth noting that Microsoft surfaces the delegation identifier strings (like Microsoft.Sql/managedInstances) in the portal when you configure a subnet — but only once you know to look there. These identifiers are also what you’ll specify in your IaC templates, so knowing the right string for each service before you deploy is part of the preparation work.
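Since there is no single authoritative list, it can be worth keeping the identifier strings in your own repo and failing fast when one is missing. A minimal Python sketch using identifiers from the table above (the short service keys are made-up shorthand, not Azure names):

```python
# Delegation identifier strings for services discussed above. Keeping a map
# like this in your IaC repo gives templates a single source of truth.
# The dictionary keys are hypothetical shorthand; the values are the
# delegation strings Azure expects.
DELEGATIONS = {
    "sql-managed-instance": "Microsoft.Sql/managedInstances",
    "container-instances":  "Microsoft.ContainerInstance/containerGroups",
    "app-service-vnet":     "Microsoft.Web/serverFarms",
    "mysql-flexible":       "Microsoft.DBforMySQL/flexibleServers",
    "postgres-flexible":    "Microsoft.DBforPostgreSQL/flexibleServers",
    "netapp-files":         "Microsoft.Netapp/volumes",
}

def required_delegation(service: str) -> str:
    """Look up the delegation string, failing at plan time not deploy time."""
    try:
        return DELEGATIONS[service]
    except KeyError:
        raise ValueError(f"No delegation identifier recorded for '{service}' "
                         "- check the service docs before deploying")

print(required_delegation("sql-managed-instance"))
# → Microsoft.Sql/managedInstances
```

Wiring a lookup like this into a pre-deployment check turns the ninety-second fix into a zero-second one.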

Why This Catches Architects Out

There’s a pattern worth naming here, because it’s the reason this catches people who really should know better — including architects who’ve been working in Azure for years.

When you build a Solution on Azure or any other cloud platform, you make a lot of network design decisions up front: address space, subnets, NSGs, route tables, peerings, DNS. These decisions form a mental model of the network, and that model tends to stay fairly stable once the design is locked.

Subnet delegation is easy to miss in that process because it isn’t a networking concept in the traditional sense. You’re not configuring routing, access control, or address space. You’re assigning ownership of a subnet to a service. That’s a different kind of decision, and it lives in a different part of the portal to everything else you’re configuring.

During a deployment, when the pressure is on and the clock is running, nobody goes back to check the delegation tab unless they already know delegation is the issue. And you only know delegation is the issue once you’ve already ruled out everything else.

What the Fix Actually Looks Like

Once you know what you’re looking for, the resolution is straightforward.

In the Azure Portal, navigate to the subnet, open the delegation settings, and assign the appropriate service. The delegation options available correspond to the services that support or require it — you select the right one, save, and retry the deployment.

That’s it. That’s the ninety-second fix.

In Terraform, it looks like this:

delegation {
  name = "sql-managed-instance-delegation"

  service_delegation {
    name = "Microsoft.Sql/managedInstances"
    actions = [
      "Microsoft.Network/virtualNetworks/subnets/join/action",
      "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
      "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"
    ]
  }
}

If you’re deploying via infrastructure-as-code — which you should be for any Landing Zone work — delegation needs to be defined in the subnet configuration from the start, not added reactively when a deployment fails.

Conclusion

Subnet delegation is a small concept with an outsized potential to cause problems during deployments and migrations. The key points:

  • Some PaaS services require exclusive control over a dedicated, delegated subnet
  • Deployment failures caused by missing delegation are poorly surfaced by Azure’s error messaging, which means diagnosis takes much longer than the fix
  • The services that require delegation include SQL Managed Instance, Container Instances, Databricks, Container Apps, and NetApp Files — these are common enterprise workloads, not edge cases
  • The fix, once identified, takes about ninety seconds
  • The right response is to make delegation a first-class design consideration in your subnet inventory and Landing Zone documentation

And if you haven’t hit this yet: now you’ll know what you’re looking at before the next deployment.

“Why a Landing Zone?”: How to avoid Azure sprawl from day 1 (and still move fast)

A Landing Zone is never the first thought when a project starts. When the pressure is on to deliver something fast in Azure (or any other cloud environment), the simplest path looks like this:

  • Create a subscription
  • Throw resources into a few Resource Groups
  • Build a VNet (or two)
  • Add some NSGs
  • Ship it

It’s a good approach… for a Proof of Concept.

Here’s the problem though: POCs keep going and turn into production environments. Because “we need to go fast…”.

What begins as speed often turns into sprawl, and this isn’t a problem until 30/60/180 days later, when you’ve got multiple teams, multiple environments, and everyone has been “needing to go fast”. And it all originated from that first POC…

This post is about the pain points that appear when you skip foundations, and more importantly, how you can avoid them from day 1, using the Azure Landing Zone reference architectures as your guardrails and your blueprint.


This is always how it starts….

The business says:

“We need this workload live in Azure quickly.”

The delivery team says:

“No problem. We’ll deploy the services into a Resource Group, lock down the VNet with NSGs, and we’ll worry about the platform stuff later.”

Ops and Security quietly panic (or get thrown out the window…), but everyone’s under pressure, so you crack on.

At this point nobody is trying to build a mess. Everyone is “trying” to do the right thing. But the POC you build in those early days has a habit of becoming “the environment” — the one you’re still using a year later, except now it’s full of exceptions, one-off decisions, and “temporary” fixes that never got undone.


The myth: “Resource Groups + VNets + NSGs = foundation”

Resource Groups are useful. VNets are essential. NSGs absolutely have their place.

But if your “platform strategy” starts and ends there, you haven’t built a foundation — you’ve built a starting configuration.

Azure Landing Zones exist to give you that repeatable foundation: a scalable, modular architecture with consistent controls that can be applied across subscriptions as you grow.


The pain points that show up after the first few workloads

1) Governance drift (a.k.a. “every team invents their own standards”)

You start with one naming convention. Then a second team arrives and uses something else. Tags are optional, so they’re inconsistent. Ownership becomes unclear. Cost reporting turns into detective work.

Then you try to introduce standards later and discover:

  • Hundreds of resources without tags
  • Naming patterns that can’t be fixed without redeploying and breaking things
  • “Environment” means different things depending on who you ask

The best time to enforce consistency is before you have 500 things deployed. Landing Zones bring governance forward. Not as a blocker, but as a baseline: policies, conventions, and scopes that make growth predictable.


2) RBAC sprawl (“temporary Owner” becomes permanent risk)

If you’ve ever inherited an Azure estate, environments tend to have patterns like:

  • “Give them Owner, we’ll tighten it later.”
  • “Add this service principal as Contributor everywhere just to get the pipeline working.”
  • “We need to unblock the vendor… give them access for now.”

Fast-forward a few months and you have:

  • Too many people with too much privilege
  • No clean separation between platform access and workload access
  • Audits and access reviews that are painful and slow

This is where Landing Zones help in a very simple way. The platform team owns the platform. Workload teams own their workloads. And the boundaries are designed into the management group and subscription model, not “managed” by tribal knowledge.


3) Network entropy (“just one more VNet”)

Networking is where improvisation becomes expensive. It starts with:

  • a VNet for the first app
  • a second VNet for the next one
  • a peering here
  • another peering there
  • and then one day someone asks: “What can talk to what?”

And nobody can answer confidently without opening a diagram that looks like spaghetti.

The Azure guidance here is very clear: adopt a deliberate topology (commonly hub-and-spoke) so you centralise shared services, inspection, and connectivity patterns.


4) Subscription blast radius (“one subscription becomes the junk drawer”)

This is one of the biggest “resource group isn’t enough” realities. Resource Groups are not strong boundaries for:

  • quotas and limits
  • policy scope management at scale
  • RBAC complexity
  • cost separation across teams/products
  • incident and breach containment

When everything lives in one subscription, one bad decision has a very wide blast radius. Landing Zones push you toward using subscriptions as a unit of scale, and setting up management groups so you can apply guardrails consistently across them.


So what is a Landing Zone, practically?

In a nutshell, a Landing Zone is the foundation for everything you will do in your cloud estate in future.

The platform team builds a standard, secure, repeatable environment. Application teams ship fast on top of it, without having to re-invent governance, networking, and security every time.

The Azure Landing Zone reference architecture is opinionated for a reason — it gives you a proven starting point that you tailor to your needs.

And it’s typically structured into two layers:

Image Credit: Microsoft

Platform landing zone

Shared services and controls, such as:

  • identity and access foundations
  • connectivity patterns
  • management and monitoring
  • security baselines

Application landing zones

Workload subscriptions where teams deploy their apps and services — with autonomy inside guardrails.

This separation is the secret sauce. The platform stays boring and consistent. The workloads move fast.


Avoiding sprawl from day 1: a simple blueprint

If you want the practical “do this first” guidance, here it is.

1) Don’t freestyle: use the design areas as your checklist

Microsoft’s Cloud Adoption Framework breaks landing zone design into clear design areas. Treat these as your “day-1 decisions” checklist.

Even if you don’t implement everything on day 1, you should decide:

  • Identity and access: who owns what, where privilege lives
  • Resource organisation: management group hierarchy and subscription model
  • Network topology: hub-and-spoke / vWAN direction, IP plan, connectivity strategy
  • Governance: policies, standards, and scope
  • Management: logging, monitoring, operational ownership

The common failure mode is building workloads first, then trying to reverse-engineer these decisions later.


2) Make subscriptions your unit of scale (and stop treating “one sub” as a platform)

If you want to avoid a single subscription becoming a dumping ground, you need a repeatable way to create new workload subscriptions with the right baseline baked in.

This is where subscription vending comes in.

Subscription vending is basically: “new workload subscriptions are created in a consistent, governed way” — with baseline policies, RBAC, logging hooks, and network integration applied as part of the process.

If you can’t create a new compliant subscription easily, you will end up reusing the first one forever… and that’s how sprawl wins.


3) Choose a network pattern early (then standardise it)

Most of the time, the early win is adopting hub-and-spoke:

  • spokes for workloads
  • a hub for shared services and central control
  • consistent ingress/egress and inspection patterns

The point isn’t that hub-and-spoke is “cool” – it gives you a consistent story for connectivity and control.


4) Guardrails that don’t kill speed

This is where people get nervous. They hear “Landing Zone” and think bureaucracy. But guardrails are only slow when they’re manual. Good guardrails are automated and predictable, like:

  • policy baselines for common requirements
  • naming/tagging standards that are enforced early
  • RBAC patterns that avoid “Owner everywhere”
  • logging and diagnostics expectations so ops isn’t blind

This is how you enable teams to move quickly without turning your subscription into a free-for-all.


How can you actually implement this?

Don’t build it from scratch. Use the Azure Landing Zone reference architecture as your baseline, then implement via an established approach (and put it in version control from the start). The landing zone architecture is designed to be modular for exactly this reason: you can start small and evolve without redesigning everything.

Treat it like a product:

  • define what a “new workload environment” looks like
  • automate the deployment of that baseline
  • iterate over time

The goal is not to build the perfect enterprise platform on day 1; it's to build something that won't collapse under its own weight when you scale.


A “tomorrow morning” checklist

If you’re reading this and thinking “right, what do I actually do next?”, here are four actions that deliver disproportionate value:

  1. Decide your management group + subscription strategy
  2. Pick your network topology (and standardise it)
  3. Define day-1 guardrails (policy baseline, RBAC patterns, naming/tags, logging hooks)
  4. Set up subscription vending so new workloads start compliant by default

Do those four things, and you’ll avoid the worst kind of Azure sprawl before it starts.


Conclusion

Skipping a Landing Zone might feel like a quick win today.

But if you know the workload is going to grow — more teams, more environments, more services, more scrutiny — then the question isn’t “do we need a landing zone?”

The question is: do we want to pay for foundations now… or pay a lot more later when we (inevitably) lose control?

Hope you enjoyed this post – this is my contribution to this year's Azure Spring Clean event organised by Joe Carlyle and Thomas Thornton. Check out the full schedule on the website!

AKS Networking – Ingress and Egress Traffic Flow

In the previous post on AKS Networking, we explored the different networking models available in AKS and how IP strategy, node pool scaling, and control plane connectivity shape a production-ready cluster. Now we move from how the cluster is networked to how traffic actually flows through it.

If networking defines the roads, this post is about traffic patterns, checkpoints, and border control. Understanding traffic flow is essential for reliability, security, performance, and compliance. In this post we’ll explore:

  • north–south vs east–west traffic patterns
  • ingress options and when to use each
  • internal-only exposure patterns
  • outbound (egress) control and compliance design
  • how to design predictable and secure traffic flow

Understanding Traffic Patterns in Kubernetes

Before we talk about tools, we need to talk about traffic patterns.

Like the majority of networking you will see in a traditional Hub-and-Spoke architecture, Kubernetes networking is often described using two directional models.

North–South Traffic

North–south traffic refers to traffic entering or leaving the cluster, so it can be either ingress (incoming) or egress (outgoing) traffic. Examples include:

Incoming

✔ Users accessing a web app
✔ Mobile apps calling APIs
✔ Partner integrations
✔ External services sending webhooks

Outgoing

✔ Calling SaaS APIs
✔ Accessing external databases
✔ Software updates & dependencies
✔ Payment gateways & third-party services

This traffic crosses trust boundaries and is typically subject to security inspection, routing, and policy enforcement.

East–West Traffic

East–west traffic refers to traffic flowing within the cluster.

Examples include:

  • microservices communicating with each other
  • internal APIs
  • background processing services
  • service mesh traffic

This traffic remains inside the cluster boundary but still requires control and segmentation in production environments.


Ingress: Getting Traffic Into the Cluster

Ingress defines how external clients reach services running inside AKS.

Image Credit: Microsoft

At its simplest, Kubernetes can expose services using a LoadBalancer service type. In production environments, however, ingress controllers provide richer routing, security, and observability capabilities.
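For reference, here's what that simplest approach looks like – a minimal sketch of a LoadBalancer-type Service (the service name, labels, and ports are hypothetical):

```yaml
# Hypothetical example: the simplest exposure model in Kubernetes.
# Azure provisions a load balancer frontend for this Service automatically.
apiVersion: v1
kind: Service
metadata:
  name: web-app          # hypothetical name
spec:
  type: LoadBalancer
  selector:
    app: web-app         # matches the pods to expose
  ports:
    - port: 80           # port exposed on the load balancer
      targetPort: 8080   # port the container listens on
```

This works, but gives you one IP per service and no Layer 7 routing – which is exactly the gap ingress controllers fill.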

Choosing the right ingress approach is one of the most important architectural decisions for external traffic.


Azure Application Gateway + AGIC

Azure Application Gateway with the Application Gateway Ingress Controller (AGIC) provides a native Azure Layer 7 ingress solution.

Image Credit: Microsoft

Application Gateway sits outside the cluster and acts as the HTTP/S entry point. AGIC runs inside AKS and dynamically configures routing based on Kubernetes ingress resources.
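As a sketch, an ingress resource that AGIC would pick up might look like this (the class name assumes a default AGIC install; the hostname, service name, and annotation are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress                                   # hypothetical name
  annotations:
    appgw.ingress.kubernetes.io/ssl-redirect: "true"  # redirect HTTP to HTTPS at the gateway
spec:
  ingressClassName: azure-application-gateway         # class installed by AGIC
  rules:
    - host: app.contoso.com                           # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app                         # hypothetical backend Service
                port:
                  number: 80
```

AGIC watches resources like this and translates them into Application Gateway listeners, rules, and backend pools.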

Why teams choose it

This approach integrates tightly with Azure networking and security capabilities. It enables Web Application Firewall (WAF) protection, TLS termination, path-based routing, and autoscaling.

Because Application Gateway lives in the VNet, it aligns naturally with enterprise security architectures and centralised inspection requirements.

Trade-offs

Application Gateway introduces an additional Azure resource to manage and incurs additional cost. It is also primarily designed for HTTP/S workloads.

For enterprise, security-sensitive, or internet-facing workloads, it is often the preferred choice.


Application Gateway for Containers

Application Gateway for Containers is a newer Azure-native ingress option designed specifically for Kubernetes environments. It's the natural successor to the traditional Application Gateway + AGIC model.

Image Credit: Microsoft

It integrates directly with Azure networking constructs while remaining highly performant and scalable for container-based workloads.

In practical terms, this approach allows Kubernetes resources to directly define how Application Gateway for Containers routes traffic, while Azure manages the underlying infrastructure and scaling behaviour.

Why teams choose it

Application Gateway for Containers is chosen when teams want the security and enterprise integration of Azure Application Gateway but with tighter alignment to Kubernetes-native APIs.

Because it uses the Gateway API instead of traditional ingress resources, it offers a more expressive and modern way to define traffic routing policies. This is particularly attractive for platform teams building shared Kubernetes environments where traffic routing policies need to be consistent and reusable.
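To illustrate the Gateway API model, here is a hedged sketch of an HTTPRoute (the Gateway and service names are hypothetical):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-route               # hypothetical name
spec:
  parentRefs:
    - name: agc-gateway         # hypothetical Gateway backed by Application Gateway for Containers
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api         # route all /api traffic
      backendRefs:
        - name: api-service     # hypothetical backend Service
          port: 8080
```

Notice the separation of concerns: the platform team owns the Gateway, while application teams attach routes to it – the reusability the paragraph above describes.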

Application Gateway for Containers also provides strong integration with Azure networking, private connectivity, and Web Application Firewall capabilities while improving performance compared to earlier ingress-controller models.

Trade-offs

As a newer offering, Application Gateway for Containers may require teams to become familiar with the Kubernetes Gateway API and its resource model.

There is also an additional Azure-managed infrastructure layer involved, which introduces cost considerations similar to the traditional Application Gateway approach.

However, for organisations building modern AKS platforms, Application Gateway for Containers represents a forward-looking ingress architecture that aligns closely with Kubernetes networking standards.

Jack Stromberg has written an extensive post on the functionality of AGC and the migration paths from AGIC and Ingress – check it out here


NGINX Ingress Controller

The NGINX Ingress Controller is one of the most widely used ingress solutions in Kubernetes. It runs as pods inside the cluster and provides highly flexible routing, TLS handling, and traffic management capabilities.

Image Credit: Microsoft

And it's retiring… well, at least the managed version is.

Microsoft is retiring the managed NGINX ingress offering delivered through the Application Routing add-on, with support ending in November 2026, following the deprecation of the upstream Ingress-NGINX project.

However, you still have the option to run your own NGINX ingress controller inside the cluster. That means more management overhead, but you retain full control over versioning and configuration.

Why teams choose it

NGINX provides fine-grained routing control and is cloud-agnostic. Teams with existing Kubernetes experience often prefer its flexibility and maturity.

It supports advanced routing patterns, rate limiting, and traffic shaping, making it suitable for complex application architectures.
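As an illustration of that flexibility, here's a sketch of an ingress resource using a couple of ingress-nginx annotations (the names and limits are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress                               # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"   # rate-limit each client IP to 10 requests/second
    nginx.ingress.kubernetes.io/rewrite-target: / # strip the /api prefix before forwarding
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service                 # hypothetical backend Service
                port:
                  number: 80
```

Annotation-driven behaviour like this is both NGINX's strength (fast to apply) and its weakness (easy to sprawl across teams without standards).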

Trade-offs

Because NGINX runs inside the cluster, you are responsible for scaling, availability, and lifecycle management. Security features such as WAF capabilities require additional configuration or integrations.

NGINX is ideal when flexibility and portability outweigh tight platform integration.


Istio Ingress Gateway

The final ingress approach to cover is the Istio Ingress Gateway, typically deployed as part of a broader service mesh architecture.

When using Istio on AKS, the ingress gateway acts as the entry point for traffic entering the service mesh. It is built on the Envoy proxy and integrates tightly with Istio’s traffic management, security, and observability features.

Rather than acting purely as a simple edge router, the Istio ingress gateway becomes part of the overall service mesh control model. This means that external traffic entering the cluster can be governed by the same policies that control internal service-to-service communication.

Why teams choose it

Teams typically adopt the Istio ingress gateway when they are already using — or planning to use — a service mesh.

One of the main advantages is advanced traffic management. Istio enables sophisticated routing capabilities such as weighted routing, canary deployments, A/B testing, and header-based routing. These patterns are extremely useful in microservice architectures where controlled rollout strategies are required.

Another major benefit is built-in security capabilities. Istio can enforce mutual TLS (mTLS) between services, allowing ingress traffic to integrate directly into a zero-trust communication model across the cluster.

Istio also provides strong observability through integrated telemetry, tracing, and metrics. Because Envoy proxies sit on the traffic path, detailed insight into request flows becomes available without modifying application code.

For platform teams building large-scale internal platforms, these capabilities allow ingress traffic to participate fully in the platform’s traffic policy, security posture, and monitoring framework.

Trade-offs

Istio comes with additional operational complexity. Running a service mesh introduces additional control plane components and sidecar proxies that consume compute and memory resources.

Clusters using Istio typically require careful node pool sizing and resource planning to ensure the mesh infrastructure itself does not compete with application workloads.

Operationally, teams must also understand additional concepts such as virtual services, destination rules, gateways, and mesh policies.

I’ll dive into more detail on the concept of Service Mesh in a future post.


Internal Ingress Patterns

Many production clusters expose workloads internally using private load balancers and internal ingress controllers.

This pattern is common when:

  • services are consumed only within the VNet
  • private APIs support internal platforms
  • regulatory or security controls restrict public exposure

Internal ingress allows organisations to treat AKS as a private application platform rather than a public web hosting surface.
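On AKS, this pattern is typically driven by service annotations – a sketch with hypothetical names (the subnet annotation is optional):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: internal-api                      # hypothetical name
  annotations:
    # Provision an internal (private) Azure load balancer instead of a public one
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    # Optionally pin the frontend IP to a specific subnet (subnet name is hypothetical)
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "ingress-subnet"
spec:
  type: LoadBalancer
  selector:
    app: internal-api
  ports:
    - port: 443
      targetPort: 8443
```

The same annotation applies to the LoadBalancer Service that fronts an internal ingress controller, which is how most internal ingress deployments are wired up.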


Designing for Ingress Resilience

Ingress controllers are part of the application data path. If ingress fails, applications become unreachable. Production considerations include:

  • running multiple replicas
  • placing ingress pods across availability zones
  • ensuring node pool capacity for scaling
  • monitoring latency and saturation
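In manifest terms, the first two bullets translate to something like this fragment of an ingress controller Deployment spec (the labels are hypothetical):

```yaml
# Fragment of an ingress controller Deployment spec (illustrative)
replicas: 3                                        # multiple replicas so one pod failure is absorbed
template:
  spec:
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone   # spread pods across availability zones
        whenUnsatisfiable: DoNotSchedule           # refuse to schedule rather than stack one zone
        labelSelector:
          matchLabels:
            app: ingress-controller                # hypothetical pod label
```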

East–West Traffic and Microservice Communication

Within the cluster, services communicate using Kubernetes Services and DNS.

This abstraction allows pods to scale, restart, and move without breaking connectivity. In production environments, unrestricted east–west traffic can create security and operational risk.

Network Policies allow you to restrict communication between workloads, enabling microsegmentation inside the cluster. This is a foundational step toward zero-trust networking principles.
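As a sketch, a policy that only allows a hypothetical frontend namespace to reach an API workload might look like this:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api       # hypothetical name
  namespace: api                    # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: api                      # the pods this policy protects
  policyTypes:
    - Ingress                       # once applied, all other ingress to these pods is denied
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend   # only the frontend namespace
      ports:
        - protocol: TCP
          port: 8080
```

Because a policy selecting a pod implicitly denies everything it doesn't allow, policies like this give you default-deny microsegmentation one workload at a time.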

Some organisations also introduce service meshes to provide:

  • mutual TLS between services
  • traffic observability
  • policy enforcement

While not always necessary, these capabilities become valuable in larger or security-sensitive environments.


Egress: Controlling Outbound Traffic

Outbound traffic is often overlooked during early deployments. However, in production environments, controlling egress is critical for security, compliance, and auditability. Workloads frequently need outbound access for:

  • external APIs
  • package repositories
  • identity providers
  • logging and monitoring services

NAT Gateway and Predictable Outbound IP

With the retirement of Default Outbound Access fast approaching, Microsoft's general recommendation is to use Azure NAT Gateway to provide a consistent outbound IP address for cluster traffic.

Image Credit: Microsoft

This is essential when external systems require IP allow-listing. It also improves scalability compared to default outbound methods.
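In a cluster definition, this maps to the outbound type setting – an illustrative ARM/Bicep-style fragment:

```yaml
# Hypothetical fragment of a managed cluster definition (ARM/Bicep-style properties)
networkProfile:
  # Route egress through a NAT Gateway you attach to the node subnet,
  # instead of the default load-balancer outbound rules.
  outboundType: userAssignedNATGateway
  # Alternatively, "managedNATGateway" lets AKS create and manage the NAT Gateway for you.
```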


Azure Firewall and Centralised Egress Control

Many enterprise environments route outbound traffic through Azure Firewall or network virtual appliances. This enables:

  • traffic inspection
  • policy enforcement
  • logging and auditing
  • domain-based filtering
Image Credit: Microsoft

This pattern supports regulatory and compliance requirements while maintaining central control over external connectivity.


Private Endpoints and Service Access

Whenever possible, Azure PaaS services should be accessed via Private Endpoints. This keeps traffic on the Azure backbone network and prevents exposure to the public internet.

Combining private endpoints with controlled egress significantly reduces the attack surface.


Designing Predictable Traffic Flow

Production AKS platforms favour predictability over convenience.

That means:

  • clearly defined ingress entry points
  • controlled internal service communication
  • centralised outbound routing
  • minimal public exposure

This design improves observability, simplifies troubleshooting, and strengthens security posture.


Aligning Traffic Design with the Azure Well-Architected Framework

Operational Excellence improves when traffic flows are observable and predictable.

Reliability depends on resilient ingress and controlled outbound connectivity.

Security is strengthened through restricted exposure, network policies, and controlled egress.

Cost Optimisation improves when traffic routing avoids unnecessary hops and oversized ingress capacity.


What Comes Next

At this point in the series, we have designed:

  • the AKS architecture
  • networking and IP strategy
  • control plane connectivity
  • ingress, egress, and service traffic flow

In the next post, we turn to identity and access control. Because:

  • Networking defines connectivity.
  • Traffic design defines flow.
  • Identity defines trust.

See you in the next post!

AKS Networking – Which model should you choose?

In the previous post, we broke down AKS Architecture Fundamentals — control plane vs data plane, node pools, availability zones, and early production guardrails.

Now we move into one of the most consequential design areas in any AKS deployment:

Networking.

If node pools define where workloads run, networking defines how they communicate — internally, externally, and across environments.

Unlike VM sizes or replica counts, networking decisions are difficult to change later. They shape IP planning, security boundaries, hybrid connectivity, and how your platform evolves over time.

This post takes a look at AKS networking by exploring:

  • The modern networking options available in AKS
  • Trade-offs between Azure CNI Overlay and Azure CNI Node Subnet
  • How networking decisions influence node pool sizing and scaling
  • How the control plane communicates with the data plane

Why Networking in AKS Is Different

With traditional IaaS and PaaS services in Azure, networking is straightforward: a VM or resource gets an IP address in a subnet.

With Kubernetes, things become layered:

  • Nodes have IP addresses
  • Pods have IP addresses
  • Services abstract pod endpoints
  • Ingress controls external access

AKS integrates all of this into an Azure Virtual Network. That means Kubernetes networking decisions directly impact:

  • IP address planning
  • Subnet sizing
  • Security boundaries
  • Peering and hybrid connectivity

In production, networking is not just connectivity — it’s architecture.


The Modern AKS Networking Choices

Although some legacy models are still available, if you deploy an AKS cluster in the Portal you will see that AKS offers two main networking approaches:

  • Azure CNI Node Subnet (flat network model)
  • Azure CNI Overlay (pod overlay networking)

As their names suggest, both use Azure CNI. The difference lies in how pod IP addresses are assigned and routed. Understanding this distinction is essential before you size node pools or define scaling limits.


Azure CNI Node Subnet

This is the traditional Azure CNI model.

Pods receive IP addresses directly from the Azure subnet. From the network’s perspective, pods appear as first-class citizens inside your VNet.

How It Works

Each node consumes IP addresses from the subnet. Each pod scheduled onto that node also consumes an IP from the same subnet. Pods are directly routable across VNets, peered networks, and hybrid connections.

This creates a flat, highly transparent network model.

Why teams choose it

This model aligns naturally with enterprise networking expectations. Security appliances, firewalls, and monitoring tools can see pod IPs directly. Routing is predictable, and hybrid connectivity is straightforward.

If your environment already relies on network inspection, segmentation, or private connectivity, this model integrates cleanly.

Pros

  • Native VNet integration
  • Simple routing and peering
  • Easier integration with existing network appliances
  • Straightforward hybrid connectivity scenarios
  • Cleaner alignment with enterprise security tooling

Cons

  • High IP consumption
  • Requires careful subnet sizing
  • Can exhaust address space quickly in large clusters

Trade-offs to consider

The trade-off is IP consumption. Every pod consumes a VNet IP. In large clusters, address space can be exhausted faster than expected. Subnet sizing must account for:

  • node count
  • maximum pods per node
  • autoscaling limits
  • upgrade surge capacity

This model rewards careful planning and penalises underestimation.

Impact on node pool sizing

With Node Subnet networking, node pool scaling directly consumes IP space.

If a user node pool scales out aggressively and each node supports 30 pods, IP usage grows rapidly. A cluster designed for 100 nodes may require thousands of available IP addresses.

System node pools remain smaller, but they still require headroom for upgrades and system pod scheduling.


Azure CNI Overlay

Azure CNI Overlay is designed to address IP exhaustion challenges while retaining Azure CNI integration.

Pods receive IP addresses from an internal Kubernetes-managed range, not directly from the Azure subnet. Only nodes consume Azure VNet IP addresses.

How It Works

Nodes are addressable within the VNet. Pods use an internal overlay CIDR range. Traffic is routed between nodes, with encapsulation handling pod communication.

From the VNet’s perspective, only nodes consume IP addresses.
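In ARM/Bicep-style terms, an overlay cluster's network profile looks something like this (the pod CIDR is illustrative and must not overlap your VNet ranges):

```yaml
# Hypothetical fragment of a managed cluster definition (ARM/Bicep-style properties)
networkProfile:
  networkPlugin: azure
  networkPluginMode: overlay    # pods get IPs from podCidr, not the Azure subnet
  podCidr: 192.168.0.0/16       # illustrative overlay range; must not overlap VNet address space
```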

Why teams choose it

Overlay networking dramatically reduces pressure on Azure subnet address space. This makes it especially attractive in environments where:

  • IP ranges are constrained
  • multiple clusters share network space
  • growth projections are uncertain

It allows clusters to scale without re-architecting network address ranges.

Pros

  • Significantly lower Azure IP consumption
  • Simpler subnet sizing
  • Useful in environments with constrained IP ranges

Cons

  • More complex routing
  • Less transparent network visibility
  • Additional configuration required for advanced scenarios
  • Not ideal for large-scale enterprise integration

Trade-offs to consider

Overlay networking introduces an additional routing layer. While largely transparent, it can add complexity when integrating with deep packet inspection, advanced network appliances, or highly customised routing scenarios.

For most modern workloads, however, this complexity is manageable and increasingly common.

Impact on node pool sizing

Because pods no longer consume VNet IP addresses, node pool scaling pressure shifts away from subnet size. This provides greater flexibility when designing large user node pools or burst scaling scenarios.

However, node count, autoscaler limits, and upgrade surge requirements still influence subnet sizing.


Choosing Between Overlay and Node Subnet

Here are the "TL;DR" considerations when you need to choose which networking model to use:

  • If deep network visibility, firewall inspection, and hybrid routing transparency are primary drivers, Node Subnet networking remains compelling.
  • If address space constraints, growth flexibility, and cluster density are primary concerns, Overlay networking provides significant advantages.
  • Most organisations adopting AKS at scale are moving toward overlay networking unless specific networking requirements dictate otherwise.

How Networking Impacts Node Pool Design

Let’s connect this back to the last post, where we said that Node pools are not just compute boundaries — they are networking consumption boundaries.

System Node Pools

System node pools:

  • Host core Kubernetes components
  • Require stability more than scale

From a networking perspective:

  • They should be small
  • They should be predictable in IP consumption
  • They must allow for upgrade surge capacity

If using Azure CNI Node Subnet, ensure sufficient IP headroom for control plane-driven scaling operations.

User Node Pools

User node pools are where networking pressure increases. Consider:

  • Maximum pods per node
  • Horizontal Pod Autoscaler behaviour
  • Node autoscaling limits

In Azure CNI Node Subnet environments, every one of those pods consumes an IP. If you design for 100 nodes with 30 pods each, that is 3,000 pod IPs — plus node IPs. Subnet planning must reflect worst-case scale, not average load.

In Azure CNI Overlay environments, the pressure shifts away from Azure subnets — but routing complexity increases.

Either way, node pool design and networking are a single architectural decision, not two separate ones.


Control Plane Networking and Security

One area that is often misunderstood is how the control plane communicates with the data plane, and how administrators securely interact with the cluster.

The Kubernetes API server is the central control surface. Every action — whether from kubectl, CI/CD pipelines, GitOps tooling, or the Azure Portal — ultimately flows through this endpoint.

In AKS, the control plane is managed by Azure and exposed through a secure endpoint. How that endpoint is exposed defines the cluster’s security posture.

Public Cluster Architecture

By default, AKS clusters expose a public API endpoint secured with authentication, TLS, and RBAC.

This does not mean the cluster is open to the internet. Access can be restricted using authorized IP ranges and Azure AD authentication.

Image: Microsoft/Houssem Dellai

Key characteristics:

  • API endpoint is internet-accessible but secured
  • Access can be restricted via authorized IP ranges
  • Nodes communicate outbound to the control plane
  • No inbound connectivity to nodes is required

This model is common in smaller environments or where operational simplicity is preferred.
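Restricting the public endpoint maps to the API server access profile – an illustrative fragment (the IP range is hypothetical):

```yaml
# Hypothetical fragment of a managed cluster definition (ARM/Bicep-style properties)
apiServerAccessProfile:
  authorizedIPRanges:
    - 203.0.113.0/24    # hypothetical corporate egress range; all other sources are blocked
```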

Private Cluster Architecture

In a private AKS cluster, the API server is exposed via a private endpoint inside your VNet.

Image: Microsoft/Houssem Dellai

Administrative access requires private connectivity such as:

  • VPN
  • ExpressRoute
  • Azure Bastion or jump hosts

Key characteristics:

  • API server is not exposed to the public internet
  • Access is restricted to private networks
  • Reduced attack surface
  • Preferred for regulated or enterprise environments
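In cluster definition terms, this is again the API server access profile – an illustrative fragment:

```yaml
# Hypothetical fragment of a managed cluster definition (ARM/Bicep-style properties)
apiServerAccessProfile:
  enablePrivateCluster: true   # API server exposed via a private endpoint in the VNet
  privateDNSZone: system       # let AKS create and manage the private DNS zone
```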

Control Plane to Data Plane Communication

Regardless of public or private mode, communication between the control plane and the nodes follows the same secure pattern.

The kubelet running on each node establishes an outbound, mutually authenticated connection to the API server.

This design has important security implications:

  • Nodes do not require inbound internet exposure
  • Firewall rules can enforce outbound-only communication
  • Control plane connectivity remains encrypted and authenticated

This outbound-only model is a key reason AKS clusters can operate securely inside tightly controlled network environments.

Common Networking Pitfalls in AKS

Networking issues rarely appear during initial deployment. They surface later when scaling, integrating, or securing the platform. Typical pitfalls include:

  • subnets sized for today rather than future growth
  • no IP headroom for node surge during upgrades
  • lack of outbound traffic control
  • exposing the API server publicly without restrictions



Aligning Networking with the Azure Well-Architected Framework

  • Operational Excellence improves when networking is designed for observability, integration, and predictable growth.
  • Reliability depends on zone-aware node pools, resilient ingress, and stable outbound connectivity.
  • Security is strengthened through private clusters, controlled egress, and network policy enforcement.
  • Cost Optimisation emerges from correct IP planning, right-sized ingress capacity, and avoiding rework caused by subnet exhaustion.

Making the right (or wrong) networking decisions in the design phase has an effect across each of these pillars.


What Comes Next

At this point in the series, we now understand:

  • Why Kubernetes exists
  • How AKS architecture is structured
  • How networking choices shape production readiness

In the next post, we’ll stay on the networking theme and take a look at Ingress and Egress traffic flows. See you then!

AKS Architecture Fundamentals

In the previous post From Containers to Kubernetes Architecture, we walked through the evolution from client/server to containers, and from Docker to Kubernetes. We looked at how orchestration became necessary once we stopped deploying single applications to single servers.

Now it’s time to move from history to design, and in this post we’re going to dive into the practical by focusing on:

How Azure Kubernetes Service (AKS) is actually structured — and what architectural decisions matter from day one.


Control Plane vs Data Plane – The First Architectural Boundary

In line with the core design of a vanilla Kubernetes cluster, every AKS cluster is split into two logical areas:

  • The control plane (managed by Azure)
  • The data plane (managed by you)
Image Credit – Microsoft

We looked at this in the last post, but let's remind ourselves of the components that make up each area.

The Control Plane (Azure Managed)

When you create an AKS cluster, you do not deploy your own API server or etcd database. Microsoft runs the Kubernetes control plane for you.

That includes:

  • The Kubernetes API server
  • etcd (the cluster state store)
  • The scheduler
  • The controller manager
  • Control plane patching and upgrades

This is not just convenience — it is risk reduction. Operating a highly available Kubernetes control plane is non‑trivial. It requires careful configuration, backup strategies, certificate management, and upgrade sequencing.

In AKS, that responsibility shifts to Azure. You interact with the cluster via the Kubernetes API (through kubectl, CI/CD pipelines, GitOps tools, or the Azure Portal), but you are not responsible for keeping the brain of the cluster alive.

That abstraction directly supports:

  • Operational Excellence
  • Reduced blast radius
  • Consistent lifecycle management

It also means development teams can start their delivery cycles earlier in a project, rather than waiting for operations to stand up a functionally ready control plane.

The Data Plane (Customer Managed)

The data plane is where your workloads run. This consists of:

  • Virtual machine nodes
  • Node pools
  • Pods and workloads
  • Networking configuration

You choose:

  • VM SKU
  • Scaling behaviour
  • Availability zones
  • OS configuration (within supported boundaries)

This is intentional. Azure abstracts complexity where it makes sense, but retains control and flexibility where architecture matters.


Node Pools – Designing for Isolation and Scale

One of the most important AKS concepts is the node pool. A node pool is a group of VMs with the same configuration. At first glance, it may look like a scaling convenience feature. In production, it is an isolation and governance boundary.

There are two types of node pool – System and User.

System Node Pool

Every AKS cluster requires at least one system node pool, which is a specialized group of nodes dedicated to hosting critical cluster components. While you can run application pods on them, their primary role is ensuring the stability of core services.

This pool runs:

  • Core Kubernetes components
  • Critical system pods

In production, this pool should be:

  • Small but resilient
  • Dedicated to system workloads
  • Not used for business application pods

In our first post, we took the default “Node Pool” option – however you do have the option to add a dedicated system node pool:

It is recommended that you create a dedicated system node pool to isolate critical system pods from your application pods to prevent misconfigured or rogue application pods from accidentally deleting system pods.

You don't need the system node pool SKU size to be the same as your user node pool SKU size – however it's recommended that you make both highly available.
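In ARM/Bicep-style terms, a dedicated system pool alongside a user pool might be sketched like this (names, counts, and zones are illustrative; CriticalAddonsOnly is the taint AKS documents for keeping application pods off system nodes):

```yaml
# Hypothetical fragment of a managed cluster definition (ARM/Bicep-style properties)
agentPoolProfiles:
  - name: system
    mode: System
    count: 3
    availabilityZones: ["1", "2", "3"]
    nodeTaints:
      - CriticalAddonsOnly=true:NoSchedule   # only pods tolerating this taint can schedule here
  - name: user1
    mode: User
    count: 3
    availabilityZones: ["1", "2", "3"]       # application workloads run here
```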

User Node Pools

User node pools are where your applications and workloads run. You can create multiple pools for different purposes within the same AKS cluster:

  • Compute‑intensive workloads
  • GPU workloads
  • Batch jobs
  • Isolated environments

Looking at the list, traditionally these workloads would have lived on their own dedicated hardware or processing areas. The advantage of AKS is that this enables:

  • Better scheduling control – the scheduler can control the assignment of resources to workloads in node pools.
  • Resource isolation – resources are isolated in their own node pools.
  • Cost optimisation – all of this runs on the same set of cluster VMs, so cost is predictable and stable.

In production, multiple node pools are not optional — they are architectural guardrails.


Regional Design and Availability Zones

Like all Azure resources, when you create an AKS cluster you choose a region. That decision impacts latency, compliance, resilience, and cost.

But production architecture requires a deeper question:

How does this cluster handle failure?

Azure supports availability zones in many regions. A production AKS cluster should at a bare minimum:

  • Use zone‑aware node pools
  • Distribute nodes across multiple availability zones

This ensures that a single data centre failure does not bring down your workloads. It’s important to understand that:

  • The control plane is managed by Azure for high availability
  • You are responsible for ensuring node pool zone distribution

Availability is shared responsibility.


Networking Considerations (At a High Level)

AKS integrates into an Azure Virtual Network. That means your cluster:

  • Has IP address planning implications
  • Participates in your broader network topology
  • Must align with security boundaries

Production mistakes often start here:

  • Overlapping address spaces
  • Under‑sized subnets
  • No separation between environments

Networking is not a post‑deployment tweak – it’s a day‑one design decision that you make with your wider cloud and architecture teams. We’ll go deep into networking in the next post, but even at the architecture stage, you need to make conscious choices.
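Under-sized subnets are easy to check for up front. The sketch below uses Python’s `ipaddress` module; the node count and pods-per-node figures are illustrative assumptions, and the only Azure-specific fact used is that Azure reserves 5 addresses in every subnet:

```python
import ipaddress

# Rough capacity check for a cluster whose pods draw IPs from the VNet
# subnet (Azure CNI node subnet mode). Azure reserves 5 addresses per
# subnet; node count and pods-per-node below are assumptions for the sketch.

subnet = ipaddress.ip_network("10.240.0.0/24")
usable = subnet.num_addresses - 5           # 256 - 5 = 251 usable IPs

nodes, max_pods_per_node = 10, 30
needed = nodes + nodes * max_pods_per_node  # node NICs + pod IPs = 310

print(usable, needed, usable >= needed)     # the /24 is too small
```

Running a check like this before deployment is far cheaper than re-addressing a live cluster.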


Upgrade Strategy – Often Ignored, Always Critical

Kubernetes evolves quickly. AKS supports multiple Kubernetes versions, but older versions are eventually deprecated. The full list of supported AKS versions can be found at the AKS Release Status page.

A production architecture must consider:

  • Version lifecycle
  • Upgrade cadence
  • Node image updates

In AKS, control plane upgrades are managed by Azure — but you control when upgrades occur and how node pools are rolled. When you create your cluster, you can specify the upgrade option you wish to use:

It’s important to pay attention to this as it may affect your workloads. For example, starting in Kubernetes v1.35, Ubuntu 24.04 became the default OS SKU. What this means is that if you are upgrading from a lower version of Kubernetes, your node OS will be auto-upgraded from Ubuntu 22.04 to 24.04.
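A simple pre-flight check can flag whether a planned upgrade crosses a version boundary like this. The sketch below is hedged: the v1.35 threshold is taken from the note above as an assumption, and the helper names are invented for illustration:

```python
# Hedged sketch: does an AKS upgrade cross the (assumed) v1.35 boundary
# at which the default node OS moves to Ubuntu 24.04?

UBUNTU_2404_DEFAULT_FROM = (1, 35)  # assumption, per the note above

def parse(version: str) -> tuple[int, int]:
    """Parse 'v1.34' or '1.34.2' into a comparable (major, minor) tuple."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)

def os_changes(current: str, target: str) -> bool:
    """True if the upgrade moves from below the boundary to at/above it."""
    return parse(current) < UBUNTU_2404_DEFAULT_FROM <= parse(target)

print(os_changes("1.34", "1.35"))  # crosses the boundary -> expect OS change
print(os_changes("1.35", "1.36"))  # already on the new default
```

Wiring a check like this into your upgrade runbook is one way to make “test in lower environments first” an enforced step rather than a reminder.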

Ignoring upgrade planning is one of the fastest ways to create technical debt in a cluster. This is why testing in lower environments to see how your workloads react to these upgrades is vital.


Mapping This to the Azure Well‑Architected Framework

Let’s anchor this back to the bigger picture and see how AKS maps to the Azure Well-Architected Framework.

Operational Excellence

  • AKS’s managed control plane reduces operational complexity.
  • Node pools introduce structured isolation.
  • Availability zones improve resilience.

Designing with these in mind from the start prevents reactive firefighting later.

Reliability

Zone distribution, multiple node pools, and scaling configuration directly influence workload uptime. Reliability is not added later — it is designed at cluster creation.

Cost Optimisation

Right‑sizing node pools and separating workload types prevents over‑provisioning. Production clusters that mix everything into one large node pool almost always overspend.


Production Guardrails – Early Principles

Before we move into deeper topics in the next posts, let’s establish a few foundational guardrails:

  • Separate system and user node pools
  • Use availability zones where supported
  • Plan IP addressing before deployment (again, we’ll dive into networking and how it affects workloads in more detail in the next post)
  • Treat upgrades as part of operations, not emergencies
  • Avoid “single giant node pool” design

These are not advanced optimisations. They are baseline expectations for production AKS.


What Comes Next

Now that we understand AKS architecture fundamentals, the next logical step is networking.

In the next post, we’ll go deep into:

  • Azure CNI and networking models
  • Ingress and traffic flow
  • Internal vs external exposure
  • Designing secure network boundaries

Because once architecture is clear, networking is what determines whether your cluster is merely functional or truly production ready.

See you on the next post!

What Is Azure Kubernetes Service (AKS) and Why Should You Care?

In every cloud native architecture discussion you have had over the last few years, or will have in the coming years, you can be guaranteed that someone has introduced – or will introduce – Kubernetes as a hosting option on which your solution will run.

There are also different options when Kubernetes enters the conversation – you can choose to run:

Kubernetes promises portability, scalability, and resilience. In reality, operating Kubernetes yourself is anything but simple.

Have you ever wondered whether Kubernetes is worth the complexity – or how to move from experimentation to something you can confidently run in production?

Me too – so let’s try and answer that question. As anyone who knows me or has followed me for a few years will know, I like to get down to the basics and “start at the start”.

This is the first post of a blog series where we’ll focus on Azure Kubernetes Service (AKS), while also referencing the core Kubernetes offerings along the way. The goal of this series is:

By the end (whenever that is – there is no set time or number of posts), we will have designed and built a production‑ready AKS cluster, aligned with the Azure Well‑Architected Framework, and suitable for real‑world enterprise workloads.

With the goal clearly defined, let’s start at the beginning—not by deploying workloads or tuning YAML, but by understanding:

  • Why AKS exists
  • What problems it solves
  • When it’s the right abstraction.

What Is Azure Kubernetes Service (AKS)?

Azure Kubernetes Service (AKS) is a managed Kubernetes platform provided by Microsoft Azure. It delivers a fully supported Kubernetes control plane while abstracting away much of the operational complexity traditionally associated with running Kubernetes yourself.

At a high level:

  • Azure manages the Kubernetes control plane (API server, scheduler, etcd)
  • You manage the worker nodes (VM size, scaling rules, node pools)
  • Kubernetes manages your containers and workloads

This division of responsibility is deliberate. It allows teams to focus on applications and platforms rather than infrastructure mechanics.

You still get:

  • Native Kubernetes APIs
  • Open‑source tooling (kubectl, Helm, GitOps)
  • Portability across environments

But without needing to design, secure, patch, and operate Kubernetes from scratch.

Why Should You Care About AKS?

The short answer:

AKS enables teams to build scalable platforms without becoming Kubernetes operators.

The longer answer depends on the problems you’re solving.

AKS becomes compelling when:

  • You’re building microservices‑based or distributed applications
  • You need horizontal scaling driven by demand
  • You want rolling updates and self‑healing workloads
  • You’re standardising on containers across teams
  • You need deep integration with Azure networking, identity, and security

Compared to running containers directly on virtual machines, AKS introduces:

  • Declarative configuration
  • Built‑in orchestration
  • Fine‑grained resource management
  • A mature ecosystem of tools and patterns

However, this series is not about adopting AKS blindly. Understanding why AKS exists—and when it’s appropriate—is essential before we design anything production‑ready.


AKS vs Azure PaaS Services: Choosing the Right Abstraction

Another common—and more nuanced—question is:

“Why use AKS at all when Azure already has PaaS services like App Service or Azure Container Apps?”

This is an important decision point, and one that shows up frequently in the Azure Architecture Center.

Azure PaaS Services

Azure PaaS offerings such as App Service, Azure Functions, and Azure Container Apps work well when:

  • You want minimal infrastructure management responsibility
  • Your application fits well within opinionated hosting models
  • Scaling and availability can be largely abstracted away
  • You’re optimising for developer velocity over platform control

They provide:

  • Very low operational overhead – the service is an “out of the box” offering where developers can get started immediately.
  • Built-in scaling and availability – scaling comes as part of the service based on demand, and can be configured based on predicted loads.
  • Tight integration with Azure services – integration with tools such as Azure Monitor and Application Insights for monitoring, Defender for Security monitoring and alerting, and Entra for Identity.

For many workloads, this is exactly the right choice.

AKS

AKS becomes the right abstraction when:

  • You need deep control over networking, runtime, and scheduling
  • You’re running complex, multi-service architectures
  • You require custom security, compliance, or isolation models
  • You’re building a shared internal platform rather than a single application

AKS sits between IaaS and fully managed PaaS:

Azure PaaS abstracts the platform for you. AKS lets you build the platform yourself—safely.

This balance of control and abstraction is what makes AKS suitable for production platforms at scale.


Exploring AKS in the Azure Portal

Before designing anything that could be considered “production‑ready”, it’s important to understand what Azure exposes out of the box – so let’s spin up an AKS instance using the Azure Portal.

Step 1: Create an AKS Cluster

  • Sign in to the Azure Portal
  • In the search bar at the top, search for Kubernetes Service
  • When you get to the “Kubernetes center page”, click on “Clusters” on the left menu (it should bring you here automatically). Select Create, and select “Kubernetes cluster”. Note that there are also options for “Automatic Kubernetes cluster” and “Deploy application” – we’ll address those in a later post.
  • Choose your Subscription and Resource Group
  • Enter a Cluster preset configuration, Cluster name and select a Region. You can choose from four different preset configurations which have clear explanations based on your requirements
  • I’ve gone for Dev/Test for the purposes of spinning up this demo cluster.
  • Leave all other options as default for now and click “Next” – we’ll revisit these in detail in later posts.

Step 2: Configure the Node Pool

  • Under Node pools, there is an agentpool automatically added for us. You can change this if needed to select a different VM size, and set a low min/max node count

    This is your first exposure to separating capacity management from application deployment.

    Step 3: Networking

    Under Networking, you will see options for Private/Public Access, and also for Container Networking. This is an important choice as there are two clear options:

    • Azure CNI Overlay – Pods get IPs from a private CIDR address space that is separate from the node VNet.
    • Azure CNI Node Subnet – Pods get IPs directly from the same VNet subnet as the nodes.

    You also have the option to integrate this into your own VNet which you can specify during the cluster creation process.

    Again, we’ll talk more about these options in a later post, but it’s important to understand the distinction between the two.
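The practical difference shows up in VNet IP consumption. A minimal sketch (node count and pods-per-node are assumptions for illustration):

```python
import ipaddress

# Illustrative comparison of VNet IP consumption for the two CNI modes.
# Node count and pods-per-node are assumptions for the sketch.

nodes, pods_per_node = 10, 30

# Azure CNI Node Subnet: nodes AND pods draw IPs from the VNet subnet.
vnet_ips_node_subnet = nodes + nodes * pods_per_node   # 310 VNet IPs

# Azure CNI Overlay: only nodes draw VNet IPs; pods use a separate
# private CIDR (e.g. a 192.168.0.0/16 overlay) that never touches the VNet.
pod_cidr = ipaddress.ip_network("192.168.0.0/16")
vnet_ips_overlay = nodes                               # 10 VNet IPs

print(vnet_ips_node_subnet, vnet_ips_overlay)
```

Same cluster shape, an order of magnitude difference in the VNet address space you have to plan for.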

    Step 4: Review and Create

    Select Review + Create – note at this point I have not selected any monitoring, security or integration with an Azure Container Registry and am just taking the defaults. Again (you’re probably bored of reading this….), we’ll deal with these in a later post dedicated to each topic.

    Once deployed, explore:

    • Node pools
    • Workloads
    • Services and ingresses
    • Cluster configuration

    Notice how much complexity is hidden – if you scroll back up to the “Azure-managed v Customer-managed” diagram, you have responsibility for managing:

    • Cluster nodes
    • Networking
    • Workloads
    • Storage

    Even though Azure abstracts away responsibility for things like key-value store, scheduler, controller and management of the cluster API, a large amount of responsibility still remains.


    What Comes Next in the Series

    This post sets the foundation for what AKS is and how it looks out of the box using a standard deployment with the “defaults”.

    Over the course of the series, we’ll move through the various concepts which will help to inform us as we move towards making design decisions for production workloads:

    • Kubernetes Architecture Fundamentals (control plane, node pools, and cluster design), and how they look in AKS
    • Networking for Production AKS (VNets, CNI, ingress, and traffic flow)
    • Identity, Security, and Access Control
    • Scaling, Reliability, and Resilience
    • Cost Optimisation and Governance
    • Monitoring, Alerting and Visualizations
    • Alignment with the Azure Well Architected Framework
    • And lots more ……

    See you on the next post!

    Top Highlights from Microsoft Ignite 2024: Key Azure Announcements

    This year, Microsoft Ignite was held in Chicago for in-person attendees as well as virtually, with key sessions live streamed. As usual, the Book of News was released to show the key announcements and you can find that at this link.

    From a personal standpoint, the Book of News was disappointing, as at first glance there seemed to be very few key announcements and enhancements for core Azure Infrastructure and Networking.

    However, there were some really great reveals that were announced at various sessions throughout Ignite, and I’ve picked out some of the ones that impressed me.

    Azure Local

    Azure Stack HCI is no more ….. it is now being renamed to Azure Local, which makes a lot more sense: Azure-managed appliances deployed locally but still managed from Azure via Arc.

    So, it’s just a rename, right? Wrong! The previous iteration was tied to specific hardware that had high costs. Azure Local now brings low-spec and low-cost options to the table. You can also use Azure Local in disconnected mode.

    More info can be found in this blog post and in this YouTube video.

    Azure Migrate Enhancements

    Azure Migrate is a product that has badly needed some improvements and enhancements, given the capabilities that some of its competitors in the market offer.

    The arrival of a Business case option enables customers to create a detailed comparison of the Total Cost of Ownership (TCO) for their on-premises estate versus the TCO on Azure, along with a year-on-year cash flow analysis as they transition their workloads to Azure. More details on that here.

    There was also an announcement during the Ignite Session around a tool called “Azure Migrate Explore” which looked like it provides you with a ready-made Business case PPT template generator that can be used to present cases to C-level. Haven’t seen this released yet, but one to look out for.

    Finally, one that may have been missed a few months ago – given the current need for customers to migrate from VMware on-premises deployments to Azure VMware Solution (which is already built in to Azure Migrate via either Appliance or RVTools import), it’s good to see that there is a preview feature around a direct path from VMware to Azure Stack HCI (or Azure Local – see above). This is a step forward for customers who need to keep their workloads on-premises for things like Data Residency requirements, while also getting the power of Azure Management. More details on that one here.

    Azure Network Security Perimeter

    I must admit, this one confused me a little bit at first glance but makes sense now.

    Network Security Perimeter allows organizations to define a logical network isolation boundary for PaaS resources (for example, an Azure Storage account or a SQL Database server) that are deployed outside your organization’s virtual networks.

    So, we’re talking about services that are either deployed outside of a VNET (for whatever reason) or are using SKUs that do not support VNET integration.

    More info can be found here.

    Azure Bastion Premium

    This has been in preview for a while but is now GA – Azure Bastion Premium offers enhanced security features such as private connectivity and graphical recordings of virtual machines connected through Bastion.

    Bastion offers enhanced security features that ensure customer virtual machines are connected securely and to monitor VMs for any anomalies that may arise.

    More info can be found here.

    Security Copilot integration with Azure Firewall

    The intelligence of Security Copilot is being integrated with Azure Firewall, which will help analysts perform detailed investigations of the malicious traffic intercepted by the IDPS feature of their firewalls across their entire fleet using natural language questions. These capabilities were launched on the Security Copilot portal and now are being integrated even more closely with Azure Firewall.

    The following capabilities can now be queried via the Copilot in Azure experience directly on the Azure portal where customers regularly interact with their Azure Firewalls: 

    • Generate recommendations to secure your environment using Azure Firewall’s IDPS feature
    • Retrieve the top IDPS signature hits for an Azure Firewall 
    • Enrich the threat profile of an IDPS signature beyond log information 
    • Look for a given IDPS signature across your tenant, subscription, or resource group 

    More details on these features can be found here.

    DNSSEC for Azure DNS

    I was surprised by this announcement – maybe I had assumed it was there, as it has been available as an AD DNS feature for quite some time. Good to see that it’s made its way up to Azure.

    Key benefits are:

    • Enhanced Security: DNSSEC helps prevent attackers from manipulating or poisoning DNS responses, ensuring that users are directed to the correct websites. 
    • Data Integrity: By signing DNS data, DNSSEC ensures that the information received from a DNS query has not been altered in transit. 
    • Trust and Authenticity: DNSSEC provides a chain of trust from the root DNS servers down to your domain, verifying the authenticity of DNS data. 

    More info on DNSSEC for Azure DNS can be found here.

    Azure Confidential Clean Rooms

    Some fella called Mark Russinovich was talking about this. And when that man talks, you listen.

    Designed for secure multi-party data collaboration, with Confidential Clean Rooms, you can share privacy sensitive data such as personally identifiable information (PII), protected health information (PHI) and cryptographic secrets confidently, thanks to robust trust guarantees that safeguard your data throughout its lifecycle from other collaborators and from Azure operators.

    This secure data sharing is powered by confidential computing, which protects data in-use by performing computations in hardware-based, attested Trusted Execution Environments (TEEs). These TEEs help prevent unauthorized access or modification of application code and data during use. 

    More info can be found here.

    Azure Extended Zones

    It’s good to see this feature going into GA, and hopefully it will provide a pathway for future AEZs in other locations.

    Azure Extended Zones are small-footprint extensions of Azure placed in metros, industry centers, or a specific jurisdiction to serve low latency and data residency workloads. They support virtual machines (VMs), containers, storage, and a selected set of Azure services and can run latency-sensitive and throughput-intensive applications close to end users and within approved data residency boundaries. More details here.

    .NET 9

    Final one, and slightly cheating here as this was announced at KubeCon the week before – .NET 9 has been announced. Note that this is an STS release with an end-of-support date of May 2026. .NET 8 is the current LTS version with an end-of-support date of November 2026 (details on lifecycles for .NET versions here).

    Link to the full release announcement for .NET 9 (including a link to the KubeCon keynote) can be found here.

    Conclusion

    It’s good to see that, in the firehose of announcements around AI and Copilot, there are still some really good enhancements and improvements coming out for Azure services.

    Azure Networking Zero to Hero – Network Security Groups

    In this post, I’m going to stay within the boundaries of our Virtual Network and briefly talk about Network Security Groups, which filter network traffic between Azure resources in an Azure virtual network.

    Overview

    So, it’s a Firewall, right?

    NOOOOOOOOOO!!!!!!!!

    While a Network Security Group (or NSG for short) contains Security Rules to allow or deny inbound/outbound traffic to/from several types of Azure Resources, it is not a Firewall (it may be what a Firewall looked like 25-30 years ago, but not now). NSGs can be used in conjunction with Azure Firewall and other network security services in Azure to help secure and shape how your traffic flows between subnets and resources.

    Default Rules

    When you create a subnet in your Virtual Network, you have the option to create an NSG which will be automatically associated with the subnet. However, you can also create an NSG and manually associate it with either a subnet, or directly to a Network Interface in a Virtual Machine.

    When an NSG is created, it always has a default set of Security Rules that look like this:

    The default Inbound rules are as follows:

    • 65000 — All Hosts/Resources inside the Virtual Network to Communicate with each other
    • 65001 — Allows Azure Load Balancer to communicate with the Hosts/resources
    • 65500 — Deny all other Inbound traffic

    The default Outbound rules are as follows:

    • 65000 — All Hosts/Resources inside the Virtual Network to Communicate with each other
    • 65001 — Allows all Internet Traffic outbound
    • 65500 — Deny all other Outbound traffic

    The default rules cannot be edited or removed. NSGs are initially created using a Zero-Trust model. The rules are processed in order of priority (the lowest-numbered rule is processed first), so you need to build your rules on top of the default ones (for example, allowing RDP or SSH access if not already in place).
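Priority-ordered, first-match-wins processing can be sketched in a few lines of Python. This is an illustrative model, not the real Azure data plane – the rule set mirrors the defaults described above plus one hypothetical RDP allow rule at priority 300:

```python
# Toy model of NSG inbound rule evaluation: rules are checked in ascending
# priority order and the first matching rule decides. The priority-300 RDP
# rule is a hypothetical custom rule; the 65xxx rules mirror the defaults.

rules = [
    (300,   "Allow", {"port": 3389, "source": "Internet"}),  # custom RDP allow
    (65000, "Allow", {"source": "VirtualNetwork"}),
    (65001, "Allow", {"source": "AzureLoadBalancer"}),
    (65500, "Deny",  {}),                                    # matches anything
]

def evaluate(traffic: dict) -> str:
    for _priority, action, match in sorted(rules):
        if all(traffic.get(k) == v for k, v in match.items()):
            return action
    return "Deny"

print(evaluate({"port": 3389, "source": "Internet"}))  # custom rule wins
print(evaluate({"port": 443,  "source": "Internet"}))  # falls through to deny
```

Note how the empty match at priority 65500 is what makes the model zero-trust: anything not explicitly allowed earlier is denied.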

    Configuration and Traffic Flow

    Some important things to note:

    • The default “65000” rules for both Inbound and Outbound – this allows all virtual network traffic. It means that if we have 2 subnets which each have a virtual machine, these would be able to communicate with each other without adding any additional rules.
    • As well as IP addresses and address ranges, we can use Service Tags, which represent groups of IP address prefixes from a range of Azure services. These are managed and updated by Microsoft, so you can use them instead of having to create and manage multiple public IP prefixes for each service. You can find a full list of available Service Tags that can be used with NSGs at this link. In the image above, “VirtualNetwork” and “AzureLoadBalancer” are Service Tags.
    • A virtual network subnet or interface can only have one NSG, but an NSG can be assigned to many subnets or interfaces. Tip from experience: sharing one NSG across many subnets is not a good idea – if you have an application design that uses multiple Azure services, split these services into dedicated subnets and apply an NSG to each subnet.
    • When using a NSG associated with a subnet and a dedicated NSG associated with a network interface, the NSG associated with the Subnet is always evaluated first for Inbound Traffic, before then moving on to the NSG associated with the NIC. For Outbound Traffic, it’s the other way around — the NSG on the NIC is evaluated first, and then the NSG on the Subnet is evaluated. This process is explained in detail here.
    • If you don’t have a network security group associated with a subnet or network interface, all inbound and outbound traffic is allowed – which is exactly why you should associate one with any subnet that hosts workloads.
    • An NSG can contain a maximum of 1000 rules. Previously, this was 200 and could be raised by logging a ticket with Microsoft, but the max (at time of writing) is 1000 and cannot be increased. There is also a limit of 5000 NSGs per subscription.

    Logging and Visibility

    • Important – Turn on NSG Flow Logs. This is a feature of Azure Network Watcher that allows you to log information about IP traffic flowing through a network security group,  including details on source and destination IP addresses, ports, protocols, and whether traffic was permitted or denied. You can find more in-depth details on flow logging here, and a tutorial on how to turn it on here.
    • To enhance this, you can use Traffic Analytics, which analyzes Azure Network Watcher flow logs to provide insights into traffic flow in your Azure cloud.

    Conclusion

    NSGs are fundamental to securing inbound and outbound traffic for subnets within an Azure Virtual Network, and form one of the first layers of defense to protect application integrity and reduce the risk of data loss.

    However, as I said at the start of this post, an NSG is not a Firewall. The layer 3 and layer 4 port-based protection that NSGs provide has significant limitations: malicious traffic carried over allowed protocols such as SSH and HTTPS can pass straight through undetected.

    And that’s one of the biggest mistakes I see people make – they assume that NSGs will do the job because Firewalls and other network security services are too expensive.

    Therefore, NSGs should be used in conjunction with other network security tools, such as Azure Firewall and Web Application Firewall (WAF), for any devices presented externally to the internet or other private networks. I’ll cover these in detail in later posts.

    Hope you enjoyed this post, until next time!!

    Azure Networking Zero to Hero – Routing in Azure

    In this post, I’m going to try and explain Routing in Azure. This is a topic that grows in complexity the more you expand your footprint in Azure in terms of both Virtual Networks, and also the services you use to both create your route tables and route your traffic.

    Understanding Azure’s Default Routing

    As we saw in the previous post, when a virtual network is created, this also creates a route table. This contains a default set of routes known as System Routes, which are shown here:

    Source  | Address prefixes              | Next hop type
    Default | Virtual Network Address Space | Virtual network
    Default | 0.0.0.0/0                     | Internet
    Default | 10.0.0.0/8                    | None (Dropped)
    Default | 172.16.0.0/12                 | None (Dropped)
    Default | 192.168.0.0/16                | None (Dropped)

    Let’s explain the “Next hop types” in a bit more detail:

    • Virtual network: Routes traffic between address ranges within the address space of a virtual network. So let’s say I have a Virtual Network with the 10.0.0.0/16 address space defined, with VM1 in a subnet using the 10.0.1.0/24 address range trying to reach VM2 in a subnet using the 10.0.2.0/24 address range. Azure knows to keep this within the Virtual Network and routes the traffic successfully.
    • Internet: Routes traffic specified by the address prefix to the Internet. If the destination address range is not part of a Virtual Network address space, it gets routed to the Internet. The only exception to this rule is traffic to an Azure service – this goes across the Azure backbone network no matter which region the service sits in.
    • None: Traffic routed to the None next hop type is dropped. This automatically includes all private IP addresses as defined by RFC 1918, with the exception of your Virtual Network address space.

    Simple, right? Well, it’s about to get more complicated …..

    Additional Default Routes

    Azure adds more default system routes for different Azure capabilities, but only if you enable the capabilities:

    Source                  | Address prefixes                                                                          | Next hop type
    Default                 | Peered Virtual Network Address Space                                                      | VNet peering
    Virtual network gateway | Prefixes advertised from on-premises via BGP, or configured in the local network gateway  | Virtual network gateway
    Default                 | Multiple                                                                                  | VirtualNetworkServiceEndpoint

    So let’s take a look at these:

    • Virtual network (VNet) peering: when a peering is created between two VNets, Azure adds the address spaces of each of the peered VNets to the route tables of the source VNets.
    • Virtual network gateway: this happens when S2S VPN or ExpressRoute connectivity is established, and adds address spaces that are advertised from either Local Network Gateways or on-premises gateways via BGP (Border Gateway Protocol). These address spaces should be summarized to the largest address range coming from on-premises, as there is a limit of 400 routes per route table.
    • VirtualNetworkServiceEndpoint: this happens when creating a direct service endpoint for an Azure service, and enables private IP addresses in the VNet to reach the endpoint of that service without needing a public IP address on the VNet.
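On the summarization point: Python’s `ipaddress` module can show what collapsing contiguous prefixes buys you against the 400-route limit. The example ranges below are made up for illustration:

```python
import ipaddress

# Contiguous on-prem prefixes advertised via BGP each count against the
# 400-route limit, so summarise them before advertising. Ranges are
# illustrative examples, not real on-prem address space.

advertised = [
    ipaddress.ip_network("10.1.0.0/24"),
    ipaddress.ip_network("10.1.1.0/24"),
    ipaddress.ip_network("10.1.2.0/24"),
    ipaddress.ip_network("10.1.3.0/24"),
]

summarised = list(ipaddress.collapse_addresses(advertised))
print(summarised)  # four /24 routes collapse into a single /22
```

Four routes become one; at the scale of a real on-premises estate, this is the difference between fitting in the route table and not.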

    Custom Routes

    The limitation of sticking with System Routes is that everything is done for you in the background – there is no way to make changes.

    This is why, if you need to change how your traffic gets routed, you should use Custom Routes, which is done by creating a Route Table. This is then used to override Azure’s default system routes, or to add more routes to a subnet’s route table.

    You can specify the following “next hop types” when creating user-defined routes:

    • Virtual Appliance: This is typically Azure Firewall, a load balancer, or another virtual appliance from the Azure Marketplace. The appliance is typically deployed in a different subnet from the resources that you wish to route through it. You can define a route with 0.0.0.0/0 as the address prefix and a next hop type of virtual appliance, with the next hop address set to the internal IP address of the virtual appliance, as shown below. This is useful if you want all outbound traffic to be inspected by the appliance:
    • Virtual network gateway: used when you want traffic destined for specific address prefixes routed to a virtual network gateway. This is useful if you have an on-premises device that inspects traffic and determines whether to forward or drop it.
    • None: used when you want to drop traffic to an address prefix, rather than forwarding the traffic to a destination.
    • Virtual network: used when you want to override the default routing within a virtual network.
    • Internet: used when you want to explicitly route traffic destined to an address prefix to the Internet

    You can also use Service Tags as the address prefix instead of an IP Range.

    How does Azure select which route to use?

    When outbound traffic is sent from a subnet, Azure selects a route based on the destination IP address, using the longest prefix match algorithm. So if two routes exist, one with 10.0.0.0/16 and one with 10.0.0.0/24, Azure will select the /24 as it has the longest prefix.
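Longest prefix match is easy to demonstrate with the stdlib `ipaddress` module. This is an illustrative model of the selection logic described above, with a made-up route table (the 0.0.0.0/0 route to a virtual appliance mirrors the UDR pattern from the previous section):

```python
import ipaddress

# Toy route table: next-hop names are illustrative, not real Azure config.
routes = {
    ipaddress.ip_network("10.0.0.0/16"): "Virtual network",
    ipaddress.ip_network("10.0.0.0/24"): "Virtual appliance",
    ipaddress.ip_network("0.0.0.0/0"):   "Internet",
}

def select(dest: str) -> str:
    """Pick the matching route with the longest prefix, as Azure does."""
    ip = ipaddress.ip_address(dest)
    candidates = [net for net in routes if ip in net]
    best = max(candidates, key=lambda n: n.prefixlen)  # longest prefix wins
    return routes[best]

print(select("10.0.0.20"))  # the /24 wins over the /16 and /0
print(select("10.0.5.20"))  # only the /16 (and /0) match
print(select("8.8.8.8"))    # falls through to the default route
```

Every destination matches 0.0.0.0/0, so the default route only wins when nothing more specific does.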

    If multiple routes contain the same address prefix, Azure selects the route type, based on the following priority:

    • User-defined route
    • BGP route
    • System route

    So, the initial System Routes are always the last ones to be checked.

    Conclusion and Resources

    I’ve put in some links already in the article. The main place to go for a more in-depth deep dive on Routing is this MS Learn Article on Virtual Network Traffic Routing.

    As regards people to follow, there’s no one better than my fellow MVP Aidan Finn who writes extensively about networking over at his blog. He also delivered this excellent session at the Limerick Dot Net Azure User Group last year which is well worth a watch for gaining a deep understanding of routing in Azure.

    Hope you enjoyed this post, until next time!!