What Is Azure Kubernetes Service (AKS) and Why Should You Care?

In every cloud native architecture discussion you have had over the last few years, or will have in the coming years, you can be guaranteed that someone has introduced (or will introduce) Kubernetes as a hosting option for your solution.

There are also different options to weigh up once Kubernetes enters the conversation, particularly around how much of the platform you want to run and manage yourself.

Kubernetes promises portability, scalability, and resilience. In reality, operating Kubernetes yourself is anything but simple.

Have you ever wondered whether Kubernetes is worth the complexity, or how to move from experimentation to something you can confidently run in production?

Me too – so let’s try and answer that question. As anyone who knows me or has followed me for a few years will know, I like to get down to the basics and “start at the start”.

This is the first post of a blog series where we’ll focus on Azure Kubernetes Service (AKS), while also using core Kubernetes as a reference point along the way. The goal of this series is:

By the end (whenever that is – there is no set time or number of posts), we will have designed and built a production‑ready AKS cluster, aligned with the Azure Well‑Architected Framework, and suitable for real‑world enterprise workloads.

With the goal clearly defined, let’s start at the beginning—not by deploying workloads or tuning YAML, but by understanding:

  • Why AKS exists
  • What problems it solves
  • When it’s the right abstraction

What Is Azure Kubernetes Service (AKS)?

Azure Kubernetes Service (AKS) is a managed Kubernetes platform provided by Microsoft Azure. It delivers a fully supported Kubernetes control plane while abstracting away much of the operational complexity traditionally associated with running Kubernetes yourself.

At a high level:

  • Azure manages the Kubernetes control plane (API server, scheduler, etcd)
  • You manage the worker nodes (VM size, scaling rules, node pools)
  • Kubernetes manages your containers and workloads

This division of responsibility is deliberate. It allows teams to focus on applications and platforms rather than infrastructure mechanics.

You still get:

  • Native Kubernetes APIs
  • Open‑source tooling (kubectl, Helm, GitOps)
  • Portability across environments

But without needing to design, secure, patch, and operate Kubernetes from scratch.
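To make this concrete, here is a minimal sketch of what day-to-day interaction with a managed cluster looks like from the command line. The resource group and cluster names are placeholders (nothing in this post has created them yet); the point is that once Azure hands you the control plane, the standard open-source tooling just works:

az aks get-credentials --resource-group rg-demo --name aks-demo   # pull kubeconfig credentials for the managed cluster
kubectl get nodes                                                  # the same kubectl you'd use against any Kubernetes cluster
helm list --all-namespaces                                         # Helm and other OSS tooling work unchanged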

Why Should You Care About AKS?

The short answer:

AKS enables teams to build scalable platforms without becoming Kubernetes operators.

The longer answer depends on the problems you’re solving.

AKS becomes compelling when:

  • You’re building microservices‑based or distributed applications
  • You need horizontal scaling driven by demand
  • You want rolling updates and self‑healing workloads
  • You’re standardising on containers across teams
  • You need deep integration with Azure networking, identity, and security

Compared to running containers directly on virtual machines, AKS introduces:

  • Declarative configuration
  • Built‑in orchestration
  • Fine‑grained resource management
  • A mature ecosystem of tools and patterns
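To give a flavour of what “declarative configuration” and “built-in orchestration” mean in practice, here is a hedged kubectl sketch against a hypothetical deployment named web (the deployment, container name and image are illustrative, not something created in this post):

kubectl scale deployment web --replicas=5                          # declare the desired replica count; Kubernetes converges on it
kubectl set image deployment/web web=myregistry.azurecr.io/web:v2  # trigger a rolling update to a new image
kubectl rollout status deployment/web                              # watch Kubernetes progress the rollout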

However, this series is not about adopting AKS blindly. Understanding why AKS exists—and when it’s appropriate—is essential before we design anything production‑ready.


AKS vs Azure PaaS Services: Choosing the Right Abstraction

Another common—and more nuanced—question is:

“Why use AKS at all when Azure already has PaaS services like App Service or Azure Container Apps?”

This is an important decision point, and one that shows up frequently in the Azure Architecture Center.

Azure PaaS Services

Azure PaaS offerings such as App Service, Azure Functions, and Azure Container Apps work well when:

  • You want minimal infrastructure management responsibility
  • Your application fits well within opinionated hosting models
  • Scaling and availability can be largely abstracted away
  • You’re optimising for developer velocity over platform control

They provide:

  • Very low operational overhead – the service is an “out of the box” offering where developers can get started immediately.
  • Built-in scaling and availability – scaling comes as part of the service based on demand, and can be configured based on predicted loads.
  • Tight integration with Azure services – integration with tools such as Azure Monitor and Application Insights for monitoring, Microsoft Defender for Cloud for security monitoring and alerting, and Microsoft Entra ID for identity.

For many workloads, this is exactly the right choice.

AKS

AKS becomes the right abstraction when:

  • You need deep control over networking, runtime, and scheduling
  • You’re running complex, multi-service architectures
  • You require custom security, compliance, or isolation models
  • You’re building a shared internal platform rather than a single application

AKS sits between IaaS and fully managed PaaS:

Azure PaaS abstracts the platform for you. AKS lets you build the platform yourself—safely.

This balance of control and abstraction is what makes AKS suitable for production platforms at scale.


Exploring AKS in the Azure Portal

Before designing anything that could be considered “production‑ready”, it’s important to understand what Azure exposes out of the box – so let’s spin up an AKS instance using the Azure Portal.

Step 1: Create an AKS Cluster

  • Sign in to the Azure Portal
  • In the search bar at the top, search for Kubernetes Service
  • When you get to the “Kubernetes center page”, click on “Clusters” on the left menu (it should bring you here automatically). Select Create, and select “Kubernetes cluster”. Note that there are also options for “Automatic Kubernetes cluster” and “Deploy application” – we’ll address those in a later post.
  • Choose your Subscription and Resource Group
  • Choose a Cluster preset configuration, enter a Cluster name and select a Region. There are four different preset configurations to choose from, each with a clear explanation to help you match it to your requirements
  • I’ve gone for Dev/Test for the purposes of spinning up this demo cluster.
  • Leave all other options as default for now and click “Next” – we’ll revisit these in detail in later posts.
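If you would rather script this step than click through the portal, a roughly equivalent cluster can be created with the Azure CLI. This is a minimal sketch with placeholder names rather than the exact Dev/Test preset:

az group create --name rg-aks-demo --location westeurope                                        # resource group to hold the cluster
az aks create --resource-group rg-aks-demo --name aks-demo --node-count 2 --generate-ssh-keys   # small cluster, otherwise defaults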

Step 2: Configure the Node Pool

  • Under Node pools, there is an agentpool automatically added for us. You can change this if needed to select a different VM size, and set a low min/max node count

    This is your first exposure to separating capacity management from application deployment.
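The same separation is visible in the CLI, where node pools are managed as their own objects, independent of the workloads that run on them. A hedged example, using placeholder cluster and pool names, that adds a user node pool with the cluster autoscaler enabled:

az aks nodepool add --resource-group rg-aks-demo --cluster-name aks-demo --name userpool --node-vm-size Standard_D2s_v3 --enable-cluster-autoscaler --min-count 1 --max-count 3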

    Step 3: Networking

Under Networking, you will see options for Private/Public Access, and also for Container Networking. This is an important choice, as there are two clear options:

    • Azure CNI Overlay – Pods get IPs from a private CIDR address space that is separate from the node VNet.
    • Azure CNI Node Subnet – Pods get IPs directly from the same VNet subnet as the nodes.

    You also have the option to integrate this into your own VNet which you can specify during the cluster creation process.

Again, we’ll talk more about these options in a later post, but it’s important to understand the distinction between the two.
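For reference, the same choice surfaces as flags if you create the cluster from the Azure CLI. A hedged sketch, with placeholder names, a placeholder subnet ID, and an illustrative pod CIDR:

# Azure CNI Overlay: pod IPs come from a separate, private CIDR
az aks create --resource-group rg-aks-demo --name aks-demo --network-plugin azure --network-plugin-mode overlay --pod-cidr 192.168.0.0/16

# Azure CNI Node Subnet: pod IPs come from the same VNet subnet as the nodes
az aks create --resource-group rg-aks-demo --name aks-demo --network-plugin azure --vnet-subnet-id <subnet-resource-id>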

    Step 4: Review and Create

    Select Review + Create – note at this point I have not selected any monitoring, security or integration with an Azure Container Registry and am just taking the defaults. Again (you’re probably bored of reading this….), we’ll deal with these in a later post dedicated to each topic.

    Once deployed, explore:

    • Node pools
    • Workloads
    • Services and ingresses
    • Cluster configuration

    Notice how much complexity is hidden – if you scroll back up to the “Azure-managed v Customer-managed” diagram, you have responsibility for managing:

    • Cluster nodes
    • Networking
    • Workloads
    • Storage

Even though Azure abstracts away responsibility for components like the etcd key-value store, the scheduler, the controller manager and the cluster API server, a large amount of responsibility still remains.


    What Comes Next in the Series

    This post sets the foundation for what AKS is and how it looks out of the box using a standard deployment with the “defaults”.

    Over the course of the series, we’ll move through the various concepts which will help to inform us as we move towards making design decisions for production workloads:

    • Kubernetes Architecture Fundamentals (control plane, node pools, and cluster design), and how they look in AKS
    • Networking for Production AKS (VNets, CNI, ingress, and traffic flow)
    • Identity, Security, and Access Control
    • Scaling, Reliability, and Resilience
    • Cost Optimisation and Governance
    • Monitoring, Alerting and Visualizations
    • Alignment with the Azure Well Architected Framework
    • And lots more ……

    See you on the next post!

    100 Days of Cloud – Day 71: Microsoft Sentinel

It’s Day 71 of my 100 Days of Cloud journey, and today’s post is all about Microsoft Sentinel. This is the new name for Azure Sentinel, following on from the rebranding of a number of Microsoft Azure services at Ignite 2021.

    Image Credit: Microsoft

    Microsoft Sentinel is a cloud-native Security Information and Event Management (SIEM) and Security Orchestration, Automation, and Response (SOAR) solution. It provides intelligent security analytics and threat intelligence across the enterprise, providing a single solution for attack detection, threat visibility, proactive hunting, and threat response.

    SIEM and SOAR

We briefly touched on SIEM and SOAR in the previous post on Microsoft Defender for Cloud. Before we go further, let’s note how Gartner defines SIEM and SOAR:

    • Security information and event management (SIEM) technology supports threat detection, compliance and security incident management through the collection and analysis (both near real time and historical) of security events, as well as a wide variety of other event and contextual data sources. The core capabilities are a broad scope of log event collection and management, the ability to analyze log events and other data across disparate sources, and operational capabilities (such as incident management, dashboards and reporting).
    • SOAR refers to technologies that enable organizations to collect inputs monitored by the security operations team. For example, alerts from the SIEM system and other security technologies — where incident analysis and triage can be performed by leveraging a combination of human and machine power — help define, prioritize and drive standardized incident response activities. SOAR tools allow an organization to define incident analysis and response procedures in a digital workflow format.

    Overview of Sentinel Functionality

Microsoft Sentinel gives a single view of your entire estate – users, devices, applications and infrastructure – across both on-premises and multiple cloud environments. The key features are:

    • Collect data at cloud scale across all users, devices, applications, and infrastructure, both on-premises and in multiple clouds.
    • Detect previously undetected threats, and minimize false positives using Microsoft’s analytics and unparalleled threat intelligence.
    • Investigate threats with artificial intelligence, and hunt for suspicious activities at scale, tapping into years of cyber security work at Microsoft.
    • Respond to incidents rapidly with built-in orchestration and automation of common tasks.

Sentinel can ingest alerts not just from Microsoft solutions such as Defender, Office 365 and Azure AD, but from a multitude of 3rd-party and multi-cloud providers such as Akamai, Amazon, Barracuda, Cisco, Fortinet, Google, Qualys and Sophos (and that’s just to name a few – you can find a full list here). These are what’s known as Data Sources, and the data is ingested using the wide range of built-in connectors that are available:

    Image Credit: Microsoft

    Once your data sources are connected, the data is monitored using Sentinel integration with Azure Monitor Workbooks, which allows you to visualize your data:

    Image Credit: Microsoft

    Once the data and workbooks are in place, Sentinel uses analytics and machine learning rules to map your network behaviour and to combine multiple related alerts into incidents which you can view as a group to investigate and resolve possible threats. The benefit here is that Sentinel lowers the noise that is created by multiple alerts and reduces the number of alerts that you need to react to:

    Image Credit: Microsoft

Sentinel’s automation and orchestration playbooks are built on Azure Logic Apps, and there is a growing gallery of built-in playbooks to choose from. These are based on standard and repeatable events and, just like standard Logic Apps, are triggered by a particular action or event:

    Image Credit: Microsoft

Last but not least, Sentinel has investigation tools that go deep to find the root cause and scope of a potential security threat, and hunting tools based on the MITRE ATT&CK framework which enable you to hunt for threats across your organization’s data sources before an event is triggered.

    Do I need both Defender for Cloud and Sentinel?

My advice on this is yes – because they are two different products that integrate with and complement each other.

    Sentinel has the ability to detect, investigate and remediate threats. In order for Sentinel to do this, it needs a stream of data from Defender for Cloud or other 3rd party solutions.

    Conclusion

    We’ve seen how powerful Microsoft Sentinel can be as a tool to protect your entire infrastructure across multiple providers and platforms. You can find more in-depth details on Microsoft Sentinel here.

    Hope you enjoyed this post, until next time!

    100 Days of Cloud – Day 70: Microsoft Defender for Cloud

It’s Day 70 of my 100 Days of Cloud journey, and today’s post is all about Azure Security Center! There’s one problem though: it’s not called that anymore…

    At Ignite 2021 Fall edition, Microsoft announced that the Azure Security Center and Azure Defender products were being rebranded and merged into Microsoft Defender for Cloud.

    Overview

    Defender for Cloud is a cloud-based tool for managing the security of your multi-vendor cloud and on-premises infrastructure. With Defender for Cloud, you can:

    • Assess: Understand your current security posture using Secure score which tells you your current security situation: the higher the score, the lower the identified risk level.
    • Secure: Harden all connected resources and services using either detailed remediation steps or an automated “Fix” button.
• Defend: Detect and resolve threats to those resources and services, which can be sent as email alerts or streamed to SIEM (Security Information and Event Management), SOAR (Security Orchestration, Automation, and Response) or IT Service Management solutions as required.
    Image Credit: Microsoft

    Pillars

    Microsoft Defender for Cloud’s features cover the two broad pillars of cloud security:

    • Cloud security posture management

    CSPM provides visibility to help you understand your current security situation, and hardening guidance to help improve your security.

    Central to this is Secure Score, which continuously assesses your subscriptions and resources for security issues. It then presents the findings into a single score and provides recommended actions for improvement.

    The guidance in Secure Score is provided by the Azure Security Benchmark, and you can also add other standards such as CIS, NIST or custom organization-specific requirements.

    • Cloud workload protection

Defender for Cloud offers security alerts that are powered by Microsoft Threat Intelligence. It also includes a range of advanced, intelligent protections for your workloads. The workload protections are provided through Microsoft Defender plans specific to the types of resources in your subscriptions.

    The Defender plans page of Microsoft Defender for Cloud offers the following plans for comprehensive defenses for the compute, data, and service layers of your environment:

• Microsoft Defender for servers
• Microsoft Defender for Storage
• Microsoft Defender for SQL
• Microsoft Defender for Containers
• Microsoft Defender for App Service
• Microsoft Defender for Key Vault
• Microsoft Defender for Resource Manager
• Microsoft Defender for DNS
• Microsoft Defender for open-source relational databases
• Microsoft Defender for Azure Cosmos DB (Preview)

    Azure, Hybrid and Multi-Cloud Protection

Defender for Cloud is an Azure-native service, so many Azure services are monitored and protected without the need for agent deployment. If agent deployment is needed, Defender for Cloud can deploy the Log Analytics agent to gather data. Azure-native protections include:

• Azure PaaS: Detect threats targeting Azure services including Azure App Service, Azure SQL, Azure Storage Account, and more data services.
• Azure Data Services: automatically classify your data in Azure SQL, and get assessments for potential vulnerabilities across Azure SQL and Storage services.
• Networks: reduce access to virtual machine ports using just-in-time VM access, hardening your network by preventing unnecessary access.

For hybrid environments and to protect your on-premises machines, these devices are registered with Azure Arc (which we touched on back on Day 44) and use Defender for Cloud’s advanced security features.

    For other cloud providers such as AWS and GCP:

• Defender for Cloud’s CSPM features assess AWS and GCP resources according to their own specific security requirements, and these are reflected in your secure score recommendations.
    • Microsoft Defender for servers brings threat detection and advanced defenses to your Windows and Linux EC2 instances. This plan includes the integrated license for Microsoft Defender for Endpoint amongst other features.
    • Microsoft Defender for Containers brings threat detection and advanced defenses to your Amazon EKS and Google’s Kubernetes Engine (GKE) clusters.

We can see in the screenshot below how the Defender for Cloud overview page in the Azure Portal gives a full view of resources across Azure and multi-cloud subscriptions, including combined Secure score, Workload protections, Regulatory compliance, Firewall manager and Inventory.

    Image Credit: Microsoft

    Conclusion

    You can find more in-depth details on how Microsoft Defender for Cloud can protect your Azure, Hybrid and Multi-Cloud Workloads here.

    Hope you enjoyed this post, until next time!

    100 Days of Cloud – Day 61: Azure Monitor Metrics and Logs

It’s Day 61 of my 100 Days of Cloud journey, and today I’m continuing to look at Azure Monitor, digging deeper into Azure Monitor Metrics and Azure Monitor Logs.

    In our high level overview diagram, we saw that Metrics and Logs are the Raw Data that has been collected from the data sources.

    Image Credit – Microsoft

Let’s take a quick look at both options and what they are used for, as that will give us an insight into why we need both of them!

    Azure Monitor Metrics

Azure Monitor Metrics collects data from monitored resources and stores it in a time series database (for an open-source equivalent, think InfluxDB). Metrics are numerical values that are collected at regular intervals and describe some aspect of a system at a particular time.

    Each set of metric values is a time series with the following properties:

    • The time that the value was collected.
    • The resource that the value is associated with.
    • A namespace that acts like a category for the metric.
    • A metric name.
    • The value itself.

    Once our metrics are collected, there are a number of options we have for using them, including:

    • Analyze – Use Metrics Explorer to analyze collected metrics on a chart and compare metrics from various resources.
    • Alert – Configure a metric alert rule that sends a notification or takes automated action when the metric value crosses a threshold.
    • Visualize – Pin a chart from Metrics Explorer to an Azure dashboard, or export the results of a query to Grafana to use its dashboarding and combine with other data sources.
    • Automate – Increase or decrease resources based on a metric value crossing a threshold.
    • Export – Route metrics to logs to analyze data in Azure Monitor Metrics together with data in Azure Monitor Logs and to store metric values for longer than 93 days.
    • Archive – Archive the performance or health history of your resource for compliance, auditing, or offline reporting purposes.
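As a quick illustration of the “Analyze” option from the command line, here is a hedged Azure CLI sketch that pulls CPU metrics for a virtual machine. The subscription, resource group and VM names in the resource ID are placeholders:

az monitor metrics list --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>" --metric "Percentage CPU" --interval PT5M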

    Azure Monitor can collect metrics from a number of sources:

    • Azure Resources – gives visibility into their health and performance over a period of time.
    • Applications – detect performance issues and track trends in how the application is being used.
    • Virtual Machine Agents – collect guest OS metrics from Windows or Linux VMs.
• Custom Metrics can also be defined for an app that’s monitored by Application Insights.

    We can use Metrics Explorer to analyze the metric data and chart the values over time.

    Image Credit – Microsoft

When it comes to retention:

    • Platform metrics are stored for 93 days.
    • Guest OS Metrics sent to Azure Monitor Metrics are stored for 93 days.
    • Guest OS Metrics collected by the Log Analytics agent are stored for 31 days, and can be extended up to 2 years.
    • Application Insight log-based metrics are variable and depend on the events in the underlying logs (31 days to 2 years).

    You can find more details on Azure Monitor Metrics here.

    Azure Monitor Logs

Azure Monitor Logs collects and organizes log and performance data from monitored resources. Log data is stored in a structured format which can then be queried using Kusto Query Language (KQL).

    Once our logs are collected, there are a number of options we have for using them, including:

    • Analyze – Use Log Analytics in the Azure portal to write log queries and interactively analyze log data by using a powerful analysis engine.
    • Alert – Configure a log alert rule that sends a notification or takes automated action when the results of the query match a particular result.
    • Visualize –
      • Pin query results rendered as tables or charts to an Azure dashboard.
      • Export the results of a query to Power BI to use different visualizations and share with users outside Azure.
      • Export the results of a query to Grafana to use its dashboarding and combine with other data sources.
    • Get insights – Logs support insights that provide a customized monitoring experience for particular applications and services.
    • Export – Configure automated export of log data to an Azure storage account or Azure Event Hubs, or build a workflow to retrieve log data and copy it to an external location by using Azure Logic Apps.

    You need to create a Log Analytics Workspace in order to store the data. You can use Log Analytics Workspaces for Azure Monitor, but also to store data from other Azure services such as Sentinel or Defender for Cloud in the same workspace.
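If you want to script the workspace creation rather than use the portal, a minimal Azure CLI sketch (the resource group, workspace name and region are placeholders) looks like this:

az monitor log-analytics workspace create --resource-group rg-monitoring --workspace-name law-demo --location westeurope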

    Each workspace contains multiple tables that are organized into separate columns with multiple rows of data. Each table is defined by a unique set of columns. Rows of data provided by the data source share those columns. Log queries define columns of data to retrieve and provide output to different features of Azure Monitor and other services that use workspaces.

    Image Credit: Microsoft

You can then use Log Analytics to edit and run log queries and to analyze the output. Log queries are the method of retrieving data from the Log Analytics Workspace, and they are written in Kusto Query Language (KQL). You can write log queries in Log Analytics to interactively analyze their results, use them in alert rules to be proactively notified of issues, or include their results in workbooks or dashboards.
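To give a feel for what a log query looks like, here is a hedged example run through the Azure CLI rather than the portal. The workspace ID is a placeholder, the Heartbeat table assumes agents are reporting in, and depending on your CLI version the command may require the log-analytics extension:

az monitor log-analytics query --workspace <workspace-customer-id> --analytics-query "Heartbeat | summarize LastSeen = max(TimeGenerated) by Computer | top 10 by LastSeen desc"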

    You can learn about KQL in more detail here, and find more details about Azure Monitor Logs here.

    Conclusion

And that’s a brief look at Azure Monitor Metrics and Logs. We can see the differences between them, but also how they can work together to build a powerful monitoring stack that can go right down to automating fixes for alerts as they happen!

    Hope you enjoyed this post, until next time!

    Monitoring with Grafana and InfluxDB using Docker Containers — Part 4: Install and Use Telegraf with PowerShell, send data to InfluxDB, and get the Dashboard working!

    This post originally appeared on Medium on May 14th 2021

    Welcome to Part 4 and the final part of my series on setting up Monitoring for your Infrastructure using Grafana and InfluxDB.


Last time, we set up InfluxDB as our Datasource for the data and metrics we’re going to use in Grafana. We also downloaded the JSON for our Dashboard from the Grafana Dashboards site and imported it into our Grafana instance. That finished off the groundwork of getting our Monitoring System built and ready for use.

In the final part, I’ll show you how to install the Telegraf data collector agent on our WSUS Server. I’ll then configure the telegraf.conf file to run a PowerShell script, which will in turn send all collected metrics back to our InfluxDB instance. Finally, I’ll show you how to get the data from InfluxDB to display in our Dashboard.

    Telegraf Install and Configuration on Windows

Telegraf is a plugin-driven server agent for collecting and sending metrics and events from databases, systems, and IoT sensors. It can be downloaded directly from the InfluxData website, and comes in versions for all OS’s (OS X, Ubuntu/Debian, RHEL/CentOS, Windows). There is also a Docker image available for each version!

    To download for Windows, we use the following command in Powershell:

    wget https://dl.influxdata.com/telegraf/releases/telegraf-1.18.2_windows_amd64.zip -UseBasicParsing -OutFile telegraf-1.18.2_windows_amd64.zip

This downloads the file locally. You then use this command to extract the archive to the default destination:

    Expand-Archive .\telegraf-1.18.2_windows_amd64.zip -DestinationPath 'C:\Program Files\InfluxData\telegraf\'

    Once the archive gets extracted, we have 2 files in the folder: telegraf.exe, and telegraf.conf:

Telegraf.exe is the Data Collector Service executable and natively supports running as a Windows Service. To install the service, run the following command from PowerShell:

    C:\"Program Files"\InfluxData\Telegraf\Telegraf-1.18.2\telegraf.exe --service install

    This will install the Telegraf Service, as shown here under services.msc:

Telegraf.conf is the configuration file; telegraf.exe reads it to see what metrics it needs to collect and send to the specified destination. The download above contains a template telegraf.conf file which will return the recommended Windows system metrics.

To test that telegraf is working, we’ll run this command from the directory where telegraf.exe is located:

    .\telegraf.exe --config telegraf.conf --test

As we can see, this runs telegraf.exe and specifies telegraf.conf as its config file. It will return this output:

This shows that telegraf can collect data from the system and is working correctly. Let’s get it set up now to point at our InfluxDB. To do this, we open our telegraf.conf file and go to the [[outputs.influxdb]] section, where we add this info:

    [[outputs.influxdb]]
    urls = ["http://10.210.239.186:8086"] 
    database = "telegraf"
    precision = "s"
    timeout = "10s"

This specifies the URL/port and database we want to send the data to. That’s the basic setup for telegraf; next up, I’ll get it working with our PowerShell script so we can send our WSUS metrics into InfluxDB.

    Using Telegraf with PowerShell

    As a prerequisite, we’ll need to install the PoshWSUS Module on our WSUS Server, which can be downloaded from here.

Once this is installed, we can download our WSUS PowerShell script. The link to the script can be found here. If we look at the script, it’s going to do the following:

    • Get a count of all machines per OS Version
    • Get the number of updates pending for the WSUS Server
    • Get a count of machines that need updates, have failed updates, or need a reboot
    • Return all of the above data to the telegraf data collector agent, which will send it to the InfluxDB.

Before doing any integration with Telegraf, modify the script to your needs using PowerShell ISE (on line 26, you need to specify the FQDN of your own WSUS Server), and then run the script to make sure it returns the data you expect. The result will look something like this:

    This tells us that the script works. Now we can integrate the script into our telegraf.conf file. Underneath the “Inputs” section of the file, add the following lines:

    ####################################################################
    # INPUTS #
    ####################################################################
    [[inputs.exec]]
    commands = ["powershell C:/temp/wsus-stats.ps1"]
    name_override = "wsusstats"
    interval = "300s"
    timeout = "300s"
    data_format = "influx"

    This is telling our telegraf.exe service to call PowerShell to run our script at an interval of 300 seconds, and return the data in “influx” format.

Once we save the changes, we can test our telegraf.conf file again to see if it returns the data from the PowerShell script as well as the default Windows metrics. Again, we run:

    .\telegraf.exe --config telegraf.conf --test

    And this time, we should see the WSUS results as well as the Windows Metrics:

And we do! At this point, we can start the Telegraf service that we installed earlier by running this command:

    net start telegraf

    Now that we have this done, lets get back into Grafana and see if we can get some of this data to show in the Dashboard!

    Configuring Dashboards

    In the last post, we imported our blank dashboard using our json file.

    Now that we have our Telegraf Agent and PowerShell script working and sending data back to InfluxDB, we can now start configuring the panels on our dashboard to show some data.

    For each of the panels on our dashboard, clicking on the title at the top reveals a dropdown list of actions.

As you can see, there are a number of actions you can take (including removing a panel if you don’t need it), however we’re going to click on “Edit”. This brings us into a view where we can modify the properties of the Query, and also some Dashboard settings including the Title and the colors to show based on the data that is being returned:

The most important thing for us on this screen is the query.

    As you can see, in the “FROM” portion of the query, you can change the values for “host” to match the hostname of your server. Also, from the “SELECT” portion, you can change the field() to match the data that you need to have represented on your panel. If we take a look at this field and click, it brings us a dropdown:

    Remember where these values came from? These are the values that we defined in our PowerShell script above. When we select the value we want to display, we click “Apply” at the top right of the screen to save the value and return to the Main Dashboard:

And there’s our value displayed! Let’s take a look at one of the default Windows OS Metrics as well, such as CPU Usage. For this panel, you just need to select the “host” whose data you want to display:

And as we can see, it gets displayed:

    There’s a bit of work to do in order to get the dashboard to display all of the values on each panel, but eventually you’ll end up with something looking like this:

    As you can see, the data on the graph panels is timed (as this is a time series database), and you can adjust the times shown on the screen by using the time period selector at the top right of the Dashboard:

    The final thing I’ll show you is if you have multiple Dashboards that you are looking to display on a screen, Grafana can do this by using the “Playlists” option under Dashboards.

You can also create Alerts to go to multiple destinations such as Email, Teams, Discord, Slack, Hangouts, PagerDuty or a webhook.

    Conclusion

As you have seen over this post, Grafana is a powerful and useful tool for visualizing data. The reason for using it in conjunction with InfluxDB and Telegraf is that they have native support for Windows, which was what we needed to monitor.

    You can use multiple data sources (eg Prometheus, Zabbix) within the same Grafana instance depending on what data you want to visualize and display. The Grafana Dashboards site has thousands of community and official Dashboards for multiple systems such as AWS, Azure, Kubernetes etc.

While Grafana is a wonderful tool, it should be used as part of your wider monitoring infrastructure. Dashboards provide a great “birds-eye” view of the status of your infrastructure, but you should use them in conjunction with other tools and processes, such as using alerts to generate tickets or self-healing actions based on thresholds.

    Thanks again for reading, I hope you have enjoyed the series and I’ll see you on the next one!

    Monitoring with Grafana and InfluxDB using Docker Containers — Part 3: Datasource Configuration and Dashboard Installation

    This post originally appeared on Medium on May 5th 2021

    Welcome to Part 3 of my series on setting up Monitoring for your Infrastructure using Grafana and InfluxDB.

    Last time, we downloaded our Docker Images for Grafana and InfluxDB, created persistent storage for them to persist our data, and also configured our initial Influx Database that will hold all of our Data.

In Part 3, we’re going to set up InfluxDB as our Datasource for the data and metrics we’re going to use in Grafana. We’ll also download the JSON for our Dashboard from the Grafana Dashboards site and import it into our Grafana instance. This will finish off the groundwork of getting our Monitoring System built and ready for use.

    Configure your Data Source

    • Now we have our InfluxDB set up, we’re ready to configure it as a Data source in Grafana. So we log on to the Grafana console. Click the “Configuration” button (looks like a cog wheel) on the left hand panel, and select “Data Sources”
    • This is the main config screen for the Grafana Instance. Click on “Add data source”
    • Search for “influxdb”. Click on this and it will add it as a Data Source:
    • We are now in the screen for configuring our InfluxDB. We configure the following options:
    • Query Language — InfluxQL. (there is an option for “Flux”, however this is only used by InfluxDB versions newer than 1.8)
    • URL — this is the Address of our InfluxDB container instance. Don’t forget to specify the port as 8086.
    • Access — This will always be Server
    • Auth — No options needed here
    • Finally, we fill in our InfluxDB details:
    • Database — this is the name that we defined when setting up the database, in our case telegraf
    • User — this is our “johnboy” user
    • Password — This is the password
    • Click on “Save & Test”. This should give you a message saying that the Data source is working — this means you have a successful connection between Grafana and InfluxDB.
    • Great, so now we have a working connection between Grafana and InfluxDB
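If you ever want to sanity-check the InfluxDB side outside of Grafana, the 1.x API exposes a ping endpoint. A quick check from any shell, using the container IP and port from Part 2:

curl -i http://10.210.239.186:8086/ping    # a 204 No Content response means InfluxDB is up and reachable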

    Dashboards

We now have our Grafana instance and our InfluxDB ready. So now we need to get some data into our InfluxDB and use it in some Dashboards. The Grafana website (https://grafana.com/grafana/dashboards) has hundreds of official and community-built dashboards.

    As a reminder, the challenge here is to visualize WSUS … yes, I know WSUS. As in Windows Server Update Services. Sounds pretty boring doesn’t it? It’s not really though — the problem is that unless WSUS is integrated with the likes of SCCM, SCOM or some other 3rd party tools (all of which will incur Licensing Costs), it doesn’t really have a good way of reporting and visualizing its content in a Dashboard.

    • I’ll go to the Grafana Dashboards page and search for WSUS. We can also search by Data Source.
    • When we click into the first option, we can see that we can “Download JSON”
    • Once this is downloaded, lets go back to Grafana. Open Dashboards, and click “Import”:
    • Then we can click “Upload JSON File” and upload our downloaded json. We can also import directly from the Grafana website using the Dashboard ID, or else paste the JSON directly in:
    • Once the JSON is uploaded, you then get the screen below where you can rename the Dashboard, and specify what Data Source to use. Once this is done, click “Import”:
    • And now we have a Dashboard. But there’s no data! That’s the next step, we need to configure our WSUS Server to send data back to the InfluxDB.

    Next time …..

    Thanks again for reading! Next time will be the final part of our series, where we’ll install the Telegraf agent on our WSUS Server, use it to run a PowerShell script which will send data to our InfluxDB, and finally bring the data from InfluxDB into our Grafana Dashboard.

    Hope you enjoyed this post, until next time!!

    Monitoring with Grafana and InfluxDB using Docker Containers — Part 2: Docker Image Pull and Setup

    This post originally appeared on Medium on April 19th 2021

    Welcome to Part 2 of my series on setting up Monitoring for your Infrastructure using Grafana and InfluxDB.

Last week, as well as the series introduction, we started our Monitoring build with Part 1, where we created our Ubuntu Server to serve as a host for our Docker images. Onwards we now go to Part 2, where the fun really starts and we pull our images for Grafana and InfluxDB from Docker Hub, create persistent storage, and get them running.

Firstly, let’s get Grafana running!

    We’re going to start by going to the official Grafana Documentation (link here) which tells us that we need to create a persistent storage volume for our container. If we don’t do this, all of our data will be lost every time the container shuts down. So we run sudo docker volume create grafana-storage:

• That’s created, but where is it located? Run this command to find out: sudo find / -type d -name "grafana-storage"
• This tells us the location, which in this case is:

/var/snap/docker/common/var-lib-docker/volumes/grafana-storage

    • Now, we need to download the Grafana image from the docker hub. Run sudo docker search grafana to search for a list of Grafana images:
    • As we can see, there are a number of images available but we want to use the official one at the top of the list. So we run sudo docker pull grafana/grafana to pull the image:
    • This will take a few seconds to pull down. We run the sudo docker images command to confirm the image has downloaded:
• Now the image is downloaded and we have our storage volume ready to persist our data. It’s time to get our image running. Let’s run this command:

sudo docker run -d -p 3000:3000 --name=grafana -v grafana-storage:/var/lib/grafana grafana/grafana

• Wow, that’s a mouthful… let’s explain what the command is doing. We use “docker run -d” to start the container in the background. We then use “-p 3000:3000” to make the container available on port 3000 via the IP Address of the Ubuntu Host. We then use “-v” to mount the persistent grafana-storage volume we created into the container’s data directory (/var/lib/grafana), and finally we use “grafana/grafana” to specify the image we want to use.
• The IP of my Ubuntu Server is 10.210.239.186. Let’s see if we can browse to 10.210.239.186:3000…
    • Well hello there beautiful ….. the default username/password is admin/admin, and you will be prompted to change this at first login to something more secure.

    Now we need a Data Source!

    • Now that we have Grafana running, we need a Data Source to store the data that we are going to present via our Dashboard. There are many excellent data sources available, the question is which one to use. That can be answered by going to the Grafana Dashboards page, where you will find thousands of Official and Community built dashboards. By searching for the Dashboard you want to create, you’ll quickly see the compatible Data Source for your desired dashboard. So if you recall, we are trying to visualize WSUS Metrics, and if we search for WSUS, we find this:
• As you can see, InfluxDB is the most commonly used, so we’re going to use that. But what is this “InfluxDB” that I speak of?
    • InfluxDB is a “time series database”. The good people over at InfluxDB explain it a lot better than I will, but in summary a time series database is optimized for time-stamped data that can be tracked, monitored and sampled over time.
• I’m going to keep using Docker for hosting all elements of our monitoring solution. Let’s search for the InfluxDB image on Docker Hub by running sudo docker search influx:
    • Again, I’m going to use the official one, so run the sudo docker pull influxdb:1.8 command to pull the image. Note that I’m pulling the InfluxDB image with tag 1.8. Versions after 1.8 use a new DB Model which is not yet widely used:
• And to confirm, let’s run sudo docker images:
• At this point, I’m ready to run the image. But first, let’s create another persistent storage area on the host for the InfluxDB image, just like I did for the Grafana one. So we run sudo docker volume create influx18-storage:
• Again, let’s run the command to find it and get the exact location:
    • And this is what we need for our command to launch the container:

sudo docker run -d -p 8086:8086 --name=influxdb -v influx18-storage:/var/lib/influxdb influxdb:1.8

• We’re running InfluxDB on port 8086 as this is its default. So now, let’s check our 2 containers are running by running sudo docker ps:
    • OK great, so we have our 2 containers running. Now, we need to interact with the InfluxDB Container to create our database. So we run sudo docker exec -it 99ce /bin/bash:
• This gives us an interactive session (docker exec -it) with the container (we’ve used the container ID “99ce” from above to identify it) so we can configure it. Finally, we’ve asked for a bash session (/bin/bash) to run commands from. So now, let’s create our database and set up authentication. We run “influx” and set up our database and user authentication:
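For reference, the commands run inside the influx shell look something like this. The database and user names match what we use when configuring Grafana in Part 3; the password is a placeholder you should replace with your own:

influx
CREATE DATABASE telegraf
CREATE USER "johnboy" WITH PASSWORD 'SuperSecretPassword' WITH ALL PRIVILEGES
SHOW DATABASES
exit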

    Next time….

Great! So now that’s done, we need to configure InfluxDB as a Data Source for Grafana. You’ll have to wait for Part 3 for that! Thanks again for reading, and I hope to see you back next week where, as well as setting up our Data Source connection, we’ll set up our Dashboard in Grafana ready to receive data from our WSUS Server!

    Hope you enjoyed this post, until next time!!

    Monitoring with Grafana and InfluxDB using Docker Containers — Part 1: Set up Ubuntu Docker Host

    This post originally appeared on Medium on April 12th 2021

    Welcome to the first part of the series where I’ll show you how to set up Monitoring for your Infrastructure using Grafana and InfluxDB. Click here for the introduction to the series.

    I’m going to use Ubuntu Server 20.04 LTS as my Docker Host. For the purpose of this series, this will be installed as a VM on Hyper-V. There are a few things you need to know for the configuration:

    • Ubuntu can be installed as either a Gen1 or Gen2 VM on Hyper-V. For the purposes of this demo, I’ll be using Gen2.
    • Once the VM has been provisioned, you need to turn off Secure Boot, as shown here
    • Start the VM, and you will be prompted to start the install. Select “Install Ubuntu Server”:
    • The screen then goes black as it runs the integrity check of the ISO:
    • Select your language…..
    • …..and Keyboard layout:
    • Next, add your Network Information. You can also choose to “Continue without network” if you wish and set this up later in the Ubuntu OS:
    • You then get the option to enter a Proxy Address if you need to:
    • And then an Ubuntu Archive Mirror — this can be left as default:
• Next, we have the Guided Storage Configuration screen. You can choose to take up the entire disk as default, or else go for a custom storage layout. As a best practice, it’s better to keep your boot, swap, var and root filesystems on different partitions (an excellent description of the various options can be found here). So in this case, I’m going to pick “Custom storage layout”:
    • On the next screen, you need to create your volume groups for boot/swap/var/root. As shown below, I go for the following:
    • boot — 1GB — if Filesystems become very large (eg over 100GB), boot sometimes has problems seeing files on these larger drives.
    • swap — 2GB — this needs to be at least equal to the amount of RAM assigned. This is equivalent to the paging files on a Windows File System.
    • var — 40GB — /var contains kernel log files and also application log files.
    • root — whatever is left over, this should be minimum 8GB, 15GB or greater is recommended.
    • Once you have all of your options set up, select “Done”:
• Next, you get to the Profile setup screen where you set up your username and password:
    • Next, you are prompted to install OpenSSH to allow remote access.
    • Next, we get to choose to install additional “popular” software. In this case, I’m choosing to install docker as we will need it later to run our Grafana and InfluxDB container instances:
    • And finally, we’re installing!! Keep looking at the top where it will say “Install Complete”. You can then reboot.
• And we’re in!! As you can see, the system is telling us there are 23 updates that can be installed:
• So let’s run the command “sudo apt list --upgradable” and see what updates are available:
• All looks good, so let’s run the “sudo apt-get upgrade” command to upgrade all:
• The updates will complete, and this will also install Docker as we requested during the initial setup. Let’s check to make sure it’s there by running “sudo docker version”:

    Next Time ….

    Thanks for taking the time to read this post. I’d love to hear your thoughts on this, and I hope to see you back next week when we download the Grafana and InfluxDB Docker images and configure them to run on our host.

    Hope you enjoyed this post, until next time!!

    Monitoring with Grafana and InfluxDB using Docker Containers — Introduction

    This post originally appeared on Medium on April 12th 2021

    Welcome to a series where I’ll show you how to set up Monitoring for your Infrastructure using Grafana and InfluxDB.

    A little bit about Monitoring ….

Monitoring is one of the most important parts of any infrastructure setup, whether on-premises, hybrid or cloud based. Not only can it help with outages, performance and security, it’s also used to help design and scale your infrastructure.

Traditionally, monitoring systems comprise three components:

    • An agent to collect data from a source (this source can be an Operating System, Database, Application, Website or a piece of Hardware)
    • A central database to store the data collected by all of the agents
    • A website or application to visualize the data into a readable format

    In the example shown below, the components are:

    • Windows (Operating System, which is the Source)
    • Telegraf (Data Collection Agent)
    • InfluxDB (Time Series Database to store data sent by the Agent)
    • Grafana (System to visualize the data in the database)

    The Challenge

I was given a challenge to provide visualization for Microsoft Windows Server Update Services (WSUS). Anyone who uses this console knows that it hasn’t changed all that much since it was originally released way back in 2005, and the built-in reporting leaves a lot to be desired:

    Ugh …. there has to be a better way to do this …. And there is!!!

    How I’ll build it!

    To make things more interesting, I’m going to run Grafana and InfluxDB using Docker containers running on an Ubuntu Docker Host VM. Then we’re going to monitor a Microsoft WSUS Server with Telegraf Agent installed!

    During the series, I’ll be showing you how to build the system from scratch using these steps:

    Click on each of the links to go to the post — I’ll update the links as each post is released

    Next time ….

    Click here to go to the first step in building our Monitoring system, building our Ubuntu Docker Host

    Hope you enjoyed this post, until next time!!