Monitoring batch workloads with Application Insights

Kenny Saelen

Introduction

In Dynamics 365 Finance & Supply Chain Management we have the ability to use the Monitoring and Telemetry feature to send application telemetry to Microsoft Application Insights. If you are new to this feature, please visit the following documentation link to get an overview of the capabilities and how to get started.
Monitoring and Telemetry with Microsoft Dynamics 365 Finance and Microsoft Dynamics 365 Supply Chain Management | Microsoft Learn

Every organization that uses Dynamics 365 Finance & Supply Chain Management today has critical workloads running in the Batch framework. Telemetry on the execution of these batch jobs is essential to the operations teams that administer and monitor these workloads. Starting from version 10.0.45 (7.0.7690.21 PU69), we are adding the capability to monitor batch workloads via Application Insights. We are also providing a backport for version 10.0.44 (7.0.7606.126 PU68).

Enable batch telemetry

To get started with batch monitoring, the following flights need to be enabled:

BatchTelemetryConfigurationFlight
BatchThreadInfoTelemetryFlight
BatchTelemetryCallstackFlight

On developer machines you can add the flights to the SysFlighting table. Please note that this is currently in private preview. You will only be able to test this in sandboxes when this enters public preview. The current expected timeline for Public Preview is September 2025, but this is subject to change based on the outcomes of the Private Preview.

Once the flighting configuration is in place, new parameters will be visible on the Monitoring and Telemetry parameters form in the Configure tab page.

Querying the batch telemetry

If you are new to using Application Insights please first have a look at the following article to get you started with KQL queries to review telemetry: Analyze and monitor telemetry with KQL - Finance & Operations | Dynamics 365 | Microsoft Learn .

We have currently added the following signals for you to query. All of the queries below are already packaged into a sample dashboard, which is available through our GitHub repository. More details on downloading and using the sample dashboard can be found later in this post in the ‘FastTrack sample accelerator dashboard for batch’ section.
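The queries in this post reference `_startTime`, `_endTime` and `_batchJobId`, which are presumably supplied as parameters by the sample dashboard. To run a query directly in the Application Insights Logs blade, these names need to be defined first. The following is a minimal sketch under that assumption; the one-day window, the null job filter and the event count are illustrative choices, not part of the dashboard.

```kusto
// Assumed stand-ins for the dashboard parameters used by the queries in this post.
let _startTime = ago(1d);          // start of the time window (illustrative)
let _endTime = now();              // end of the time window (illustrative)
let _batchJobId = dynamic(null);   // null = no job filter; or a list such as dynamic(["<batch job id>"])
customEvents
| where timestamp between (_startTime .. _endTime)
| where name startswith "Batch"    // the batch signals described in this post all start with "Batch"
| extend BatchJobId = tostring(customDimensions.BatchJobId)
| where isempty(_batchJobId) or BatchJobId in (_batchJobId)
| summarize Events = count() by name
| order by Events desc
```

Running this first is a quick way to check which batch signals are arriving in your environment before moving on to the more detailed queries.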

Batch start and stop times

Provides information about the start and completion times of batch jobs. This is a crucial part of batch telemetry because it makes it possible to calculate how long batch jobs take to complete. The following query captures the start times for batch jobs, matches them with the completion times based on the same ActivityId, and calculates the elapsed time.

customEvents
| where timestamp between (_startTime .. _endTime)
| where name in ("BatchTaskStart","BatchTaskFinished","BatchTaskFailure")
| extend CustomDimensionsParsed = parse_json(customDimensions)
| extend InfoMessageParsed = parse_json(tostring(CustomDimensionsParsed.InfoMessage))
| extend ActivityId = tostring(CustomDimensionsParsed.activityId)
| extend ClassName = tostring(CustomDimensionsParsed.ClassName)
| extend BatchJobId = tostring(CustomDimensionsParsed.BatchJobId)
| extend BatchJobTaskId1 = tostring(CustomDimensionsParsed.BatchJobTaskId)
| extend BatchJobTaskId2 = tostring(CustomDimensionsParsed.BatchTaskId)
| extend BatchJobTaskId = iif(isnotempty(BatchJobTaskId1), BatchJobTaskId1, BatchJobTaskId2)
| extend StartTime = iff(name == "BatchTaskStart" , timestamp, datetime(null))
| extend EndTime = iff(name == "BatchTaskFinished", timestamp, datetime(null))
| extend ErrorTime = iff(name == "BatchTaskFailure" , timestamp, datetime(null))
| extend RetryCount = iff(name == "BatchTaskStart" , tostring(InfoMessageParsed.RetryCount), "")
| project StartTime, EndTime,ErrorTime,RetryCount, ActivityId, ClassName, BatchJobId, RoleInstance = cloud_RoleInstance, BatchJobTaskId
| where isempty(_batchJobId) or tostring(BatchJobId) in (_batchJobId)
| summarize StartTime = min(StartTime),
    CompletionTime = max(EndTime),
    ErrorTime = max(ErrorTime),
    RetryCount = take_any(RetryCount),
    ClassName = take_any(ClassName),
    BatchJobId = take_any(BatchJobId),
    BatchJobTaskId = take_any(BatchJobTaskId),
    RoleInstance = take_any(RoleInstance)
    by ActivityId
| where isnotempty(StartTime)
| extend BatchAOS = tostring(split(RoleInstance, "-")[0])
| project ActivityId, BatchAOS, BatchJobId, BatchJobTaskId, ClassName, StartTime, CompletionTime, ElapsedTime = CompletionTime - StartTime, RetryCount
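Once start and completion times are paired per ActivityId, the durations can be aggregated further. The following is a sketch of one possible follow-up, not part of the sample dashboard: it summarizes run counts and durations per batch class. The time window and the column names `ElapsedSeconds`, `AvgSeconds` and `P95Seconds` are illustrative assumptions.

```kusto
// Illustrative time window; in the dashboard these would come from its parameters.
let _startTime = ago(1d);
let _endTime = now();
customEvents
| where timestamp between (_startTime .. _endTime)
| where name in ("BatchTaskStart", "BatchTaskFinished")
| extend ActivityId = tostring(customDimensions.activityId),
         ClassName = tostring(customDimensions.ClassName)
// Pair the start and finish events that share an ActivityId.
| summarize StartTime = minif(timestamp, name == "BatchTaskStart"),
            CompletionTime = maxif(timestamp, name == "BatchTaskFinished"),
            ClassName = take_any(ClassName)
    by ActivityId
// Keep only tasks that both started and finished inside the window.
| where isnotnull(StartTime) and isnotnull(CompletionTime)
| extend ElapsedSeconds = (CompletionTime - StartTime) / 1s
| summarize Runs = count(),
            AvgSeconds = avg(ElapsedSeconds),
            P95Seconds = percentile(ElapsedSeconds, 95)
    by ClassName
| order by P95Seconds desc
```

Tasks that failed or were still running at the end of the window have no completion time and are left out of the statistics.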

Batch throttling

Provides information about throttling for batch workloads. This enables customers to troubleshoot whether batch jobs were throttled and what the system metrics (CPU, memory, SQL DTU) were during throttling. The following query can be used to get the running tasks and throttled tasks over time.

let _scale = '10m';
customEvents
| where timestamp between (_startTime .. _endTime)
| where name in ("BatchThrottled", "BatchTaskStart")
| extend customDimensionsParsed = parse_json(customDimensions)
| extend batchJobId = customDimensionsParsed.BatchJobId
| where isempty(_batchJobId) or tostring(batchJobId) in (_batchJobId)
| summarize ThrottledTasks = countif(name == "BatchThrottled"), RunningTasks = countif(name == "BatchTaskStart") by TimeSum=bin(timestamp, totimespan(_scale))
| project TimeSum, RunningTasks, ThrottledTasks
| order by TimeSum asc

Batch threads

Provides information about the currently running threads. This allows customers to identify whether a batch job did not start because of a lack of available threads on the batch AOS instances. As an example, let’s query the thread information to understand how many threads are available to process batch workloads.

customEvents
| where     timestamp between (_startTime .. _endTime)
| where     name == "BatchThreadInfo"
| extend    customDimensionsParsed  = parse_json(customDimensions)
| extend    infoMessageParsed       = parse_json(tostring(customDimensionsParsed.InfoMessage))
| extend    CurrentBatchTasks       = infoMessageParsed.CurrentBatchTasks
| extend    TaskQueueCount          = infoMessageParsed.TaskQueueCount
| extend    MaxThreadCount          = infoMessageParsed.MaxThreadCount
| extend    ReservedNumberOfThreads = infoMessageParsed.ReservedNumberOfThreads
| project   timestamp, Batch = tostring(split(cloud_RoleInstance, "-")[0]), AvailableThreads = todecimal(MaxThreadCount) - todecimal(CurrentBatchTasks) - todecimal(ReservedNumberOfThreads)

Rendering this on a timeline gives a clear view of when there are no available threads for processing.
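One way to produce such a timeline outside the dashboard is sketched below. It assumes the same BatchThreadInfo fields as the query above; the five-minute bin size, the one-day window and the `MinAvailableThreads` column name are illustrative choices.

```kusto
// Illustrative time window; in the dashboard these would come from its parameters.
let _startTime = ago(1d);
let _endTime = now();
customEvents
| where timestamp between (_startTime .. _endTime)
| where name == "BatchThreadInfo"
| extend info = parse_json(tostring(customDimensions.InfoMessage))
| extend AvailableThreads = todecimal(info.MaxThreadCount)
                          - todecimal(info.CurrentBatchTasks)
                          - todecimal(info.ReservedNumberOfThreads)
| extend Batch = tostring(split(cloud_RoleInstance, "-")[0])
// Take the lowest value per bin so short periods of thread starvation stay visible.
| summarize MinAvailableThreads = min(AvailableThreads) by bin(timestamp, 5m), Batch
| render timechart
```

Using the minimum per bin rather than the average is a deliberate choice here: an average can hide brief intervals in which no threads were available.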

Batch failures

Provides additional information when a batch job or task cannot be scheduled correctly. This comes on top of the existing error information from the Infolog, which is correlated to the originating batch job. For example, to get a list of batch failures including call stack information, run the following query:

customEvents
| where timestamp between (_startTime .. _endTime)
| where name in ("BatchTaskFailure")
| extend CustomDimensionsParsed = parse_json(customDimensions)
| extend BatchJobCaption = tostring(CustomDimensionsParsed.BatchJobCaption)
| extend ActivityId = tostring(CustomDimensionsParsed.activityId)
| extend ClassName = iif((tostring(CustomDimensionsParsed.ClassName) == "<empty>"), "", tostring(CustomDimensionsParsed.ClassName))
| extend BatchJobId = tostring(CustomDimensionsParsed.BatchJobId)
| extend BatchTaskId = tostring(CustomDimensionsParsed.BatchTaskId)
| extend EventMessage = tostring(CustomDimensionsParsed.EventMessage)
| extend ExceptionType = tostring(CustomDimensionsParsed.ExceptionType)
| extend ExceptionMessage = tostring(CustomDimensionsParsed.ExceptionMessage)
| extend CallStack = tostring(CustomDimensionsParsed.CallStack)
| where isempty(_batchJobId) or tostring(BatchJobId) in (_batchJobId)
| project TimeStamp = timestamp, RoleInstance = cloud_RoleInstance, ActivityId, BatchJobCaption, ClassName, BatchJobId, BatchTaskId, EventMessage, ExceptionType, ExceptionMessage, CallStack

Batch queue

Provides information about the current queue sizes for different queues in the Priority Based Scheduling framework. Use the following query to give a timeline overview of the queue sizes.

customEvents
| where timestamp between (_startTime .. _endTime)
| where name == "BatchPBSQueuesAndBuffersSizes"
| extend CustomDimensionsParsed = parse_json(customDimensions)
| extend InfoMessageParsed = parse_json(tostring(CustomDimensionsParsed.InfoMessage))
| extend BatchLowSchedulingQueue = toint(InfoMessageParsed.BatchLowSchedulingQueue)
| extend BatchNormalSchedulingQueue = toint(InfoMessageParsed.BatchNormalSchedulingQueue)
| extend BatchHighSchedulingQueue = toint(InfoMessageParsed.BatchHighSchedulingQueue)
| extend BatchCriticalSchedulingQueue = toint(InfoMessageParsed.BatchCriticalSchedulingQueue)
| extend BatchReservedCapacitySchedulingQueue = toint(InfoMessageParsed.BatchReservedCapacitySchedulingQueue)
| extend ReadyTasksBuffer = toint(InfoMessageParsed.ReadyTasksBuffer)
| extend ReadyTasksBufferWithPriorities = toint(InfoMessageParsed.ReadyTasksBufferWithPriorities)
| project timestamp, BatchLowSchedulingQueue, BatchNormalSchedulingQueue, BatchHighSchedulingQueue, BatchCriticalSchedulingQueue, BatchReservedCapacitySchedulingQueue, ReadyTasksBuffer, ReadyTasksBufferWithPriorities

FastTrack sample accelerator dashboard for batch

With the FastTrack team, we have created an accelerator dashboard that you can download and use immediately in your test environments.

How to import the sample dashboard in Azure Data Explorer

Always import the dashboard into a non-production environment first, validate that the visualizations and queries align with your organization's data model and monitoring requirements, and only then promote it to your production Application Insights workspace.

Download the latest batch dashboard release from GitHub. Go to the starting page of the Dynamics 365 FastTrack FSCM Telemetry Samples repository. On the right side of the main page, you can find the release to download.
On the release page, you can find the assets and select the D365FSCM-Monitoring-Dashboard-Batch-v1.0.0.0.zip archive to download and extract.

Open the release zip file and locate the ADE-Dashboard-D365FO-Monitoring-Batch.json file in the package.
Import the file in Azure Data Explorer.

Name the dashboard appropriately and then click to select Datasources.
In the Datasources selection pane, enter your Azure Application Insights subscription ID in the placeholder.
After entering the correct subscription ID, click Connect.
You will get a list of databases. Select your Application Insights name from the list and save the changes.
Your dashboard should have data now. Feel free to edit the queries to suit your needs.

Please don’t hesitate to share your feedback and ideas for the dashboard evolution using the post comments or by contacting us at D365AppInsights@microsoft.com.
In addition to the dashboard discussed in this blog post, there are several other sample dashboards available on the GitHub repository.
/**
* SAMPLE CODE NOTICE
*
* THIS SAMPLE CODE IS MADE AVAILABLE AS IS. MICROSOFT MAKES NO WARRANTIES, WHETHER EXPRESS OR IMPLIED,
* OF FITNESS FOR A PARTICULAR PURPOSE, OF ACCURACY OR COMPLETENESS OF RESPONSES, OF RESULTS, OR CONDITIONS OF MERCHANTABILITY.
* THE ENTIRE RISK OF THE USE OR THE RESULTS FROM THE USE OF THIS SAMPLE CODE REMAINS WITH THE USER.
* NO TECHNICAL SUPPORT IS PROVIDED. YOU MAY NOT DISTRIBUTE THIS CODE UNLESS YOU HAVE A LICENSE AGREEMENT WITH MICROSOFT THAT ALLOWS YOU TO DO SO.
*/

Comments

GS-30090853-0

I'm not an expert on KQL, but I have fixed the query in your example. It didn't run on my environments.
See here:

customEvents
| where timestamp between (ago(1d) .. now())
| where name in ("BatchTaskStart", "BatchTaskFinished", "BatchTaskFailure")
| extend CustomDimensionsParsed = parse_json(customDimensions)
| extend InfoMessageParsed   = parse_json(tostring(CustomDimensionsParsed.InfoMessage))
| extend ActivityId          = tostring(CustomDimensionsParsed.activityId)
| extend ClassName           = tostring(CustomDimensionsParsed.ClassName)
| extend BatchJobId          = tostring(CustomDimensionsParsed.BatchJobId)
| extend BatchJobTaskId1     = tostring(CustomDimensionsParsed.BatchJobTaskId)
| extend BatchJobTaskId2     = tostring(CustomDimensionsParsed.BatchTaskId)
| extend BatchJobTaskId      = iif(isnotempty(BatchJobTaskId1), BatchJobTaskId1, BatchJobTaskId2)
| extend StartTime = iff(name == "BatchTaskStart", timestamp, datetime(null))
| extend EndTime    = iff(name == "BatchTaskFinished", timestamp, datetime(null))
| extend ErrorTime = iff(name == "BatchTaskFailure", timestamp, datetime(null))
| extend RetryCount = iff(name == "BatchTaskStart", tostring(InfoMessageParsed.RetryCount), "")
| project StartTime, EndTime, ErrorTime, RetryCount, ActivityId, ClassName, BatchJobId, RoleInstance = cloud_RoleInstance, BatchJobTaskId
| where isnotempty(BatchJobId)
| summarize
    StartTime      = min(StartTime),
    CompletionTime = max(EndTime),
    ErrorTime      = max(ErrorTime),
    RetryCount     = take_any(RetryCount),
    ClassName      = any(ClassName),
    BatchJobId     = any(BatchJobId),
    BatchJobTaskId = any(BatchJobTaskId),
    RoleInstance   = any(RoleInstance)
by ActivityId
| where isnotempty(StartTime)
| extend BatchAOS = tostring(split(RoleInstance, "-")[0])
| project ActivityId, BatchAOS, BatchJobId, BatchJobTaskId, ClassName, StartTime, CompletionTime, ElapsedTime = CompletionTime - StartTime, RetryCount
