Overview
This is the second blog article from the Dynamics 365 Customer Insights Data – Inbound Data Batch Integration Patterns series.
Please make sure to go through the first blog post of the series, as it provides essential concepts and semantics about the Customer Insights - Data (CI-D) data processing architecture that we will reuse throughout this post.
In this second blog post, we’ll focus on the Azure Data Lake Common Data Model Tables Integration Pattern and will cover:
- Introduction to the Integration pattern
- Decision factors for leveraging this pattern
- Example walkthrough of the decision factors
- Reference and sample architectures
- Prerequisites and capabilities
To learn more about this integration pattern's best practices and optimizations, please consult the second part of this blog article.
Many customers have already invested in Azure solutions to support their Digital Transformation journey, especially on Data and Analytics.
Those customers may already have enterprise-grade solutions to support their Enterprise Data Warehouse, Data Lake, and data integration requirements.
Furthermore, such a customer may have set up dedicated IT structures, such as Analytics or Integration Centers of Excellence, along with highly skilled platform support teams, and will most likely look to leverage those investments to the fullest extent.
In this situation, the “Attached Lake” integration pattern category, where the customer provisions and manages its own Source Lake, can help maximize those investments while benefiting from existing expertise and data integration industrialization capabilities for the CI-D implementation.
In addition, the decision factors evaluation may leave no other choice than adopting one of the “Attached Lake” patterns, typically when the envisioned data volumes and wrangling complexity are high.
The “Attached Lake” integration pattern category includes the following patterns:
- Azure Data Lake - Common Data Model Tables
- Azure Data Lake - Delta Tables
- Azure Synapse Analytics
This blog post focuses on the Azure Data Lake Common Data Model Tables pattern, where the customer relies on an Azure Data Lake Storage Gen2 (ADLS Gen2) account as the Source Lake hosting the data files to be ingested into CI-D.
CI-D requires the Source Lake data provided through ADLS Gen2 to conform to the Common Data Model (CDM) framework. Ensuring this conformity will have to be handled (to some extent, or at least understood) as part of the custom-built data pipeline when choosing the “Attached Lake” route.
To learn more about working with CDM folders in an Attached Lake, please read the dedicated section in this blog post.
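To make the CDM conformity requirement more concrete, one common way to describe a CDM folder is a model.json file sitting at the root of the folder, next to the entity data files. The sketch below builds a minimal, illustrative model.json for a single hypothetical SalesOrder entity; the entity, attribute, and partition names are assumptions, and the authoritative schema and supported data types are described in the CDM and CI-D documentation referenced in this post.

```python
import json

# Minimal, illustrative model.json for a CDM folder in the Source Lake.
# Entity, attribute, and partition names are hypothetical; refer to the CDM
# metadata documentation for the full schema and supported data types.
model = {
    "name": "NaturalBlendsSourceLake",
    "version": "1.0",
    "entities": [
        {
            "$type": "LocalEntity",
            "name": "SalesOrder",
            "attributes": [
                {"name": "OrderId", "dataType": "string"},
                {"name": "CustomerId", "dataType": "string"},
                {"name": "OrderDate", "dataType": "dateTime"},
                {"name": "TotalAmount", "dataType": "decimal"},
            ],
            "partitions": [
                {
                    "name": "SalesOrder-part-001",
                    "location": "https://<account>.dfs.core.windows.net/<container>/SalesOrder/SalesOrder-part-001.csv",
                }
            ],
        }
    ],
}

# model.json sits at the root of the CDM folder, next to the entity data files.
print(json.dumps(model, indent=2))
```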
As part of this blog post series, we’re proposing 5 decision factors to be detailed and assessed when evaluating Integration Patterns.
Those decision factors are presented in descending order of priority for the decision process.
The decision factors are described in the first blog post from this series.
2.1 Data Sources Types, Volumes and Wrangling complexity
Integration Pattern fits when: expected data volumes and transformation complexity are medium to high, and/or the identified data sources raise particular concerns (in terms of connectivity, data type conversion complexity, etc.).
Description
Relying on enterprise-grade integration solutions and ADLS Gen2 allows ingesting data at scale while managing transformation complexity through extended data wrangling capabilities.
The Attached Lake pattern requires storing data in the Source Lake in CDM format, using CDM entity references.
Some solutions, such as Azure Data Factory and Azure Synapse, support both data transformation and mapping based on CDM sources and writing data to CDM entities in the Source Lake, as illustrated in the sketch below. They also provide a wide variety of connectors.
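As a sketch of that last point, the snippet below shows how a Synapse Spark notebook could write a transformed DataFrame as a CDM entity in the Source Lake using the open-source spark-cdm-connector. The storage account, container, entity, and input path are hypothetical, and the exact option names may differ between connector versions, so treat this as an illustration rather than a definitive implementation.

```python
from pyspark.sql import SparkSession

# In a Synapse notebook the Spark session already exists; getOrCreate() is a no-op there.
spark = SparkSession.builder.getOrCreate()

# Hypothetical staging data previously landed in the lake by the ingestion pipeline.
orders = spark.read.parquet(
    "abfss://staging@naturalblendslake.dfs.core.windows.net/pos/orders/"
)

# Write the transformed data as a CDM entity in the Source Lake.
# Options are illustrative, following the spark-cdm-connector documentation;
# authentication is assumed to go through the workspace managed identity.
(
    orders.write.format("com.microsoft.cdm")
    .option("storage", "naturalblendslake.dfs.core.windows.net")        # Source Lake account
    .option("manifestPath", "ci-source/cdm/default.manifest.cdm.json")  # container + manifest path
    .option("entity", "SalesOrder")                                     # CDM entity to create/append
    .option("format", "csv")                                            # partition file format
    .mode("append")
    .save()
)
```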
2.2 Existing Integration Solutions / Integration IT Teams
Integration Pattern fits when: pre-existing integration capabilities are available in the customer landscape, and the integration teams have enough availability to support the CI-D project implementation and are committed to handling the data integration tasks during run mode.
Description
The overall solution architecture should benefit from the customer's existing integration solutions and the associated integration teams' expertise, when they exist, thus ensuring that the CI-D implementation can rely on proven industrialization and support processes.
If they do not exist in the customer context, but the other decision factors, such as “Data Sources Types, Volumes and Wrangling complexity”, are pushing toward this pattern, the minimal capabilities needed to support it can be quickly put in place through two Azure services: Azure Data Factory and Azure Data Lake Storage Gen2.
Implementing those capabilities requires the customer to have an existing Azure subscription and to be supported by a skilled partner for implementation and administration.
2.3 Existing Azure assets
Integration Pattern fits when: the customer has a medium to high level of pre-existing Azure services and/or Azure knowledge.
Description
Many customers have already invested in Azure solutions, so it is not uncommon for Azure Data Lake and/or Azure Data Factory to already be available within the customer's ecosystem.
In this situation, synergies should be sought by leveraging those services (and the associated customer IT teams) to support this pattern.
Yet care must be taken to ensure that those synergies are worked out in agreement with the customer and have limited impact on the customer's operating model.
2.4 Cost of Ownership
Integration Pattern fits when: the customer understands the CI-D implementation project's challenges associated with their particular context, and has agreed that those challenges should be addressed through a consistent solution architecture that will ensure a stable and scalable implementation of CI-D, even though this implementation could induce higher build and run costs than the “Microsoft Power Query” Integration Pattern.
Description
Achieving the Azure Data Lake Common Data Model Tables pattern implies proper management of the lake, meaning that resources must have dedicated time for its maintenance, support, and operations to avoid turning it into a data swamp.
It also requires data flows to be built and operated in an industrialized manner.
2.5 Time to market
Integration Pattern fits when: the customer has understood the CI-D implementation project's challenges associated with their particular context, and understands that the implementation must include additional work items to achieve a consistent CI-D solution architecture.
Description
The CI-D implementation will depend on the integration teams' availability and exposure to the implementation project. Most likely, other IT teams (Azure admins, IS security, customer data lake admins, etc.) will also need to get involved in the project at some point.
This pattern demands stronger project management skills but can support a fast and agile delivery of the CI-D solution if properly managed.
3.1 Context
This walkthrough is based on a fictitious customer implementation of CI-D.
“Natural Blends” (the customer) is the market-leading vendor of organic coffee & tea blends in Western Europe, satisfying millions of customers.
The customer sells its products through e-retailers, operates its own points of sale in Western Europe, and also started operating its own e-commerce website during the COVID pandemic.
The customer is looking to increase its global revenue by building omnichannel engagement capabilities, improving customer satisfaction through increased personalization, and building a trusted relationship with its customers.
The customer has already invested in Azure cloud services, and its Data & Analytics Center of Excellence has started implementing a new cloud Enterprise Data Warehouse (leveraging Azure Data Factory, Azure Data Lake Storage Gen2, Azure Synapse and Power BI), moving away from an on-premises “legacy” EDW (Enterprise Data Warehouse) based on SQL Server and SSAS.
Its primary objectives are:
- Break its data silos and create a unified view of its B2C customers through the consolidation of its multiple data sources, which will support its omnichannel engagement strategy and provide its business users with advanced customer analytics.
- Improve its POS experience by empowering in-store advisors with an app that helps them gain exhaustive knowledge of the customers (online and offline transaction history, brand affinity, etc.) and promote personalization.
3.2 Data Sources
Four main Data Sources were identified:
- POS: based on Cegid Retail; 5 M B2C customer profiles; around 100 K sales orders daily
- ERP: recently transitioned from SAP ECC to S/4HANA on Azure; around 5 K daily sell-in transactions with e-retailers
- E-commerce: based on Salesforce Commerce Cloud (SFCC); 10 M registered B2C customer profiles; around 125 K e-commerce orders daily
- Sell-Out data: provided monthly, typically as CSV / plain-text files, by the customer's biggest e-retailers and sell-out third-party providers. Sell-Out data induces integration challenges and heavy transformation steps, as each source file may differ from the others in structure, content and granularity.
3.3 Decision Factors assessment
| Decision Factor | Assessment |
| --- | --- |
| Data Sources Types, Volumes and Wrangling complexity | Data sources are of medium to high complexity. No native Power Query connectivity is available for Cegid. Relying on existing connectivity for SFCC and SAP HANA would raise concerns about working around existing limits: for Salesforce, see the existing limits of both the SF Reports API and the SF Objects API; SAP HANA does not natively expose the ABAP CDS views provided by the S/4 business content. Additionally, Sell-Out data is a clear challenge, as it is provided through numerous and heavy flat files, none of them sharing the same data structure and granularity. Data volume is medium to high: 15 M (unmatched) customer profiles, 225 K daily order lines (online + offline). Wrangling complexity is high, particularly due to the conflation of the Sell-Out sources. |
| Existing Integration Solutions / Integration IT Teams | The customer has a Data & Analytics Center of Excellence which is currently building data pipelines to its cloud EDW. Most likely, the data sources identified for the CI-D implementation will be required by the EDW at some point, and synergies could be found. |
| Existing Azure assets | The customer has already invested in Azure services that can be leveraged for its CI-D implementation project: Azure Data Factory for data transformation pipelines and ADLS Gen2 for the Source Lake. |
| Cost of Ownership | As the data source ingestion must happen as part of the customer's EDW modernization initiative, only a limited markup may be added for the CI-D-specific data preparation steps. |
| Time to market | The CI-D implementation project duration must take into account the dependency on the Data & Analytics CoE availability. |
4.1 Batch Data Processing Architecture – Data Integration with Azure Data Lake Storage
4.2 Implementation example
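No single snippet can capture the full reference architecture, but as a minimal illustration of the last step of such a pipeline, the sketch below uploads an entity data file and its model.json to the Source Lake using the azure-storage-file-datalake SDK. The account, container, and file names are hypothetical, and authentication is assumed to go through DefaultAzureCredential; in the reference architecture this step would typically be performed by Azure Data Factory or Synapse rather than custom code.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical Source Lake account and container; replace with your own values.
ACCOUNT_URL = "https://naturalblendslake.dfs.core.windows.net"
CONTAINER = "ci-source"

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
filesystem = service.get_file_system_client(CONTAINER)

# Upload a CSV partition for the SalesOrder entity into its entity folder.
with open("SalesOrder-part-001.csv", "rb") as data:
    filesystem.get_file_client("SalesOrder/SalesOrder-part-001.csv").upload_data(
        data, overwrite=True
    )

# Upload the model.json describing the entities and partitions of this CDM folder.
with open("model.json", "rb") as data:
    filesystem.get_file_client("model.json").upload_data(data, overwrite=True)
```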
5.1 Pattern prerequisites
All prerequisites are provided in the CI-D public documentation: Connect to Common Data Model tables in Azure Data Lake Storage - Dynamics 365 Customer Insights | Microsoft Learn
5.2 Incremental Data
The Azure Data Lake Common Data Model Tables pattern supports leveraging incremental data sets that are brought to your lake. To learn how to implement incremental ingestion for this pattern, please consult our public documentation.
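The exact folder and file naming conventions for incremental ingestion are defined in the public documentation linked above. Purely as an illustration of the idea, the sketch below shows how a pipeline could route a daily extract either to a full-data folder or to a date-partitioned incremental folder; the folder names used here are hypothetical and must be aligned with the documented layout before building a real pipeline.

```python
from datetime import date

def target_folder(entity: str, extract_date: date, full_load: bool) -> str:
    """Return an illustrative Source Lake folder for a daily extract.

    Folder names are hypothetical; align them with the layout required by the
    CI-D incremental ingestion documentation before implementing a real pipeline.
    """
    if full_load:
        return f"{entity}/FullData/"
    return (
        f"{entity}/IncrementalData/"
        f"{extract_date:%Y}/{extract_date:%m}/{extract_date:%d}/"
    )

# Example: route the 2024-03-15 incremental extract of the SalesOrder entity.
print(target_folder("SalesOrder", date(2024, 3, 15), full_load=False))
# -> SalesOrder/IncrementalData/2024/03/15/
```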
5.3 Accessing firewalled ADLS Gen2 Storage Account
Customer Insights - Data allows you to connect to a storage account that is protected by a virtual network and not exposed to public networks, by setting up a private link to this storage account.
To access a protected / firewalled storage account, please refer to this documentation.