Overview
This is the second blog article from the Dynamics 365 Customer Insights Data – Inbound Data Batch Integration Patterns series.
Please make sure to go through the first blog post of the series, as it provides essential concepts and semantics about the Customer Insights - Data (CI-D) data processing architecture that we will reuse throughout this post.
In this second blog post, we’ll focus on the Azure Data Lake Common Data Model Tables Integration Pattern and will cover:
- Introduction to the Integration pattern
- Decision factors for leveraging this pattern
- Example walkthrough of the decision factors
- Reference and sample architectures
- Prerequisites and capabilities
To learn more about this integration pattern's best practices and optimizations, please consult the second part of this blog article.
Many customers have already invested in Azure solutions to support their Digital Transformation journey, especially on Data and Analytics.
Those customers may already have enterprise-grade solutions to support their Enterprise Data Warehouse, Data Lake, and data integration requirements.
Furthermore, such a customer may have set up dedicated IT structures, such as Analytics or Integration Centers of Excellence, along with highly skilled platform support teams, and will most likely look to leverage those investments to the fullest extent.
In this situation, the “Attached Lake” integration pattern category, where the customer provisions and manages its own Source Lake, can help maximize those investments while benefiting from existing expertise and data integration industrialization capabilities for the CI-D implementation.
In addition, the decision factors evaluation may leave no other choice than adopting one of the “Attached Lake” patterns, typically when the envisioned data volumes and wrangling complexity are high.
The “Attached Lake” integration pattern category includes the following patterns:
- Azure Data Lake - Common Data Model Tables
- Azure Data Lake - Delta Tables
- Azure Synapse Analytics
This blog post focuses on the Azure Data Lake Common Data Model Tables pattern, where the customer relies on an Azure Data Lake Storage Gen2 (ADLS Gen2) account as the Source Lake hosting the data files to be ingested into CI-D.
CI-D requires the Source Lake data provided through ADLS Gen2 to conform to the Common Data Model (CDM) framework. Ensuring this conformity will have to be handled (to some extent, or at least understood) as part of the custom-built data pipeline when choosing the “Attached Lake” route.
To learn more about working with CDM folders in an Attached Lake, please read the dedicated section in this blog post.
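To make the CDM conformity requirement more concrete, one common way to describe a CDM folder is a model.json file sitting at the root of the folder, next to the entity data files. The sketch below builds a minimal, illustrative model.json for a single hypothetical SalesOrder entity; the entity, attribute, and partition names are assumptions, and the authoritative schema and supported data types are described in the CDM and CI-D documentation referenced in this post.

```python
import json

# Minimal, illustrative model.json for a CDM folder in the Source Lake.
# Entity, attribute, and partition names are hypothetical; refer to the CDM
# metadata documentation for the full schema and supported data types.
model = {
    "name": "NaturalBlendsSourceLake",
    "version": "1.0",
    "entities": [
        {
            "$type": "LocalEntity",
            "name": "SalesOrder",
            "attributes": [
                {"name": "OrderId", "dataType": "string"},
                {"name": "CustomerId", "dataType": "string"},
                {"name": "OrderDate", "dataType": "dateTime"},
                {"name": "TotalAmount", "dataType": "decimal"},
            ],
            "partitions": [
                {
                    "name": "SalesOrder-part-001",
                    "location": "https://<account>.dfs.core.windows.net/<container>/SalesOrder/SalesOrder-part-001.csv",
                }
            ],
        }
    ],
}

# model.json sits at the root of the CDM folder, next to the entity data files.
print(json.dumps(model, indent=2))
```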
As part of this blog post series, we’re proposing 5 decision factors to be detailed and assessed when evaluating Integration Patterns.
Those decision factors are presented in descending order of priority for the decision process.
The decision factors are described in the first blog post from this series.
2.1 Data Sources Types, Volumes and Wrangling complexity
Integration Pattern fits when: expected data volumes and transformation complexity are medium to high, and/or the identified data sources raise particular concerns (in terms of connectivity, data type conversion complexity, etc.).
Description
Relying on enterprise-grade integration solutions and ADLS Gen2 allows ingesting data at scale while managing transformation complexity through extended data wrangling capabilities.
The Attached Lake pattern requires storing data in the Source Lake in CDM format, using CDM entity references.
Some solutions, such as Azure Data Factory and Azure Synapse, support both data transformation and mapping based on CDM sources and writing data to CDM entities in the Source Lake, as illustrated in the sketch below. They also provide a wide variety of connectors.
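As a sketch of that last point, the snippet below shows how a Synapse Spark notebook could write a transformed DataFrame as a CDM entity in the Source Lake using the open-source spark-cdm-connector. The storage account, container, entity, and input path are hypothetical, and the exact option names may differ between connector versions, so treat this as an illustration rather than a definitive implementation.

```python
from pyspark.sql import SparkSession

# In a Synapse notebook the Spark session already exists; getOrCreate() is a no-op there.
spark = SparkSession.builder.getOrCreate()

# Hypothetical staging data previously landed in the lake by the ingestion pipeline.
orders = spark.read.parquet(
    "abfss://staging@naturalblendslake.dfs.core.windows.net/pos/orders/"
)

# Write the transformed data as a CDM entity in the Source Lake.
# Options are illustrative, following the spark-cdm-connector documentation;
# authentication is assumed to go through the workspace managed identity.
(
    orders.write.format("com.microsoft.cdm")
    .option("storage", "naturalblendslake.dfs.core.windows.net")        # Source Lake account
    .option("manifestPath", "ci-source/cdm/default.manifest.cdm.json")  # container + manifest path
    .option("entity", "SalesOrder")                                     # CDM entity to create/append
    .option("format", "csv")                                            # partition file format
    .mode("append")
    .save()
)
```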
2.2 Existing Integration Solutions / Integration IT Teams
Integration Pattern fits when: pre-existing integration capabilities are available in the customer landscape, and the integration teams have enough availability to support the CI-D project implementation and are committed to handling the data integration tasks during run mode.
Description
The overall solution architecture should benefit from the customer's existing integration solutions and the associated integration teams' expertise, when they exist, thus ensuring that the CI-D implementation can rely on proven industrialization and support processes.
If they do not exist in the customer context, but the other decision factors, such as “Data Sources Types, Volumes and Wrangling complexity”, are pushing toward this pattern, the minimal capabilities needed to support it can be quickly put in place through two Azure services: Azure Data Factory and Azure Data Lake Storage Gen2.
Implementing those capabilities requires the customer to have an existing Azure subscription and to be supported by a skilled partner for implementation and administration.
2.3 Existing Azure assets
Integration Pattern fits when: the customer has a medium to high level of pre-existing Azure services and/or Azure knowledge.
Description
Many customers have already invested in Azure solutions, so it is not uncommon for Azure Data Lake and/or Azure Data Factory to already be available within the customer's ecosystem.
In this situation, synergies should be sought by leveraging those services (and the associated customer IT teams) to support this pattern.
Yet care must be taken to ensure that those synergies are worked out in agreement with the customer and have limited impact on the customer's operating model.
2.4 Cost of Ownership
Integration Pattern fits when: the customer understands the CI-D implementation project's challenges associated with their particular context, and has agreed that those challenges should be addressed through a consistent solution architecture that will ensure a stable and scalable implementation of CI-D, even though this implementation could induce higher build and run costs than the “Microsoft Power Query” Integration Pattern.
Description
Achieving the Azure Data Lake Common Data Model Tables pattern implies proper management of the lake, meaning that resources must have dedicated time for its maintenance, support, and operations to avoid turning it into a data swamp.
It also requires data flows to be built and operated in an industrialized manner.
2.5 Time to market
Integration Pattern fits when: the customer has understood the CI-D implementation project's challenges associated with their particular context, and understands that the implementation must include additional work items to achieve a consistent CI-D solution architecture.
Description
The CI-D implementation will depend on the integration teams' availability and exposure to the implementation project. Most likely, other IT teams (Azure admins, IS security, customer data lake admins, etc.) will also need to get involved in the project at some point.
This pattern demands stronger project management skills but can support a fast and agile delivery of the CI-D solution if properly managed.
3.1 Context
This walkthrough is based on a fictitious customer implementation of CI-D.
“Natural Blends” (the customer) is the market-leading vendor of organic coffee & tea blends in Western Europe, satisfying millions of customers.
The customer sells its products through e-retailers, operates its own points of sale in Western Europe, and also started operating its own e-commerce website during the COVID pandemic.
The customer is looking to increase its global revenue by building omnichannel engagement capabilities, improving customer satisfaction through increased personalization, and building a trusted relationship with its customers.
The customer has already invested in Azure cloud services, and its Data & Analytics Center of Excellence has started implementing a new cloud Enterprise Data Warehouse (leveraging Azure Data Factory, Azure Data Lake Storage Gen2, Azure Synapse and Power BI), moving away from an on-premises “legacy” EDW (Enterprise Data Warehouse) based on SQL Server and SSAS.
Its primary objectives are:
- Break its data silos and create a unified view of its B2C customers through the consolidation of its multiple data sources, which will support its omnichannel engagement strategy and provide its business users with advanced customer analytics.
- Improve its POS experience by empowering in-store advisors with an app that helps them gain exhaustive knowledge of the customers (online and offline transaction history, brand affinity, etc.) and promote personalization.
3.2 Data Sources
Four main Data Sources were identified:
- POS: based on Cegid Retail; 5 M B2C customer profiles; around 100 K sales orders daily
- ERP: recently transitioned from SAP ECC to S/4HANA on Azure; around 5 K daily sell-in transactions with e-retailers
- E-commerce: based on Salesforce Commerce Cloud (SFCC); 10 M registered B2C customer profiles; around 125 K e-commerce orders daily
- Sell-Out data: provided monthly, typically as CSV / plain-text files, by the customer's biggest e-retailers and sell-out third-party providers. Sell-Out data induces integration challenges and heavy transformation steps, as each source file may differ from the others in structure, content and granularity.
3.3 Decision Factors assessment
| Decision Factor | Assessment |
| --- | --- |
| Data Sources Types, Volumes and Wrangling complexity | Data sources are of medium to high complexity. No native Power Query connectivity is available for Cegid. Relying on existing connectivity for SFCC and SAP HANA would raise concerns about working around existing limits: for Salesforce, see the existing limits of both the SF Reports API and the SF Objects API; SAP HANA does not natively expose the ABAP CDS views provided by the S/4 business content. Additionally, Sell-Out data is a clear challenge, as it is provided through numerous and heavy flat files, none of them sharing the same data structure and granularity. Data volume is medium to high: 15 M (unmatched) customer profiles, 225 K daily order lines (online + offline). Wrangling complexity is high, particularly due to the conflation of the Sell-Out sources. |
| Existing Integration Solutions / Integration IT Teams | The customer has a Data & Analytics Center of Excellence which is currently building data pipelines to its cloud EDW. Most likely, the data sources identified for the CI-D implementation will be required by the EDW at some point, and synergies could be found. |
| Existing Azure assets | The customer has already invested in Azure services that can be leveraged for its CI-D implementation project: Azure Data Factory for data transformation pipelines and ADLS Gen2 for the Source Lake. |
| Cost of Ownership | As the data source ingestion must happen as part of the customer's EDW modernization initiative, only a limited markup may be added for the CI-D-specific data preparation steps. |
| Time to market | The CI-D implementation project duration must take into account the dependency on the Data & Analytics CoE availability. |
4.1 Batch Data Processing Architecture – Data Integration with Azure Data Lake Storage
4.2 Implementation example
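No single snippet can capture the full reference architecture, but as a minimal illustration of the last step of such a pipeline, the sketch below uploads an entity data file and its model.json to the Source Lake using the azure-storage-file-datalake SDK. The account, container, and file names are hypothetical, and authentication is assumed to go through DefaultAzureCredential; in the reference architecture this step would typically be performed by Azure Data Factory or Synapse rather than custom code.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical Source Lake account and container; replace with your own values.
ACCOUNT_URL = "https://naturalblendslake.dfs.core.windows.net"
CONTAINER = "ci-source"

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
filesystem = service.get_file_system_client(CONTAINER)

# Upload a CSV partition for the SalesOrder entity into its entity folder.
with open("SalesOrder-part-001.csv", "rb") as data:
    filesystem.get_file_client("SalesOrder/SalesOrder-part-001.csv").upload_data(
        data, overwrite=True
    )

# Upload the model.json describing the entities and partitions of this CDM folder.
with open("model.json", "rb") as data:
    filesystem.get_file_client("model.json").upload_data(data, overwrite=True)
```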
5.1 Pattern prerequisites
All prerequisites are provided in the CI-D public documentation: Connect to Common Data Model tables in Azure Data Lake Storage - Dynamics 365 Customer Insights | Microsoft Learn
5.2 Incremental Data
The Azure Data Lake Common Data Model Tables pattern supports leveraging incremental data sets that are brought to your lake. To learn how to implement incremental ingestion for this pattern, please consult our public documentation.
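The exact folder and file naming conventions for incremental ingestion are defined in the public documentation linked above. Purely as an illustration of the idea, the sketch below shows how a pipeline could route a daily extract either to a full-data folder or to a date-partitioned incremental folder; the folder names used here are hypothetical and must be aligned with the documented layout before building a real pipeline.

```python
from datetime import date

def target_folder(entity: str, extract_date: date, full_load: bool) -> str:
    """Return an illustrative Source Lake folder for a daily extract.

    Folder names are hypothetical; align them with the layout required by the
    CI-D incremental ingestion documentation before implementing a real pipeline.
    """
    if full_load:
        return f"{entity}/FullData/"
    return (
        f"{entity}/IncrementalData/"
        f"{extract_date:%Y}/{extract_date:%m}/{extract_date:%d}/"
    )

# Example: route the 2024-03-15 incremental extract of the SalesOrder entity.
print(target_folder("SalesOrder", date(2024, 3, 15), full_load=False))
# -> SalesOrder/IncrementalData/2024/03/15/
```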
5.3 Accessing firewalled ADLS Gen2 Storage Account
Customer Insights - Data allows you to connect to a storage account that is protected by a virtual network and not exposed to public networks, by setting up a private link to this storage account.
To access a protected / firewalled storage account, please refer to this documentation.