Metadata-Driven Development in Azure Data Factory

melbdataguy
5 min read · Mar 24, 2024


In the realm of data engineering, metadata-driven development is more than a buzzword: it is a practical way to cut repetitive work. By leveraging metadata to parameterise objects within Azure Data Factory (ADF), data engineers can significantly streamline development processes, saving valuable time and resources.

This article explores how embracing a metadata-driven approach can empower data engineers to optimise their workflows and reduce the number of objects they need to create within ADF.

The Challenge: Traditional Approaches to Data Handling in ADF

Azure Data Factory (ADF) provides several methods for managing data workflows. However, when faced with the task of tracking or replicating CDC (Change Data Capture) changes across numerous source-target pairs, the limitations of the traditional approaches become evident.

Picture a scenario where a data engineer must handle CDC changes for a number of source-target pairs. The conventional approach entails manually creating ADF datasets for each source and target, then meticulously mapping these pairs within the ADF mapping dataflow. This method is not only error-prone but also high-maintenance, demanding significant manual development time whenever the dataflows need to be modified or scaled.

The Solution: Metadata-Driven Development in ADF

The key to streamlining dataflows lies in parameterisation. By abstracting away the intricacies of individual source-target pairs, development becomes more agile and less error-prone. Here’s how to achieve it:

Assumptions:

  1. Source and target tables share identical column names, leveraging ADF’s auto-mapping feature.
  2. The join key or key columns used to track row changes must have consistent names across source and target tables.

Preparation:

To implement the solution, we need to build a metadata table with the following columns (a sample of its contents is sketched after the list):

  • SourceSchema: The schema of the source table.
  • SourceTable: The name of the source table.
  • TargetSchema: The schema of the target table.
  • TargetTable: The name of the target table.
  • JoinKey: The column(s) used to join the source and target tables, tracking row changes.
  • CheckpointKey: A field for maintaining checkpoint information (further explanation provided later).
Sample metadata table
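
To make this concrete, here is a sketch of what a couple of metadata rows might look like once they are read by a Lookup activity later in the pipeline. All schema, table, and column names below are purely illustrative:

```json
[
  {
    "SourceSchema": "cdc_source",
    "SourceTable": "Customer",
    "TargetSchema": "dbo",
    "TargetTable": "Customer",
    "JoinKey": "CustomerId",
    "CheckpointKey": "cdc_customer_checkpoint"
  },
  {
    "SourceSchema": "cdc_source",
    "SourceTable": "OrderLine",
    "TargetSchema": "dbo",
    "TargetTable": "OrderLine",
    "JoinKey": "OrderId,LineNumber",
    "CheckpointKey": "cdc_orderline_checkpoint"
  }
]
```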

Implementation:

Step 1: Create Parameterised Datasets
Begin by creating parameterised datasets for both the source and target tables. These datasets should allow for the schema and table names to be specified as parameters.

Example: Parameterised Dataset for Source
Setting Parameters for Source Dataset

Do the same for the target dataset as well.
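
For reference, the JSON behind such a parameterised dataset might look roughly like the following. Treat it as a sketch only: it assumes a SQL Server linked service named SourceSqlServer, and the exact dataset type and property names should be taken from the JSON that the ADF UI generates for your connector.

```json
{
  "name": "Source_Parameterised",
  "properties": {
    "linkedServiceName": {
      "referenceName": "SourceSqlServer",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "SchemaName": { "type": "string" },
      "TableName": { "type": "string" }
    },
    "type": "SqlServerTable",
    "typeProperties": {
      "schema": { "value": "@dataset().SchemaName", "type": "Expression" },
      "table": { "value": "@dataset().TableName", "type": "Expression" }
    }
  }
}
```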

Step 2: Develop Parameterised Dataflow
Utilise the parameterised datasets to develop a parameterised dataflow within ADF.

Setting JoinKey as Parameter for the Dataflow

Ensure that the JoinKey, which is crucial for tracking row changes between source and target tables, is also defined as a parameter of the dataflow. For tables with multiple join keys, the value of this parameter is expected to be a comma-delimited string (for example, OrderId,LineNumber).

Selecting the Source_Parameterised dataset as the source
Using native CDC within mapping data flow

Within the Source options, tick Change data capture and specify “SQL Server CDC” as the type. For this exercise, we go with “Incremental changes only” as the run mode.

Selecting the Target_Parameterised Dataset as the target
Sink settings

Lastly, in the sink settings, we enable all update methods for comprehensive change handling. Because the key columns expect a list, we supply them through a custom expression, split($JoinKey, ','), which turns the comma-delimited JoinKey parameter into the list of key columns the sink expects. This keeps a single dataflow working for tables with one or many join keys.

Step 3: Construct Pipeline with Metadata Lookup
Build the pipeline in ADF, incorporating a Lookup activity to retrieve metadata from the designated metadata table. This metadata describes the source and target tables, including the schemas, table names, join keys, and checkpoint keys.

Use the ForEach activity to iterate through the results of the metadata lookup, passing the values dynamically to the parameterised dataflow.

Pipeline Flow
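
In pipeline JSON, the skeleton of these two activities could look roughly like this. Activity, dataset, and table names are illustrative, and the query assumes the metadata lives in a hypothetical dbo.CdcMetadata table:

```json
{
  "name": "Lookup Metadata",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT * FROM dbo.CdcMetadata"
    },
    "dataset": {
      "referenceName": "Metadata_Dataset",
      "type": "DatasetReference"
    },
    "firstRowOnly": false
  }
},
{
  "name": "ForEach Table Pair",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "Lookup Metadata", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('Lookup Metadata').output.value",
      "type": "Expression"
    },
    "activities": []
  }
}
```

Note that firstRowOnly is set to false on the Lookup so that every metadata row, not just the first one, is handed to the ForEach activity.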

Within the ForEach activity, we add our parameterised dataflow as an activity. Here, we supply the parameters of the source and target datasets with values taken from the current item() being passed through the loop.

Passing in the parameters for both Source and Target datasets
Passing in JoinKey parameter for the dataflow
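
Expressed as JSON, the dataflow activity inside the ForEach could be wired up roughly as below. Again, this is a sketch only: the stream names (source1, sink1), the dataflow name, and the exact shape of the dataflow and datasetParameters properties should be verified against the JSON your own pipeline generates.

```json
{
  "name": "Run CDC Dataflow",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataflow": {
      "referenceName": "CDC_Parameterised_Dataflow",
      "type": "DataFlowReference",
      "parameters": {
        "JoinKey": { "value": "'@{item().JoinKey}'", "type": "Expression" }
      },
      "datasetParameters": {
        "source1": {
          "SchemaName": "@item().SourceSchema",
          "TableName": "@item().SourceTable"
        },
        "sink1": {
          "SchemaName": "@item().TargetSchema",
          "TableName": "@item().TargetTable"
        }
      }
    }
  }
}
```

The single quotes around @{item().JoinKey} matter: dataflow string parameters are evaluated as data flow expressions, so the pipeline value is passed in as a quoted string literal.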

Additionally, we encounter a field provided by ADF in the mapping dataflow activity settings: the checkpoint key. The checkpoint key is ADF's mechanism for tracking deltas, effectively a watermark for the tables processed by the dataflow. In our scenario, however, using the same checkpoint key for every table in the loop won't work, because each iteration would then read and update the same watermark.

To address this, we override the checkpoint key by ticking the “Override” option and passing the checkpoint key from our metadata table instead — accessed through item().CheckpointKey. This ensures that each table processed within the loop maintains its individual watermark/checkpoint key, facilitating accurate data tracking and synchronisation.

Overriding ADF checkpoint key by our own
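
In the activity JSON, the override typically shows up as a continuation-settings style property on the dataflow activity, along the lines of the fragment below. The property names here are an assumption and may vary between ADF versions, so check the JSON that ADF produces after ticking Override:

```json
{
  "name": "Run CDC Dataflow",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "continuationSettings": {
      "customizedCheckpointKey": {
        "value": "@item().CheckpointKey",
        "type": "Expression"
      }
    }
  }
}
```

Each iteration then records its progress under its own key rather than sharing a single watermark.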

And that’s it! By following these steps, we’ve successfully implemented metadata-driven development in ADF, significantly reducing development overhead and enhancing scalability. With the majority of our development now focused on populating the metadata table, we’ve streamlined the process and ensured adaptability to changing data requirements.

Embracing this approach not only optimises efficiency but also lays the groundwork for a more agile and responsive data ecosystem. Ready to revolutionise your data workflows? Let metadata be your guide!

This article has delved into the power of metadata-driven approaches, empowering data engineers to optimise workflows and minimise the creation of redundant objects within ADF. Subscribe now for more insights and strategies.
