Choosing between two data pipeline designs
Hello everyone.
I'd like to start by apologizing for the long text. If anyone manages to get to the end and has an opinion to share, or finds something that doesn't make sense, I would be grateful and happy to discuss further 😊
Currently, I am building a data pipeline at the company where I work to meet their reporting and analysis needs. I don't have much practical experience with data engineering yet. This is actually the first complex project I'm working on, and since it primarily involves Fabric, I thought I'd share it with you.
What I have decided so far is that the data, ready for consumption, will be available in silver and gold layers in Fabric lakehouses.
Where I am having difficulty is the initial part of the pipeline, where I need to choose between 2 approaches that both seem to make sense.
This initial part includes validating the data by checking constraints and data types, handling slowly changing dimension (SCD) cases, and performing retroactive updates of some tables (e.g., stock prices need to be adjusted backwards according to market events).
I've attached a diagram that I believe helps visualize this. Basically, there are 3 different data sources, with (i) and (ii) being accessed via API and (iii) consisting mostly of manually filled Excel tables with registration data that will later be used to build dimensions.
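To make the retroactive update concrete, here is a minimal sketch in plain Python. It assumes the market event is a stock split with a known ratio and date; the `Price` class, the `adjust_for_split` helper, and the sample values are all hypothetical and only illustrate the idea of rewriting history before the event.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Price:
    day: date
    close: float


def adjust_for_split(prices, split_day, ratio):
    """Return a new list where every close before split_day is divided by ratio."""
    return [
        Price(p.day, p.close / ratio) if p.day < split_day else p
        for p in prices
    ]


# Hypothetical history around a 2-for-1 split effective on 2024-01-04.
history = [
    Price(date(2024, 1, 2), 100.0),
    Price(date(2024, 1, 3), 104.0),
    Price(date(2024, 1, 4), 52.5),  # already quoted on the post-split basis
]

adjusted = adjust_for_split(history, split_day=date(2024, 1, 4), ratio=2.0)
for p in adjusted:
    print(p.day, p.close)
```

The function returns a new list instead of editing rows in place, so the unadjusted history stays available for auditing. In a real table this would become an update of all rows dated before the event.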
In option A, an initial ETL process would load the data into a PostgreSQL database. The main idea here is to have a very direct way to validate the data through constraints and column types. Handling SCD cases and retroactive updates also seems feasible there.
In option B, the data would be loaded into a Fabric bronze lakehouse and this whole initial stage would need to happen within Fabric. At first glance, this approach seems simpler and less costly because it requires neither an additional platform to hold the data nor an additional copy of the data. My concern is whether it will also be simple to perform the tasks I mentioned directly in Fabric.
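As a rough illustration of what "validation through constraints" looks like, the sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, only so that it runs without a server. The `stock_price` table, its columns, and the sample rows are hypothetical. PostgreSQL enforces column types strictly on its own, so the `typeof` check is needed only because SQLite is loosely typed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE stock_price (
        ticker     TEXT NOT NULL,
        trade_date TEXT NOT NULL,
        close      REAL NOT NULL
                   CHECK (typeof(close) IN ('integer', 'real') AND close > 0),
        PRIMARY KEY (ticker, trade_date)
    )
    """
)


def try_insert(row):
    """Insert one row; return None on success or the constraint error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO stock_price VALUES (?, ?, ?)", row)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


rows = [
    ("ABC", "2024-01-02", 100.0),  # valid
    ("ABC", "2024-01-03", -5.0),   # violates close > 0
    ("ABC", "2024-01-04", "n/a"),  # not a number
    ("ABC", "2024-01-05", None),   # violates NOT NULL
    ("ABC", "2024-01-02", 101.0),  # duplicate primary key
]

errors = [try_insert(r) for r in rows]
loaded = conn.execute("SELECT COUNT(*) FROM stock_price").fetchone()[0]
print(loaded, "row(s) loaded;", sum(e is not None for e in errors), "rejected")
```

The point of the design is that bad rows are rejected by the database itself at load time, so the validation rules live in one place, the table definition.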
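Whichever option is chosen, the SCD handling comes down to the same logic: close the current version of a changed record and append a new one. The sketch below shows that logic for SCD Type 2 in plain Python. It is not Fabric code, and the `apply_scd2` helper, the `valid_from`/`valid_to` columns, and the sample client records are all assumptions made for illustration.

```python
from datetime import date


def apply_scd2(dimension, incoming, key, tracked, today):
    """Return a new dimension with changed rows closed and new versions appended."""
    result = [dict(row) for row in dimension]  # copy so the input is not mutated
    current = {r[key]: r for r in result if r["valid_to"] is None}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # no tracked attribute changed, keep the current version
        if old is not None:
            old["valid_to"] = today  # close the outdated version
        result.append({**new, "valid_from": today, "valid_to": None})
    return result


# Hypothetical registration data, like the manually filled Excel tables.
dimension = [
    {"client_id": 1, "segment": "retail",
     "valid_from": date(2023, 1, 1), "valid_to": None},
]
incoming = [
    {"client_id": 1, "segment": "private"},  # changed attribute
    {"client_id": 2, "segment": "retail"},   # new client
]

updated = apply_scd2(dimension, incoming, key="client_id",
                     tracked=["segment"], today=date(2024, 1, 4))
for row in updated:
    print(row)
```

In a lakehouse the same steps would typically be expressed as a merge against the dimension table rather than a Python loop, but the close-and-append pattern is the same.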
Sérgio Barbosa