Delta load - Duplicate records
I have been getting nowhere with a problem in my recently created Fabric ETL from an on prem to Fabric Data Warehouse, via a Fabric Data Lake, using Data Factory, and underlying SQL Stored Procedures. I wonder if the issue is something really obvious; a lack of understanding of how delta parquet works, or a bug in Fabric (which I doubt).
The ELT pattern is standard (I think).
  1. Use table of tables to iterate over using DF fore each loop; for each row
  2. First copy data from source to Lakehouse delta table using DF Copy Activity with "Overwrite" option
  3. Then trigger a stored procedure which
  4. -->first check if destination table exists,
  5. -->if it does not it uses CTAS to create with a select * from Delta,
  6. -->if it does exist, it first deletes all matching rows from destination tables joining on key column to delta table
  7. -->Then inserts into destination using select * from delta.
My problem is this pattern is occasionally causing duplicate rows in my detonation table.
What I have checked/tried
  1. Is it possible my on-prem source has duplicates - NO
  2. Is the Copy Activity not truly replacing the delta table, and therefore there are "ghost records"; to ensure this, I introduced a delete table step before the load into the delta
  3. Is it a latency between the data being deleted from the destination table, and the insert of the updated/new records from the delta - I introduced a "wait" step went up to 30 seconds, but felt if I had to increase this by more it would massively undermine the ETL process so still not sure.
  4. Split the SP into 2 SPs, the first would delete the rows from destination table (matching to delta); the second would insert the new/updated rows.
  5. Finally I convinced myself there must be an issue with the DF Stored Procedure activity; so executed the SP code via a Lookup activity.
Nothing has worked.
I think I am missing something really obvious.
Can anyone help; even point me to a resource that takes me through the step-by-step of setting up a similar ETL?
Any help/advice appreciated.
1
8 comments
John OCallaghan
3
Delta load - Duplicate records
Learn Microsoft Fabric
skool.com/microsoft-fabric
Helping passionate analysts, data engineers, data scientists (& more) to advance their careers on the Microsoft Fabric platform.
Leaderboard (30-day)
Powered by