I have been getting nowhere with a problem in my recently created Fabric ETL, which loads from an on-prem source to a Fabric Data Warehouse, via a Fabric Lakehouse, using Data Factory and underlying SQL stored procedures. I wonder if the issue is something really obvious: a lack of understanding of how delta parquet works, or a bug in Fabric (which I doubt).
The ELT pattern is standard (I think).
- Use a table of tables to iterate over with a DF ForEach loop; for each row:
- First, copy data from the source to a Lakehouse delta table using a DF Copy Activity with the "Overwrite" option
- Then trigger a stored procedure which (a simplified T-SQL sketch with placeholder names follows this list):
- -->first checks whether the destination table exists,
- -->if it does not, uses CTAS to create it with a SELECT * from the delta table,
- -->if it does exist, first deletes all matching rows from the destination table, joining on the key column to the delta table,
- -->then inserts into the destination using SELECT * from the delta table.
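
For clarity, this is roughly what the stored procedure does. It is only a sketch; `dbo.DimCustomer` (warehouse destination), `MyLakehouse.dbo.DimCustomer` (staged delta table, read via three-part naming) and `CustomerKey` (the join key) are placeholder names, not my real objects:

```sql
-- Sketch of the load procedure; all object names below are placeholders.
CREATE OR ALTER PROCEDURE dbo.usp_Load_DimCustomer
AS
BEGIN
    IF OBJECT_ID('dbo.DimCustomer') IS NULL
    BEGIN
        -- First run: create the destination with CTAS from the staged delta table
        CREATE TABLE dbo.DimCustomer
        AS
        SELECT * FROM MyLakehouse.dbo.DimCustomer;
    END
    ELSE
    BEGIN
        -- Subsequent runs: delete rows that exist in the staged delta table...
        DELETE FROM dbo.DimCustomer
        WHERE EXISTS (
            SELECT 1
            FROM MyLakehouse.dbo.DimCustomer AS s
            WHERE s.CustomerKey = dbo.DimCustomer.CustomerKey
        );

        -- ...then insert the staged rows
        INSERT INTO dbo.DimCustomer
        SELECT * FROM MyLakehouse.dbo.DimCustomer;
    END
END;
```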
My problem is that this pattern is occasionally producing duplicate rows in my destination table.
What I have checked/tried
- Is it possible my on-prem source has duplicates? No (a placeholder duplicate-check query follows this list).
- Is the Copy Activity not truly replacing the delta table, leaving "ghost records" behind? To rule this out, I introduced a delete-table step before the load into the delta table.
- Is there latency between the rows being deleted from the destination table and the insert of the updated/new records from the delta table? I introduced a "wait" step and went up to 30 seconds, but felt that increasing it any further would massively undermine the ETL process, so I'm still not sure.
- I split the SP into two SPs: the first deletes the rows from the destination table (matching to the delta table); the second inserts the new/updated rows.
- Finally, I convinced myself there must be an issue with the DF Stored Procedure activity, so I executed the SP code via a Lookup activity instead.
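
For reference, the duplicate check on either side (source or destination) is along these lines; again, `dbo.DimCustomer` and `CustomerKey` are placeholder names:

```sql
-- Any key value returned here more than once is a duplicate.
SELECT CustomerKey, COUNT(*) AS row_count
FROM dbo.DimCustomer
GROUP BY CustomerKey
HAVING COUNT(*) > 1
ORDER BY row_count DESC;
```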
Nothing has worked.
I think I am missing something really obvious.
Can anyone help, or even point me to a resource that takes me step by step through setting up a similar ETL?
Any help/advice appreciated.