Benchmark: Dataflow Gen2 vs Notebook vs T-SQL
Today, I decided to run a small benchmark between these 3 tools.

Benchmark setup
- 2 data volumes: 100k and 5M rows
- 3 scenarios: simple, intermediate (adding columns, filters, etc.), and advanced (joins, complex calculated columns, window functions, indexes...) - a PySpark sketch of the advanced scenario is at the end of the post
- 3 tools compared: Dataflow Gen2, Spark notebook, and T-SQL in the warehouse

The detailed results are shown in the attached chart.

Key takeaways

1) T-SQL is the clear winner
Across all scenarios, it is either the fastest or very close to the top. Even for complex transformations (window functions, aggregations, joins), the Fabric SQL engine is extremely efficient.

2) Dataflow Gen2: very sensitive to volume
100k rows: 35-41 s, which is acceptable. 5M rows: 39-105 s, where execution time explodes.

3) Spark notebook: stable but never first
Execution time is fairly consistent (~40-53 s), but it doesn't dominate the advanced 5M-row scenario as I expected. It is also penalized by ~10 s of cluster startup.

4) Scale completely changes the picture
At 100k rows, everything looks "fast" (3-41 s). At 5M rows, the gap between tools grows from roughly 1x to 6x.

5) CU consumption: similar
Across the tests, all three options sit in a similar range of about 9-11 CU for these workloads. The big difference: if you don't stop the Spark notebook, it keeps consuming CUs at the same level even when it's just "waiting", whereas T-SQL and Dataflow Gen2 stop consuming once the workload is done (a snippet showing the session stop is at the end of the post).

My personal verdict
- Strong SQL team → T-SQL. Fastest, most stable, and very competitive even for complex workloads.
- Need low-code / a Power Query team → Dataflow Gen2. Great tool, but you really need to watch Query Folding carefully.
- Need predictable scalability / data engineering use cases → Spark notebook. Stable performance, ideal when you want to avoid surprises.

Have any of you already run this kind of benchmark? What were your results?

PS: if you want to support me, do not hesitate to share or like my LinkedIn post ;-)
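
For context, here is roughly what the "advanced" scenario looks like on the notebook side. This is a minimal PySpark sketch with hypothetical table and column names (sales, customers, amount, discount, etc.), not the exact code I ran; spark is the session a Fabric notebook provides out of the box.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical tables and columns, for illustration only.
# "spark" is the session pre-created by the Fabric notebook runtime.
sales = spark.read.table("sales")          # fact table (~5M rows in the big run)
customers = spark.read.table("customers")  # dimension table

# Join + calculated column + window function + aggregation:
# the mix of operations grouped under the "advanced" scenario.
w = Window.partitionBy("customer_id").orderBy("order_date")

result = (
    sales.join(customers, "customer_id", "inner")
         .withColumn("net_amount", F.col("amount") * (1 - F.col("discount")))
         .withColumn("running_total", F.sum("net_amount").over(w))
         .groupBy("country")
         .agg(
             F.sum("net_amount").alias("total_net"),
             F.countDistinct("customer_id").alias("customers"),
         )
)

# Writing the result forces execution, so the measured time reflects
# the full transformation and not just Spark's lazy planning.
result.write.mode("overwrite").saveAsTable("benchmark_advanced_output")
```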
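
On timing and the CU point in takeaway 5: I measured wall-clock time inside the notebook and stopped the session as soon as the run finished. A sketch of both, continuing from the code above and assuming the mssparkutils utilities that Fabric notebooks pre-load (adjust the stop call to whatever your runtime version exposes):

```python
import time

# Wall-clock timing around the write that triggers execution; the ~10 s
# cluster startup happens before this cell and is not captured here.
start = time.perf_counter()
result.write.mode("overwrite").saveAsTable("benchmark_advanced_output")
print(f"Advanced scenario: {time.perf_counter() - start:.1f} s")

# Stop the session once the run is done, otherwise the notebook keeps
# consuming CUs while idle. Assumption: mssparkutils is available in
# the notebook environment, as it is in Fabric / Synapse runtimes.
mssparkutils.session.stop()
```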