Efficient Data Loading Strategy in a Lakehouse
Hello everyone,
I’m working with a large dataset spanning 2000 to 2024, stored as CSV files.
The data from 2000 to 2023 is static and will not change, while the data for 2024 will be updated daily.
I’m considering whether to:
  • Create two separate tables in the Bronze layer, one for the static historical data (2000-2023) and one for the 2024 data, and then merge them in the Silver layer; or
  • Convert the historical data to Parquet/Delta once, and load only the 2024 CSV for the daily updates (rough sketch of what I have in mind below).
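Here’s roughly what I’m picturing for the second option, as PySpark in a Fabric notebook (where `spark` is predefined). The paths, the table name `bronze_sales`, and the `event_date` column are made-up placeholders, so treat this as a sketch rather than a tested implementation:

```python
from pyspark.sql import functions as F

# One-time backfill: convert the static 2000-2023 CSVs to a Delta table,
# partitioned by year so daily refreshes can target a single slice.
# Paths, table name, and the event_date column are placeholders.
hist_df = (
    spark.read.option("header", "true").option("inferSchema", "true")
    .csv("Files/raw/history_2000_2023/*.csv")
    .withColumn("year", F.year(F.col("event_date")))
)
(
    hist_df.write.format("delta")
    .mode("overwrite")
    .partitionBy("year")
    .saveAsTable("bronze_sales")
)

# Daily job: read only the 2024 CSVs and rewrite just that partition,
# leaving the 2000-2023 files untouched.
daily_df = (
    spark.read.option("header", "true").option("inferSchema", "true")
    .csv("Files/raw/2024/*.csv")
    .withColumn("year", F.year(F.col("event_date")))
)
(
    daily_df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "year = 2024")  # standard Delta: overwrite only the 2024 slice
    .saveAsTable("bronze_sales")
)
```

My thinking is that partitioning by year would let the daily job rewrite only the 2024 slice via `replaceWhere`, so the historical files are never re-read or rewritten.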
Additionally, I’m using a Microsoft Fabric notebook for this task and wondering about best practices for optimization. Should I rely on the automatic optimization features, or schedule OPTIMIZE commands manually?
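For reference, this is what I assume manual maintenance would look like, again using the placeholder table `bronze_sales`. `OPTIMIZE` and `VACUUM` are standard Delta Lake commands; the `VORDER` clause and the `optimizeWrite` config key are ones I’ve seen in the Fabric docs, but please correct me if I have the syntax wrong:

```python
# Compact the small files produced by daily loads into larger ones.
spark.sql("OPTIMIZE bronze_sales")

# Fabric-specific V-Order rewrite for faster downstream reads
# (assumption: VORDER clause as documented for Fabric lakehouses).
spark.sql("OPTIMIZE bronze_sales VORDER")

# Remove files no longer referenced, keeping the default 7 days of history.
spark.sql("VACUUM bronze_sales RETAIN 168 HOURS")

# Or lean on automatic behavior instead: enable optimized writes at the
# session level (assumption: Fabric-specific config key) so daily loads
# produce fewer, larger files up front.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
```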
Any insights or experiences with similar data loading and optimization strategies would be greatly appreciated!
Thanks!