Hey there!
I've run into a bit of a conundrum in my data pipeline and could use some guidance from the community. Here's the scenario:
I've got three notebooks in three separate workspaces, each handling one stage of my data pipeline: the first copies the source data to the bronze layer, the second transforms the bronze data into silver, and the third normalizes the silver data into gold.
However, as the pipeline runs, each notebook fires up its own Spark session, so I'm consuming three sessions' worth of capacity and risk hitting session limits or being throttled.
After some research, I came across a potential solution: using `mssparkutils.notebook` to orchestrate the three notebooks sequentially from a single parent notebook, so the whole pipeline shares one Spark session. Roughly what I have in mind is sketched below.
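To make the question concrete, here's my minimal, untested sketch of that orchestrator. The notebook names and the timeout value are placeholders I made up, not the real items in my workspaces, and since my child notebooks live in different workspaces I suspect I may also need to pass a workspace identifier (I haven't confirmed the exact signature for that):

```python
from notebookutils import mssparkutils  # pre-imported in Fabric/Synapse; shown for clarity

timeout_seconds = 1800  # per-notebook timeout in seconds (placeholder value)

# Child notebooks run inside this parent notebook's existing Spark session,
# so only one session is started for the whole pipeline.
# All notebook names below are hypothetical.
bronze_result = mssparkutils.notebook.run("01_copy_to_bronze", timeout_seconds)
silver_result = mssparkutils.notebook.run("02_bronze_to_silver", timeout_seconds)
gold_result = mssparkutils.notebook.run("03_silver_to_gold", timeout_seconds)

# run() returns whatever the child notebook passes to mssparkutils.notebook.exit()
print(bronze_result, silver_result, gold_result)
```

Is this the right general shape, or am I misunderstanding how the session is shared?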
Now, here's where I need your help:
1. **Experience Sharing**: Have any of you implemented such a solution using `mssparkutils.notebook` or a similar approach? If so, I'd love to hear about your experiences. Any insights, tips, or pitfalls to watch out for would be immensely helpful.
2. **Example Showcase**: If you have a working example or code snippet demonstrating how to use `mssparkutils.notebook` to run multiple notebooks sequentially within a single Spark session, I'd greatly appreciate it if you could share it. Seeing a practical implementation would significantly aid my understanding.
3. **Limitations and Scaling**: I'm also curious about the limitations of this method. Are there scalability concerns or performance bottlenecks as more notebooks are added to the single shared session? I plan to expand this setup to cover notebooks for several source systems, so insights into scalability would be invaluable. I've sketched the `runMultiple` variant I found right after this list.
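For that scaling scenario, I came across `mssparkutils.notebook.runMultiple`, which appears to accept a DAG describing dependencies between notebooks and to run independent branches in parallel within the same session. Here's my untested sketch; the notebook names, timeouts, concurrency value, and even the exact DAG schema are assumptions on my part:

```python
from notebookutils import mssparkutils  # pre-imported in Fabric; shown for clarity

# Hypothetical DAG: bronze loads for two source systems run in parallel,
# then the shared silver and gold steps depend on them.
dag = {
    "activities": [
        {"name": "bronze_a", "path": "01a_copy_to_bronze", "timeoutPerCellInSeconds": 600},
        {"name": "bronze_b", "path": "01b_copy_to_bronze", "timeoutPerCellInSeconds": 600},
        {
            "name": "silver",
            "path": "02_bronze_to_silver",
            "dependencies": ["bronze_a", "bronze_b"],
        },
        {"name": "gold", "path": "03_silver_to_gold", "dependencies": ["silver"]},
    ],
    "concurrency": 2,  # how many branches may run at once (placeholder value)
}

results = mssparkutils.notebook.runMultiple(dag)
print(results)  # per-notebook exit values/statuses, as I understand it
```

If `runMultiple` isn't the right tool here, or the schema above is off, corrections are very welcome.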
Looking forward to your responses and thank you in advance for your support!