Sorry in advance for the long post! I saw Vinayak's post in this category and thought it might be the best place for this as well.
The image attached is a draft workspace architecture proposal for an organisation, and I would appreciate any and all feedback. Of particular interest are potential issues or problems you might see with the design. Much of this is just an aggregation of ideas found in various forums and videos (including many of Will's), and I believe it is fairly generic.
I know a definitive architecture really isn't possible without business context, so I've listed below some of my reasoning for certain design decisions. It would be accompanied by a description of each component when proposed, hence the numerical values in the image (I won't add that here). I've also attached a simplified version showing PROD only.
Please throw your thoughts at me!
Why a DEV capacity?
· Company policy
· Separation of workload in case of engineering errors while developing
Why a workspace per source?
· Separation in case of failure; flexible, modular and scalable
· Security, lack of schema definitions in Lakehouses
· A single source workspace may cover multiple systems where they are related
Why Bronze and Silver in separate Lakehouses in the same workspace?
· Security, lack of schema definitions in Lakehouses
· No pipeline invocation across workspaces, so this allows orchestration to Silver layer
· Some workspaces may also contain a Landing layer, depending on source
· At initial ingestion, data will be landed and classified (data governance - PII, SPI) prior to moving to Silver
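As a rough illustration of the per-source layout described above, here is a minimal Python sketch that enumerates the engineering workspaces and their Lakehouses. The source names (`crm`, `erp`), the environment names and the naming convention (`ws_<source>_<env>`, `lh_<layer>`) are all placeholders invented for the example, not anything Fabric requires or that appears in the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class SourceWorkspace:
    """One engineering workspace per source system (hypothetical naming)."""
    source: str
    env: str
    has_landing: bool = False
    lakehouses: list = field(default_factory=list)

    def __post_init__(self):
        # Bronze and Silver sit side by side in the same workspace so that
        # a single pipeline can orchestrate all the way to Silver.
        layers = ["bronze", "silver"]
        if self.has_landing:
            # Only some sources need a Landing layer in front of Bronze.
            layers.insert(0, "landing")
        self.lakehouses = [f"lh_{layer}" for layer in layers]

    @property
    def name(self):
        return f"ws_{self.source}_{self.env}"


def build_layout(sources, envs=("dev", "prod")):
    """Return one workspace per (source, environment) pair."""
    return [
        SourceWorkspace(source=s, env=e, has_landing=landing)
        for s, landing in sources.items()
        for e in envs
    ]


# Hypothetical sources; the value says whether a Landing layer is needed.
layout = build_layout({"crm": False, "erp": True})
for ws in layout:
    print(ws.name, ws.lakehouses)
```

Keeping the layout as data like this makes it easy to check how many workspaces a new source adds before anything is created.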
Why Gold in separate workspace?
· Some systems will be landed for historisation only, and may not require a gold layer
· Our gold layer will often be a combination of multiple sources
· This will be the first touch point for "prosumers" - producers of reports (developers), consumers of data.
· SPI and PII classified data will require obfuscation upon landing in Gold
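To show what "obfuscation upon landing in Gold" could look like in practice, here is a minimal Python sketch that applies a salted one-way hash to any column tagged PII or SPI. The column names, the tag dictionary, the salt handling and the choice of hashing are assumptions made for illustration; this is not the Warehouse dynamic data masking feature, which works differently, and a real implementation would keep the salt in a secret store.

```python
import hashlib

# Hypothetical classification tags assigned when the data was landed.
CLASSIFICATION = {
    "customer_id": "NONE",
    "email": "PII",
    "health_note": "SPI",
    "country": "NONE",
}

SENSITIVE_TAGS = {"PII", "SPI"}


def obfuscate(value, salt):
    """Replace a value with a truncated, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:16]


def to_gold(row, classification, salt="change-me"):
    """Copy a Silver row, obfuscating every column tagged PII or SPI."""
    out = {}
    for column, value in row.items():
        # Assumption: columns with no tag are treated as sensitive.
        tag = classification.get(column, "PII")
        out[column] = obfuscate(value, salt) if tag in SENSITIVE_TAGS else value
    return out


silver_row = {"customer_id": 42, "email": "a@example.com",
              "health_note": "n/a", "country": "NZ"}
gold_row = to_gold(silver_row, CLASSIFICATION)
print(gold_row)
```

Because the hash is deterministic for a given salt, the same input always maps to the same token, so obfuscated columns can still be used as join keys across Gold tables.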
Why Lakehouses in Bronze and Silver, Warehouses in Gold?
· Lakehouses in Engineering domain
- landing structured and unstructured data
- multiple languages, incl. Python (pandas), PySpark, T-SQL (read-only) etc., which plays to engineers' strengths
- complex workloads, available to Data Scientists (eventually if needed)
· Warehouses in Gold layer
- plays to the strengths of report developers, who may be more familiar with SQL and star schemas
- dynamic data masking available in Warehouse
- consumer ready - cleansed, conformed, enriched, business language
- schema availability - multiple related data products in a single Warehouse
Why only TEST/PROD at consumer end?
· Power BI Desktop will be used for building semantic models and reports - effectively becoming a DEV/TEST component
· Apps will be used for report consumption for the most part - this will be considered the final PROD. Limited users will be provisioned access at lower levels
Other Points of consideration
· Permissions will be granted at workspace level rather than item level wherever possible
· Entra ID groups to be used for permission grants
· Deployment pipelines not complete between Silver and Gold layers (i.e. not all items are deployed yet) so banking on functionality increasing soon
· Shortcuts used across LH and WH wherever possible to minimise data movement
· Data residency requirements met with a single capacity region.
· Cyber/Security drives "Least Privilege" access, so most users will be read only where they have access. Only admins and data team will be allowed to hit any engineering workspace.
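The workspace-level, group-based permission model above can be expressed as a small policy check. This is only a sketch: the group names, workspace names and the maximum role allowed per group are hypothetical. The four role names are the standard Fabric workspace roles (Admin, Member, Contributor, Viewer).

```python
# Fabric workspace roles, ranked from most to least privileged.
ROLE_RANK = {"Admin": 3, "Member": 2, "Contributor": 1, "Viewer": 0}

# Hypothetical Entra ID groups and the highest role each may hold in each
# kind of workspace. A group that is not listed gets no access at all.
MAX_ROLE = {
    "engineering": {
        "grp-fabric-admins": "Admin",
        "grp-data-team": "Contributor",
    },
    "gold": {
        "grp-fabric-admins": "Admin",
        "grp-data-team": "Contributor",
        "grp-report-devs": "Viewer",
    },
}


def violations(assignments):
    """Return (workspace, group, role) tuples that exceed the policy."""
    found = []
    for workspace, kind, group, role in assignments:
        allowed = MAX_ROLE[kind].get(group)
        if allowed is None or ROLE_RANK[role] > ROLE_RANK[allowed]:
            found.append((workspace, group, role))
    return found


# Example role assignments: (workspace, workspace kind, group, role).
assignments = [
    ("ws_crm_prod", "engineering", "grp-data-team", "Contributor"),
    ("ws_crm_prod", "engineering", "grp-report-devs", "Viewer"),
    ("ws_gold_prod", "gold", "grp-report-devs", "Member"),
    ("ws_gold_prod", "gold", "grp-report-devs", "Viewer"),
]
for item in violations(assignments):
    print("Policy violation:", item)
```

A check like this could be run against an export of the actual role assignments to confirm that only admins and the data team can reach engineering workspaces and that everyone else is read-only.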