Memberships

Data Innovators Exchange

Public • 170 • Free

7 contributions to Data Innovators Exchange
JSON data in a Satellite
Hi, for some use cases we are planning to put JSON data into a variant column in a Satellite without extracting it into individual columns. We would then extract the required columns in the Business Vault. Is this approach acceptable in the Data Vault pattern? We are using a Snowflake database.
1
1
New comment 13h ago
1 like • 13h
Definitely acceptable! I have done this a couple of times, but only when the structure of the JSON was not stable, e.g. fields appear and disappear between deliveries, datatypes are not consistent, or hierarchy levels within the JSON change. If that's the case, I usually go for the "let's handle this later" approach, especially when the amount of data is not that large and we can virtualize the Business Vault by reading the JSON directly (schema-on-read). This allows us to adjust the views without reloading anything every time. But as soon as the data is mostly "flat" and consistent, I would go for the schema-on-write approach to get 1) better downstream performance and 2) avoid applying datatypes on the way out and writing more complex queries (even if that is already easy in Snowflake, it is still a bit more effort).
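The schema-on-read Business Vault view described above could look roughly like this in Snowflake (table, column, and JSON field names are hypothetical):

```sql
-- Hypothetical Business Vault view: extracts fields on read
-- from a VARIANT column (payload) stored in a Raw Vault Satellite.
CREATE OR REPLACE VIEW biz_vault.sat_customer_details AS
SELECT
    hub_customer_hk,
    load_date,
    payload:customer_name::STRING AS customer_name,
    payload:address.city::STRING  AS city,
    payload:order_count::NUMBER   AS order_count
FROM raw_vault.sat_customer_json;
```

Because it is a view, a new or changed field in the JSON only requires a view change, not a reload of the Satellite.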
My 5 Tips when working with Snowflake
Of course there are dozens of tips available for Snowflake, but let me share the ones that came to mind first:
1) Understand how Snowflake stores the data! It uses micro-partitions, organized in a columnar way. Micro-partitions store statistics such as distinct values and value ranges for each column. Your goal should always be to prune as much as possible on both when querying data. For example: only select the columns you really need, and apply filters on columns whose values mostly do not overlap multiple micro-partitions. Also think about re-clustering your data if necessary, or creating your own values with a pattern to cluster your data on (usually only necessary for huge amounts of data in one table).
2) When data is spilled to local storage while querying, that is a good indicator that a bigger warehouse makes sense. I assume here that the query itself is already optimized and we are just dealing with a lot of data and maybe complex logic. But keep in mind: increasing the size of the Snowflake virtual warehouse by one step (e.g. M -> L) doubles the cost for the same runtime (calculated per cluster). So, if the query time drops below 50% of the original, we achieve a win-win: a faster and cheaper result! If the runtime cannot be reduced by 50% or more, you have to decide whether the quicker response is worth the money you now spend.
3) Snowflake's zero-copy clones allow you to test features and fixes against your production data in a very easy and fast way. They should be part of your deployment pipelines.
4) Insert-only loading reduces the number of versions Snowflake has to create for the micro-partitions. Updates and deletes cause this versioning of already existing micro-partitions, which costs time and additional storage. That also means that Data Vault, with its insert-only approach, meets the scalability characteristics of Snowflake!
5) The QUALIFY clause improved code writing a lot. It uses the result of a window function as a filter, which means you don't have to write nested sub-queries with self-joins.
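Tip 5 in practice: a sketch (hypothetical table and column names) that keeps only the latest row per key, with no nested sub-query or self-join:

```sql
-- Keep only the most recent satellite row per hub key.
-- QUALIFY filters directly on the window-function result.
SELECT
    hub_customer_hk,
    load_date,
    customer_name
FROM raw_vault.sat_customer
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY hub_customer_hk
    ORDER BY load_date DESC
) = 1;
```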
8
2
New comment 12h ago
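The break-even in tip 2 can be sanity-checked with simple arithmetic (illustrative credit numbers, not actual Snowflake pricing):

```python
# Illustrative cost model: one warehouse size step doubles the
# credit rate per hour, so cost = rate * runtime.
def query_cost(rate_per_hour: float, runtime_hours: float) -> float:
    return rate_per_hour * runtime_hours

# Medium warehouse: 4 credits/h, query runs 1.0 h -> 4 credits.
cost_m = query_cost(4, 1.0)

# Large warehouse: 8 credits/h. If the runtime exactly halves,
# the cost is unchanged; any further speed-up makes it cheaper.
cost_l_half = query_cost(8, 0.5)  # same cost, half the wait
cost_l_fast = query_cost(8, 0.4)  # >50% faster -> also cheaper

print(cost_m, cost_l_half, cost_l_fast)
```

If the larger warehouse cannot cut the runtime below half, you are paying extra purely for the faster answer.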
Requirement Gathering
Hey community, I am wondering what your best practices are for gathering requirements from the business to enable data engineers to work. Imagine you build a data platform for several purposes: reporting, dashboarding, cleansed data for AI input, etc. The general output is Information Marts (however they are modeled), and we want to build it the agile way. How much and what kind of information would be necessary from the business? How much in-person communication would you expect during development? A user story itself is always nice for understanding the intention, but on its own it is far too little, as details are missing (in my opinion). Looking forward to your opinions :)
6
3
New comment Jul 17
0 likes • Jul 16
Agree. Documentation about decisions/agreements is very important. It's sometimes a kind of insurance :) But what degree of information would you prefer if you had to deliver business value? Just a high-level user story and a contact person (i.e. a subject matter expert)? A business model? A mockup report? KPI definitions based on a template? A huge documentation page with just text? Data engineers should be able to start developing as fast as possible and not spend a lot of time understanding the requirements.
Data Vault Friday
Whom can we welcome to Data Vault Friday tomorrow at 11am CEST? :) Ask your question upfront and get an answer from Michael Olschimke: https://scalefr.ee/DataVaultFriday Or just attend and learn: https://us02web.zoom.us/webinar/register/WN_TsjULDy7Tuiwp0clye0BSQ#/registration Duration: usually 10-20 minutes.
5
0
Do Oracle users like Data Vault?
Currently tweaking datavault4dbt to work with Oracle. How come Oracle users are less interested in Data Vault than Snowflake users?
6
8
New comment Jul 15
3 likes • Jul 11
Another perception from a different perspective: as Oracle is one of the "older" database vendors, companies often already had their data warehouse solution on it before Data Vault became an industry-wide approved way of building a data platform. When they reach the point where the whole solution no longer scales, they look for the state of the art and plan an all-around refresh: a new way of working (agile), a new approach (Data Vault), new technology (mostly OLAP databases in the cloud). I think as soon as it comes to these migration projects where everything gets renewed and Data Vault is the way to go, Oracle's technology is no longer the top favorite for a restart. I see the same for SQL Server, btw. Not as pronounced, but similar.
Marc Winkelmann
@marc-winkelmann-2004
Hi, my name is Marc and I am implementing Data Platforms with the focus on Data Vault 2.0. Looking forward to chat/talk with you :)

Active 9h ago
Joined Jun 27, 2024
Hanover, Germany