Managing my data stack with a monorepo
It has been a while since my last update on my journey building a Modern Data Stack from scratch. Today, I want to share how I set up my data infrastructure and how I manage it from a single GitHub repository. Let's get started!

As explained in one of my previous articles, my data infra has three basic components:

- Snowflake as the data warehouse
- Airbyte for data ingestion
- dbt for data transformation

## Snowflake

Setting up Snowflake is straightforward. Simply visit their website and start a commitment-free trial. You can choose any edition of the tool (Standard, Enterprise, etc.) and use it for 30 days. During the setup, you will need to select your cloud provider and region. In my case, I chose AWS.

Once your account is created, you are given admin permissions for the entire account. With these permissions, you can set up databases and schemas, manage users and roles, configure warehouses, and make any other account-wide changes.

I have already explained my data warehouse design in an earlier article, so I proceeded to recreate it in a real project. From the beginning, I was eager to keep the configuration as code so that I could track every resource in my data warehouse. This may not seem very efficient at first, especially with only a few databases, schemas, and users. However, further down the road, with potentially 20 roles, 40 users, and numerous schemas, I would start to forget who has access to which resources and which grants I have made.

There are several tools that tackle this challenge. The first is the Terraform provider for Snowflake. I didn't choose this option because I wanted something simpler and more flexible than Terraform. So my second candidate was SnowDDL, a tool for declarative-style management of Snowflake resources. It took me about half a day to write all the configuration in YAML files and set up the infrastructure from scratch.

In most cases, SnowDDL is an awesome tool that promotes best practices and automates many tasks. In the end, though, I decided not to stick with it and instead built my own tool with similar functionality, tailored to my specific needs.
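
To make the idea of declarative, configuration-as-code management more concrete, here is a minimal sketch in Python. It is not SnowDDL's implementation and not my actual tool; the `DesiredState` class, the `render_statements` function, and the database, schema, and role names are all hypothetical placeholders. The sketch only shows the core idea: describe the resources you want, and let code turn that description into idempotent Snowflake statements.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DesiredState:
    """Hypothetical in-memory form of a declarative config (e.g. parsed from YAML)."""

    # database name -> list of schema names inside it
    databases: dict[str, list[str]] = field(default_factory=dict)
    # role name -> list of databases the role may use
    roles: dict[str, list[str]] = field(default_factory=dict)


def render_statements(state: DesiredState) -> list[str]:
    """Turn the desired state into idempotent Snowflake SQL statements."""
    statements: list[str] = []

    for database, schemas in sorted(state.databases.items()):
        statements.append(f"CREATE DATABASE IF NOT EXISTS {database};")
        for schema in schemas:
            statements.append(f"CREATE SCHEMA IF NOT EXISTS {database}.{schema};")

    for role, databases in sorted(state.roles.items()):
        statements.append(f"CREATE ROLE IF NOT EXISTS {role};")
        for database in databases:
            # Fail early if a role references a database the config never declares.
            if database not in state.databases:
                raise ValueError(f"role {role} references unknown database {database}")
            statements.append(f"GRANT USAGE ON DATABASE {database} TO ROLE {role};")

    return statements


if __name__ == "__main__":
    # Placeholder names for illustration only, not my real warehouse layout.
    state = DesiredState(
        databases={"RAW": ["AIRBYTE"], "ANALYTICS": ["STAGING", "MARTS"]},
        roles={"LOADER": ["RAW"], "TRANSFORMER": ["RAW", "ANALYTICS"]},
    )
    for statement in render_statements(state):
        print(statement)
```

Because every statement uses `IF NOT EXISTS` or a grant that can safely be repeated, the generated script can be re-run after each config change. A real tool would also need to compare the config against the live account and revoke or drop what is no longer declared, which this sketch leaves out.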