Oleg Agapov
# Managing my data stack with a monorepo
It has been a while since my last update on my journey building a Modern Data Stack from scratch. Today, I want to share how I set up my data infrastructure and how I manage it using a single GitHub repository. Let's get started!
As explained in one of the previous articles, I have three basic components of my data infra:
  • Snowflake as the data warehouse
  • Airbyte for data ingestion
  • dbt for data transformation
## Snowflake
Setting up Snowflake is a straightforward process. Simply visit their website and start a commitment-free trial period. You can choose any edition of the tool (standard, enterprise, etc.) and use it for 30 days. During the setup, you will need to select your cloud provider and region. In my case, I chose AWS.
Once your account is created, you will be given admin permissions for the entire account. With these permissions, you can set up databases and schemas, manage users and roles, configure warehouses, and make any other account-wide changes. I have already explained my data warehouse design in this article, so I proceeded to recreate it in a real project.
From the beginning, I was eager to define the configuration as code in order to meticulously track all the resources in my data warehouse. This may not seem very efficient at first, especially when dealing with only a few databases, schemas, and users. However, looking ahead to a future with potentially 20 roles, 40 users, and numerous schemas, I would quickly lose track of who has access to which resources and what I have provisioned. To tackle this challenge, there are several tools available.
The first tool is the Terraform adapter for Snowflake. However, I didn't choose this option because I wanted something simpler and more flexible than Terraform. So, my second tool of choice was SnowDDL, a tool for declarative-style management of Snowflake resources. It took me about half a day to create all the configuration in YAML files and set up the infrastructure from scratch. Overall, SnowDDL is an awesome tool that promotes best practices and automates many tasks. However, I eventually decided not to use it and instead created my own tool with similar functionality tailored to my specific needs.
One of the major reasons why I wasn't satisfied with SnowDDL is its inability to grant permissions to an entire database. Instead, the tool only allows access at the schema level, while keeping database permissions restricted to administrators. This makes it difficult to use the tool in conjunction with Airbyte or dbt, as these tools require the ability to create their own schemas and therefore need permissions at the database level. Unfortunately, SnowDDL does not support this.
So, I created my own tool with Python. It took me about a week to develop, and it has fewer features compared to SnowDDL. However, it fulfils my requirements. Essentially, you can still describe your configuration using YAML. The role system is simplified yet flexible. Additionally, there is an option to destroy resources, which is not available in SnowDDL (only possible for prefixed databases).
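To give a sense of what this configuration as code looks like, here is a hypothetical YAML sketch (the keys and resource names are illustrative only, not the actual format of my tool or of SnowDDL). The important part is that a role can receive privileges at the database level, which is exactly what Airbyte and dbt need in order to create their own schemas:

```yaml
# Illustrative sketch only: keys and names are hypothetical,
# not the actual config format of my tool or of SnowDDL.
databases:
  - name: RAW    # landing zone for Airbyte
  - name: DEV    # dbt development builds
  - name: PROD   # dbt production builds

roles:
  - name: AIRBYTE_ROLE
    grants:
      # database-level grant so Airbyte can create its own schemas
      - privileges: [USAGE, CREATE SCHEMA]
        on_database: RAW
  - name: TRANSFORMER
    grants:
      - privileges: [USAGE]
        on_database: RAW
      - privileges: [USAGE, CREATE SCHEMA]
        on_database: DEV

users:
  - name: AIRBYTE_USER
    roles: [AIRBYTE_ROLE]
  - name: DBT_USER
    roles: [TRANSFORMER]
```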
There is another new tool called Titan, which allows you to programmatically create Snowflake resources. I am not ruling out migrating my configuration to SnowDDL or Titan in the future, but for now everything is working fine, so I am not pursuing this further at the moment.
## Airbyte
The next tool I set up was Airbyte.
Initially, I planned to use the Airbyte Cloud version, but decided to go with the on-prem version for now. I followed this guide to set up Airbyte with Docker. First, I created an EC2 machine (`t2.large`, costing about $67 per month). Then, following the tutorial, I installed the tool by connecting to the VM using SSH.
Connecting third-party data tools to Airbyte is very straightforward. All you need to do is obtain the API credentials for the tools and then follow the wizard in Airbyte. Connecting Snowflake was also easy because I created a separate AIRBYTE_USER account in the previous step.
In no time, I had data integration up and running, syncing the first few data sources directly to my data warehouse.
## dbt
Finally, it was time to set up dbt. Here again, I used the dbt Docker image for this task. Although I could have used a Python virtual environment, I decided to experiment with Docker.
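For reference, a minimal docker-compose setup for running dbt in a container could look like the sketch below. It assumes the official dbt-snowflake image and the mount paths from the dbt Docker documentation; adjust the image tag and paths to your own project.

```yaml
# Minimal sketch of a docker-compose.yml for containerized dbt runs.
# Assumes the official dbt-snowflake image; pin the tag to the version you actually use.
services:
  dbt:
    image: ghcr.io/dbt-labs/dbt-snowflake:1.7.latest
    volumes:
      - ./:/usr/app                               # dbt project
      - ./profiles.yml:/root/.dbt/profiles.yml    # connection profiles
    working_dir: /usr/app
    entrypoint: ["dbt"]
```

With this in place, dbt commands run as `docker compose run dbt debug`, `docker compose run dbt build`, and so on.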
To bootstrap the dbt project, all you need to do is run the `dbt init` command and create a profiles.yml file with the database credentials for the Development and, optionally, Production environments. After that, I used the dbt-codegen package to create a few staging models (from the data ingested by Airbyte) and successfully built them in the DEV database.
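For Snowflake, a profiles.yml looks roughly like this (the account, role, warehouse, and database names below are placeholders, and the password is read from an environment variable instead of being committed to the repo):

```yaml
# Sketch of a profiles.yml for dbt + Snowflake; all names are placeholders.
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "<account_identifier>"
      user: DBT_USER
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      warehouse: TRANSFORMING
      database: DEV
      schema: dbt_dev
      threads: 4
    prod:
      type: snowflake
      account: "<account_identifier>"
      user: DBT_USER
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      warehouse: TRANSFORMING
      database: PROD
      schema: analytics
      threads: 4
```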
For production runs of dbt, I chose to use dbt Cloud for a few reasons. Firstly, it is very easy to start materializing your models using their scheduler. Secondly, the dashboard provides a comprehensive overview of project runs, source freshness reports, and project documentation out of the box, so I don't have to worry about them for the time being. Lastly, I am eager to try their metrics and Semantic Layer feature, which are only available in dbt Cloud.
So, that's how I managed to create a working data infrastructure single-handedly and with minimal overhead. Now, let's talk about the monorepo.
## Monorepo
Every component that I described resides in one repository and has its own folder within. The `/snowflake` folder contains configurations for the data warehouse topology. In the `/airbyte` folder, there are config files needed to start the Airbyte service, which can be used for both a locally running instance and a deployed on-prem instance. The `/dbt` folder contains my dbt project with all the analytical models.
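Roughly, the repository layout looks like this (the individual file names are indicative rather than exhaustive):

```
.
├── Taskfile.yml          # automation tasks (more on this below)
├── snowflake/            # declarative warehouse config + my Python tool
│   └── requirements.txt
├── airbyte/              # config files for the Airbyte deployment
│   └── docker-compose.yaml
└── dbt/                  # dbt project with analytical models
    ├── dbt_project.yml
    └── profiles.yml
```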
Furthermore, each folder is isolated and has its own requirements.txt or docker-compose file. This allows each tool to be run independently, while storing everything together provides a comprehensive overview of the infrastructure. Additionally, I use the Taskfile CLI tool for automation. Taskfile is similar to Makefile but more user-friendly, making it simple to automate tool usage and manage environments.
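As an illustration, a Taskfile for this kind of setup could look like the sketch below; the task names and commands are examples rather than my exact file, and `apply.py` is a hypothetical entry point for my Snowflake tool:

```yaml
# Example Taskfile.yml; task names and commands are illustrative.
version: "3"

tasks:
  dbt-build:
    desc: Build dbt models against the dev target
    dir: dbt
    cmds:
      - dbt build --target dev

  snowflake-plan:
    desc: Preview changes to the Snowflake configuration
    dir: snowflake
    cmds:
      - python apply.py --dry-run   # hypothetical entry point of my tool

  airbyte-up:
    desc: Start a local Airbyte instance
    dir: airbyte
    cmds:
      - docker compose up -d
```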
What is still in the plans and needs more time to implement:
  1. CI/CD pipelines for every tool: deploying Airbyte on EC2, applying the Snowflake config on Prod, and running Slim CI in dbt Cloud (see the sketch after this list).
  2. Further development of my Snowflake tool to cover a few additional resources (schemas and prefixed schemas, better handling of YAML options, etc.).
  3. Better README documentation, because right now it only covers the minimum needed to get started, not all the possibilities (e.g. examples of how to create resources with my Snowflake tool).
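For the CI/CD point above, a minimal GitHub Actions workflow for the dbt part could look something like this sketch. None of it exists in the repo yet, so the file name, secrets, and steps are assumptions, and it runs a plain dbt build rather than Slim CI (which dbt Cloud triggers through its own jobs):

```yaml
# Hypothetical .github/workflows/dbt-ci.yml; nothing like this exists in the repo yet.
name: dbt CI

on:
  pull_request:
    paths:
      - "dbt/**"

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-snowflake
      - name: Build models against the dev target
        working-directory: dbt   # assumes profiles.yml lives in the dbt/ folder
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
        run: dbt build --target dev
```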
If you have any questions about the setup feel free to ask in the comments!