
Memberships

3 contributions to Modern Data Community
RDS to S3 migration
Hello everyone, this is my first post. I recently joined a company in my first official data engineering role, and I'm looking for some help here. I've been tasked with transferring ~150M records from an RDS MySQL table to S3 as .parquet files. The table is so hard to query that data is dropped daily, keeping only 90 days, and that's one of the problems with the migration: querying it is nearly impossible. My first approach was a simple Lambda with the MySQL connector and a Python script that chunks the table, but that would take about two days (a rough sketch of that approach is below). The idea is also to get this data somewhere else before thinking about a Lakehouse solution. My questions are:
- What services do you recommend to make this one-time migration as fast and smooth as possible? My first thought is Glue (I've used it before, but for a different purpose) or DMS (which I've never used).
- What ETL would you propose to run this process daily (~1.5M records)? Glue comes to mind again, if I'm successful with the first bullet point.
- Lastly, this data will be used for analytics. Initially it will sit in S3 so the team can query it with Athena while they work out which KPIs they want to track; in the future, the idea is to move it somewhere that makes it fast to query and build models on. The whole company environment is in AWS, so my first thought is Redshift, but I really like the efficiency and how Google BigQuery handles this amount of data.
Thank you so much for reading!
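For reference, a minimal sketch of the chunked export described above, assuming a table named `events` with an auto-increment `id` primary key (both hypothetical), keyset pagination instead of LIMIT/OFFSET, and placeholder credentials and bucket names. At this volume it would need to run somewhere long-lived (e.g., ECS or an EC2 box), not inside Lambda's 15-minute limit:

```python
# Sketch only: table name, column names, credentials, and bucket are
# hypothetical placeholders, not taken from the original post.
import pymysql
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

CHUNK = 500_000
fs = s3fs.S3FileSystem()
conn = pymysql.connect(host="my-rds-host", user="user",
                       password="***", database="mydb")

last_id = 0
part = 0
with conn.cursor() as cur:
    while True:
        # Keyset pagination: "WHERE id > last_id" walks the primary-key
        # index, unlike LIMIT/OFFSET, which rescans all skipped rows.
        cur.execute(
            "SELECT * FROM events WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, CHUNK),
        )
        rows = cur.fetchall()
        if not rows:
            break
        cols = [d[0] for d in cur.description]
        table = pa.table({c: [r[i] for r in rows] for i, c in enumerate(cols)})
        with fs.open(f"my-bucket/raw/events/part-{part:05d}.parquet", "wb") as f:
            pq.write_table(table, f)
        last_id = rows[-1][0]  # assumes id is the first selected column
        part += 1
```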
2 likes • Mar 19
Hi Oscar, you said your company environment is AWS, so you have two options there:
1. AWS Database Migration Service: configure a DMS task to replicate data from your MySQL database to an S3 target endpoint. You can specify Parquet as the target format for the data.
2. Set up AWS Glue for the data catalog and ETL.
I hope this helps!
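To make option 1 concrete, here is a minimal boto3 sketch of the S3 target endpoint; the identifier, role ARN, bucket, and folder are placeholders, not from this thread. A source endpoint and a replication task (full load, then optionally CDC for the daily ~1.5M rows) would still be needed on top of this:

```python
# Sketch only: all identifiers, ARNs, and bucket names below are
# placeholders to illustrate the Parquet target settings.
import boto3

dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="mysql-to-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",
        "BucketName": "my-data-lake",
        "BucketFolder": "raw/mysql",
        "DataFormat": "parquet",          # write .parquet instead of the default CSV
        "ParquetVersion": "parquet-2-0",
        "CompressionType": "gzip",
    },
)
```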
Data Vault
Data Vault Modeling: I just started learning about this topic (Hubs, Links, and Satellites). I'm trying to understand how it relates to Kimball, whether it's an extension of it, and how to apply it effectively to real-world challenges.
0 likes • Feb 12
@Emile Van Der Heyde I watched Michael's video, very informative. Thanks for sharing the article.
0 likes • Feb 12
@Tomas Truchly Great explanation. Yes, I noticed Data Vault must be done right from the beginning. I've compiled some useful info on what I've learned so far in an insurance-company PoC: https://github.com/jayronsoares/datavaultmodeling
What's your Data Stack?
It's one thing to read articles or watch videos about perfectly crafted data architectures and think you're way behind. But back here in reality, things get messy and nothing is ever perfect or 100% done. Most of us are usually working on architectures that are:
- Old & outdated
- Hacked together
- Mid-migration to new tools
- Non-existent
Or perhaps you're one of the lucky ones who recently started from scratch and things are running smoothly. Regardless, the best way to learn what's working (and not working) is from others. I believe this could be one of the best insights this community can collectively offer each other. So let's hear it: what does your data stack look like for the following components?
1. Database/Storage
2. Ingestion
3. Transformation
4. Version Control
5. Automation
Feel free to add other items outside of these 5, but let's focus on these to keep it organized.
2 likes • Dec '23
I usually work with SQL + Python + Alteryx to develop pipelines.
1 like • Dec '23
@Michael Kahan Yes, Postgres and Redshift are the most used databases in my daily routine.
Bosete Kumar
@jayron-soares-5278
Data Engineer focused on SQL, Python, dbt, data quality, data modeling, and data visualization

Active 21d ago
Joined Dec 27, 2023