
Memberships

3 contributions to Modern Data Community
RDS to S3 migration
Hello everyone, this is my first post. I recently joined a company in my first official data engineering role, and I'm looking for some help here. I've been tasked with transferring ~150M records from an RDS MySQL table to S3 as .parquet files. The table is so hard to query that data is dropped daily, keeping only 90 days, and that's one of the problems with the migration: querying it is nearly impossible. My first approach was a simple Lambda with the MySQL connector and a Python script that chunks the table, but that would take about two days (a rough sketch of that approach is below). The idea is also to get this data somewhere else before thinking about a Lakehouse solution. My questions are:
- What services do you recommend to make this one-time migration as fast and smooth as possible? My first thought is Glue (I've used it before, but for a different purpose) or DMS (which I've never used).
- What ETL would you propose to run this process daily (~1.5M records)? Glue comes to mind again, if I'm successful with the first bullet point.
- Lastly, this data will be used for analytics. Initially it will sit in S3 so the team can query it with Athena while they work out which KPIs they want to track; in the future, the idea is to move it somewhere that makes it fast to query and build models on. The whole company environment is in AWS, so my first thought is Redshift, but I really like the efficiency and how Google BigQuery handles this amount of data.
Thank you so much for reading!
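For reference, a minimal sketch of the chunked export described above, assuming a table named `events` with an auto-increment `id` primary key (both hypothetical), keyset pagination instead of LIMIT/OFFSET, and placeholder credentials and bucket names. At this volume it would need to run somewhere long-lived (e.g., ECS or an EC2 box), not inside Lambda's 15-minute limit:

```python
# Sketch only: table name, column names, credentials, and bucket are
# hypothetical placeholders, not taken from the original post.
import pymysql
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

CHUNK = 500_000
fs = s3fs.S3FileSystem()
conn = pymysql.connect(host="my-rds-host", user="user",
                       password="***", database="mydb")

last_id = 0
part = 0
with conn.cursor() as cur:
    while True:
        # Keyset pagination: "WHERE id > last_id" walks the primary-key
        # index, unlike LIMIT/OFFSET, which rescans all skipped rows.
        cur.execute(
            "SELECT * FROM events WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, CHUNK),
        )
        rows = cur.fetchall()
        if not rows:
            break
        cols = [d[0] for d in cur.description]
        table = pa.table({c: [r[i] for r in rows] for i, c in enumerate(cols)})
        with fs.open(f"my-bucket/raw/events/part-{part:05d}.parquet", "wb") as f:
            pq.write_table(table, f)
        last_id = rows[-1][0]  # assumes id is the first selected column
        part += 1
```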
2 likes • Mar 19
Hi Oscar, you said your company environment is AWS, so you have two options there:
1. AWS Database Migration Service: configure a DMS task to replicate data from your MySQL database to an S3 target endpoint. You can specify Parquet as the target format for the data.
2. Set up AWS Glue for the data catalog and ETL.
I hope this helps!
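To make option 1 concrete, here is a minimal boto3 sketch of the S3 target endpoint; the identifier, role ARN, bucket, and folder are placeholders, not from this thread. A source endpoint and a replication task (full load, then optionally CDC for the daily ~1.5M rows) would still be needed on top of this:

```python
# Sketch only: all identifiers, ARNs, and bucket names below are
# placeholders to illustrate the Parquet target settings.
import boto3

dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="mysql-to-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",
        "BucketName": "my-data-lake",
        "BucketFolder": "raw/mysql",
        "DataFormat": "parquet",          # write .parquet instead of the default CSV
        "ParquetVersion": "parquet-2-0",
        "CompressionType": "gzip",
    },
)
```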
Data Vault
Data Vault Modeling: I just started learning about this topic (Hubs, Links, and Satellites). I'm trying to understand how it relates to Kimball, whether it's an extension of it, and how to apply it effectively to real-world challenges.
0 likes • Feb 12
@Emile Van Der Heyde I watched Michael's video, very informative. Thanks for sharing the article.
0 likes • Feb 12
@Tomas Truchly Great explanation. Yes, I noticed Data Vault must be done right from the beginning. I've compiled some useful info on what I've learned so far in an insurance-company PoC: https://github.com/jayronsoares/datavaultmodeling
What's your Data Stack?
It's one thing to read articles or watch videos about perfectly crafted data architectures and think you're way behind. But back here in reality, things get messy and nothing is ever perfect or 100% done. Most of us are usually working on architectures that are:
- Old & outdated
- Hacked together
- Mid-migration to new tools
- Non-existent
Or perhaps you're one of the lucky ones who recently started from scratch and things are running smoothly. Regardless, the best way to learn what's working (and not working) is from others. I believe this could be one of the best insights this community can collectively offer each other. So let's hear it: what does your data stack look like for the following components?
1. Database/Storage
2. Ingestion
3. Transformation
4. Version Control
5. Automation
Feel free to add other items outside of these 5, but let's focus on these to keep it organized.
2 likes • Dec '23
I usually work with SQL + Python + Alteryx to develop pipelines.
1 like • Dec '23
@Michael Kahan Yes, Postgres and Redshift are the most used databases in my daily routine.
Bosete Kumar
@jayron-soares-5278
Data Engineer focused on SQL, Python, dbt, data quality, data modeling, and data visualization

Active 21d ago
Joined Dec 27, 2023