Feedback and advice on ETL pipeline for AWS
Hi, I'm working on my first-ever ETL pipeline project from scratch, and I designed this architecture for AWS. It's based on research I've done and on conversations with more experienced colleagues. I'm sure it can be improved, but so far I feel comfortable with it.

Brief explanation of the data lake:

- the "raw" bucket is intended as a landing zone, with data in its original state (for auditing purposes)
- the "cleansed" bucket contains data after fixing inconsistencies and invalid values, deleting unnecessary columns, deduplicating, etc.
- the "curated" bucket is intended to hold the data ready to be ingested into the data warehouse, matching the fact and dimension table schemas

First of all, I'd like to know what you think in general. Then I have a couple of questions:

1. Should I store the processed data in the cleansed bucket as CSV, or would Parquet be a better option?
2. Should I keep curated as an S3 bucket with Parquet files, or would it be better as a database schema?
3. Last but not least, I'm struggling with the pipeline strategy, i.e. how to perform incremental loads (I plan to batch update daily). What techniques can I use? Any design considerations? AWS Glue jobs have a bookmarks feature to avoid reprocessing data, but if the schema is updated in the catalog, the job will fail and process all the data in the source. (There's a rough sketch of what I have in mind at the end of the post.)

Thanks in advance. I couldn't think of a better place to share my concerns than this community.
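To make question 3 a bit more concrete, here's roughly the kind of Glue job I have in mind for the raw → cleansed step, with job bookmarks enabled. The database, table, bucket, column and partition names are just placeholders, and I haven't tested this end to end, so please treat it as a sketch rather than the actual job:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue boilerplate; bookmarks are enabled on the job itself,
# and the transformation_ctx below ties the bookmark to this read.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read only the files not seen by previous runs from the raw bucket
# (database and table names are placeholders from my catalog).
raw_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_datalake_raw",
    table_name="orders",
    transformation_ctx="raw_orders_read",
)

# Cleansing steps: drop columns I don't need, deduplicate, fix types, etc.
cleansed_df = (
    raw_dyf.toDF()
    .drop("internal_notes")          # placeholder for an unnecessary column
    .dropDuplicates(["order_id"])    # placeholder dedup key
)

# Write Parquet to the cleansed bucket, partitioned by the daily load date,
# so the curated/warehouse step can pick up just one day's partition.
(
    cleansed_df.write
    .mode("append")
    .partitionBy("load_date")
    .parquet("s3://my-cleansed-bucket/orders/")
)

# Committing the job advances the bookmark so these files aren't reprocessed.
job.commit()
```

The alternative I've been reading about is not relying on bookmarks at all and instead filtering the source on a date partition or watermark column each day, which sounds more robust to catalog schema changes, but I'm not sure which approach is more common in practice.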