Hi everyone,
I'm working on a challenging data migration project and could use some collective wisdom from the community.
Current Situation
I've set up a clean Supabase system for a client that logs their Slack workspace messages, making everything easily queryable and accessible. This works great for new messages going forward.
The problem: Before my involvement, they were dumping all Slack messages into Google Docs as raw text (essentially just appending new messages). I now have about 60 Google Docs full of unstructured, messy Slack history that contains valuable project updates and context.
The Challenge
I want to retroactively import all this historical data into Supabase with proper timestamps and structure to make it queryable alongside the new data. This would provide my client with a complete timeline of communications and enable valuable insights from their entire Slack history.
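To make "proper timestamps and structure" concrete, this is the kind of target row shape I have in mind, as a stdlib-only sketch. The column names (`channel`, `sent_at`, `source_doc`, etc.) are hypothetical, not an existing schema; the resulting dict is shaped as the payload you would hand to a supabase-py `table(...).insert(...)` call:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class SlackMessage:
    """One row in the target messages table (hypothetical schema)."""
    channel: str
    author: str
    sent_at: str       # ISO 8601 timestamp, normalized to UTC
    text: str
    source_doc: str    # which Google Doc this came from, for traceability

def to_row(channel: str, author: str, sent_at: datetime,
           text: str, source_doc: str) -> dict:
    """Build the dict payload for a Supabase insert."""
    return asdict(SlackMessage(
        channel=channel,
        author=author,
        sent_at=sent_at.astimezone(timezone.utc).isoformat(),
        text=text,
        source_doc=source_doc,
    ))
```

Keeping `source_doc` on every row means any suspicious extraction can be traced back to the original Google Doc later.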
I've tried using Gemini to process these docs, but it's painfully slow given the volume and messiness of the data.
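One thing that helped a bit: instead of feeding whole docs to the model, I split each doc into bounded chunks at blank-line boundaries so individual messages aren't cut mid-thought, then process chunks independently (which also allows parallel calls). A minimal sketch of that chunker, with an assumed budget of 8,000 characters:

```python
def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split a raw dump into chunks of at most max_chars,
    breaking on blank lines so messages stay intact.
    A single block longer than max_chars is kept whole."""
    chunks, current = [], ""
    for block in text.split("\n\n"):
        candidate = (current + "\n\n" + block) if current else block
        if len(candidate) > max_chars and current:
            chunks.append(current)   # flush the full chunk
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```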
Approaches I'm Considering
- RAG (Retrieval-Augmented Generation): Drop all docs into a RAG system, but embedding-based retrieval returns semantically similar chunks rather than structured rows, so I'm concerned it won't preserve temporal context or support filtering by date/project/user.
- Custom Parsing Scripts: Write scripts to identify message patterns and structure, but the inconsistency in the Google Docs makes this challenging.
- Manual Processing: Not really feasible given the volume.
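For the custom-parsing route, here's roughly where I've gotten. This assumes a copy-paste shape like `Alice Smith [10:42 AM]` on its own line followed by message text; that pattern is an assumption on my part, and the real docs vary, which is exactly the problem:

```python
import re

# Assumed header shape: "Alice Smith [10:42 AM]" on its own line.
HEADER = re.compile(
    r"^(?P<author>[A-Za-z .'-]+?)\s*\[(?P<time>\d{1,2}:\d{2}\s*(?:AM|PM))\]\s*$"
)

def parse_dump(text: str) -> list[dict]:
    """Group lines under the most recent header line into
    (author, time, text) records."""
    records, current = [], None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            if current:
                records.append(current)
            current = {"author": m["author"], "time": m["time"], "text": ""}
        elif current is not None:
            # Accumulate body lines under the current header.
            current["text"] = (current["text"] + "\n" + line).strip()
    if current:
        records.append(current)
    return records
```

It works on the docs that follow this pattern, but each doc seems to have its own variant, so I'd need a pattern per format (or an LLM fallback for the chunks no pattern matches).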
What I'm Looking For
Has anyone tackled a similar problem of converting raw, unstructured text logs into properly structured, timestamped tabular data? Any tools, approaches, or frameworks you'd recommend specifically for extracting Slack-style messages from text dumps?
Any insights would be greatly appreciated!