Chunk size & text splitting
Hi guys

At the moment I'm working extensively with large transcripts from ±2 hour Zoom recordings. The formats of the recordings tend to differ: sometimes it's workshop style with multiple speakers, sometimes module style with a single speaker, sometimes interview style with two speakers. The end result is a ±200k VTT text file with a free-flowing, time-stamped conversation. Any individual chunk carries very little context, but the transcript makes sense as a whole.

I'm wondering what the best or most effective way is to embed this type of data into a RAG store like Pinecone? There are several options:

1. As is: each transcript as one huge text blob with transcript-specific metadata.
2. Chunk it into 10/20/100 text blobs and try to add metadata, though the chunks will all probably just share the source's metadata.
3. Pre-process the transcript into logical blocks (topics or categories) and add those blocks as chunks/vectors.

It just seems to me that whatever text-splitting algorithm I use, I'll be ripping all the context out of the transcript before I embed it, unless I embed things natively... But maybe I don't know enough. I've obviously been through the related RAG videos, but they all talk about the process and less about how to ensure the inputs are good enough to get great results from the output.

PS - see attached for a sample transcript. I have a couple hundred of 'em...
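PPS - to make options 2/3 concrete, here's roughly the kind of pre-processing I'm imagining. It's just a minimal sketch (the chunk size, overlap, and `source_id` are made-up parameters, and the regex only handles `hh:mm:ss.mmm` timestamps): parse the VTT cues with the stdlib, then group consecutive cues into overlapping chunks, each carrying its own start/end timestamps as metadata instead of only the source file's metadata.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict

# Matches "hh:mm:ss.mmm --> hh:mm:ss.mmm" cue timing lines.
TIMESTAMP = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})"
)

def parse_vtt(raw: str):
    """Return (start, end, text) tuples for each cue in a WEBVTT string."""
    cues = []
    for block in raw.strip().split("\n\n"):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            m = TIMESTAMP.search(line)
            if m:
                text = " ".join(lines[i + 1:]).strip()
                if text:
                    cues.append((m.group(1), m.group(2), text))
                break
    return cues

def chunk_cues(cues, source_id, max_chars=1500, overlap_cues=2):
    """Group consecutive cues into ~max_chars chunks, carrying a few
    trailing cues over into the next chunk so boundary context isn't lost."""
    chunks, buf, fresh = [], [], 0
    for cue in cues:
        buf.append(cue)
        fresh += 1
        if sum(len(c[2]) for c in buf) >= max_chars:
            chunks.append(Chunk(
                text=" ".join(c[2] for c in buf),
                metadata={"source": source_id,
                          "start": buf[0][0], "end": buf[-1][1]},
            ))
            buf = buf[-overlap_cues:]  # overlap into the next chunk
            fresh = 0
    if fresh:  # flush any cues not yet emitted
        chunks.append(Chunk(
            text=" ".join(c[2] for c in buf),
            metadata={"source": source_id,
                      "start": buf[0][0], "end": buf[-1][1]},
        ))
    return chunks
```

Each `Chunk.text` would then go to the embedding model, and `Chunk.metadata` would ride along in the vector store so a retrieved chunk can be traced back to its recording and timestamp range.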