Hey Community,
I'm currently facing an issue when trying to extract text from a DOCX file in my workflow. The "Extract from Text" node returns unreadable binary data instead of plain text.
Context:
- DOCX files are essentially ZIP archives containing multiple XML files.
- The actual document text is stored in word/document.xml, which is encoded in UTF-8.
- Simply using "Extract from XML" on the raw DOCX file doesn't work because the file is a compressed ZIP archive, not plain XML.
- Trying to send the extracted data to Supabase Vector Store results in the error: "unsupported Unicode escape sequence 400 Bad Request".
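To illustrate the structure described above, here is a minimal sketch in plain Python (standard library only, run outside n8n). It is not an n8n feature and not a confirmed solution. The helper name `extract_docx_text` is hypothetical, and the sketch assumes the document body lives at `word/document.xml` and that only paragraph text is needed (headers, footers, and footnotes are stored in other parts and are ignored).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(docx_bytes: bytes) -> str:
    """Hypothetical helper: return the paragraph text of a DOCX as plain text."""
    # A DOCX is a ZIP archive, so open it from memory and read the main part.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; the visible text sits in <w:t> elements inside it.
    # Simplification: paragraphs nested in text boxes may be counted twice.
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The result is an ordinary string with one line per paragraph, which is the kind of input a text splitter or vector store would expect instead of the raw archive bytes.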
What I Tried:
- Passing the DOCX file through "Extract from Text" → produces garbled characters.
- Changing file encoding before extracting → no success.
- Sending raw DOCX data to Supabase → results in the Unicode error.
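Both symptoms are consistent with the raw file being treated as text while it is still binary ZIP data. The sketch below (plain Python, hypothetical helper names) checks for the ZIP signature and for NUL bytes. PostgreSQL does not accept the NUL character (`\u0000`) in text or JSON text values, so NUL bytes in the payload are a plausible, but unconfirmed, cause of the Unicode error.

```python
def looks_like_zip(data: bytes) -> bool:
    """Hypothetical check: True if the data starts with the ZIP local file header signature."""
    return data[:4] == b"PK\x03\x04"


def contains_nul(data: bytes) -> bool:
    """Hypothetical check: True if the data contains at least one NUL byte."""
    return b"\x00" in data
```

If both checks return True for the bytes being passed along, the data is still the compressed archive and has not yet been converted to text.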
Question:
How can I properly extract readable text from a DOCX file in n8n? Do I need an additional step to unzip the DOCX file before extracting the XML content? Is there a built-in node or best practice for handling DOCX files correctly?
Any help or guidance would be greatly appreciated! 🚀