How to Properly Extract Text from a DOCX File in n8n? (solved)

Hey Community,

I'm currently facing an issue when trying to extract text from a DOCX file in my workflow. The "Extract from Text" node returns unreadable binary data instead of plain text.

Context:

DOCX files are essentially ZIP archives containing multiple XML files.
The actual document text is stored in word/document.xml, which is encoded in UTF-8.
Simply using "Extract from XML" on the raw DOCX file doesn’t work because the file is a compressed ZIP archive, not plain XML.
Trying to send the extracted data to Supabase Vector Store results in the error: "unsupported Unicode escape sequence 400 Bad Request".
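
To illustrate the structure described above, here is a minimal standalone Python sketch (not an n8n node, and not necessarily how this should be solved inside n8n) that opens the DOCX as a ZIP archive, reads word/document.xml, and joins the text runs of each paragraph. The function name docx_to_text is hypothetical. It assumes the whole file is available as bytes and ignores headers, footers, and footnotes, which live in other XML parts of the archive.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(docx_bytes: bytes) -> str:
    """Return the body text of a DOCX file, one line per paragraph.

    Hypothetical helper for illustration only.
    """
    # A DOCX file is a ZIP archive, so open it as one instead of decoding it as text
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)

    paragraphs = []
    # <w:p> is a paragraph; the visible text sits in <w:t> elements inside it
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))

    return "\n".join(paragraphs)
```

Usage would look like docx_to_text(open("report.docx", "rb").read()), where report.docx is a placeholder file name.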

What I Tried:

Passing the DOCX file through "Extract from Text" → produces garbled characters.
Changing file encoding before extracting → no success.
Sending raw DOCX data to Supabase → results in the Unicode error.
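
A possible explanation for the Unicode error, assuming the Supabase store is backed by PostgreSQL: PostgreSQL cannot store the NUL character (\u0000) in text or jsonb values, and raw ZIP bytes interpreted as text contain many of them. The sketch below is only an illustration of stripping such characters from a string before it is sent anywhere. The function name sanitize_text is hypothetical, and it does not make raw DOCX bytes readable; the archive still has to be unzipped first.

```python
import re

# NUL and other C0 control characters, except tab (\x09), newline (\x0a)
# and carriage return (\x0d), which are legitimate in plain text
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize_text(text: str) -> str:
    """Remove control characters that a database may refuse to store.

    Hypothetical helper for illustration only.
    """
    return _CONTROL_CHARS.sub("", text)
```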

Question:

How can I properly extract readable text from a DOCX file in n8n? Do I need an additional step to unzip the DOCX file before extracting the XML content? Is there a built-in node or best practice for handling DOCX files correctly?

Any help or guidance would be greatly appreciated! 🚀
