CAG: A Fresh Approach to LLM Document Memory Using 128K Context Windows
Hey everyone,
I am a complete novice at this, so be gentle.
In this project I explore various memory management possibilities for local AI, focusing on CAG.
Note: I focus on production-ready, tangible results rather than useless buzzwords like AI agents designed to remind you to poop.
I'm sharing my project llama-cag-n8n - a complete implementation of Cache-Augmented Generation (CAG) for document processing using large context window models.
The purpose is to have a reliable LLM that can query a fixed dataset with precision. My personal use case is interacting with company handbooks and manuals to get precise answers to queries.
My work is inspired by the "Don't Do RAG" paper, which discusses CAG.
What is CAG? (I have an explanation on my GitHub.)
The TL;DR:
  • Instead of chunking documents into tiny pieces like traditional RAG, this system lets models process entire documents at once (up to 128K tokens)
  • It creates "memory snapshots" (KV caches) of documents that can be instantly loaded for future queries
  • Much faster responses with deeper document understanding
  • Works offline, Mac-compatible, and integrates easily via n8n
The Document Processing Sauce
The core is how it handles document preprocessing. I'm using Claude 3.7 Sonnet's 200K input window and its ability to output up to 128K tokens to create optimized documents. This sidesteps the need for complex chunking strategies - Claude handles redundancy removal and condensation while preserving all critical information.
It is always possible to replace this with a more involved OCR/chunking workflow, but that is not a priority for me if I can get away with a simpler solution for now.
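To make the preprocessing step concrete, here is a minimal sketch of how the condensation call could look with the Anthropic Python SDK. The model ID, the 128K-output beta header, and the prompt wording are my assumptions for illustration, not the exact code from the repo.

```python
# Sketch: condense a raw document with Claude 3.7 Sonnet so it fits the
# 128K-token budget used for the KV cache. Model ID and beta header are assumed.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def condense_document(raw_text: str) -> str:
    # Streaming is used because very long outputs can exceed non-streaming time limits.
    with client.messages.stream(
        model="claude-3-7-sonnet-20250219",                           # assumed model ID
        max_tokens=128000,                                            # target = KV cache budget
        extra_headers={"anthropic-beta": "output-128k-2025-02-19"},   # assumed long-output beta
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following document to fit within 128K tokens. "
                "Remove redundancy but preserve every fact, figure, and procedure:\n\n"
                + raw_text
            ),
        }],
    ) as stream:
        return stream.get_final_message().content[0].text
```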
The workflow:
  1. Claude processes the document to fit within the 128K output window
  2. The optimized document is sent directly to the KV cache creation process
  3. The model's internal state after reading the document is saved as a KV cache
This simple design is only possible because Claude can now produce outputs large enough to match the token budget of the KV cache target.
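As a rough illustration of steps 2-3, here is how the KV cache snapshot could be created with llama-cpp-python. The actual project drives llama.cpp through scripts and n8n nodes, so treat the model path and context size here as placeholders.

```python
# Sketch: ingest the optimized document once and persist the model's KV state.
# Assumes llama-cpp-python; model path and context size are placeholders.
import pickle
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-128k-context-model.gguf",  # hypothetical GGUF model
    n_ctx=131072,                                      # room for the condensed document
)

def create_kv_snapshot(optimized_doc: str, cache_path: str) -> None:
    llm.reset()
    # eval() fills the KV cache with the document tokens without generating anything
    llm.eval(llm.tokenize(optimized_doc.encode("utf-8")))
    state = llm.save_state()                           # snapshot of the KV cache
    with open(cache_path, "wb") as f:
        pickle.dump(state, f)
```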
Streamlined Retrieval
For document querying, I've implemented a direct approach using the CAG bridge component, which loads the pre-computed KV caches. This gives responses in seconds rather than the much longer time needed to reprocess the document for every query.
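Continuing the llama-cpp-python assumption from the sketch above, the query side would restore the snapshot, append the question, and decode, something along these lines:

```python
# Sketch: answer a question against a pre-computed KV snapshot.
def query_with_snapshot(question: str, cache_path: str, max_new_tokens: int = 512) -> str:
    with open(cache_path, "rb") as f:
        llm.load_state(pickle.load(f))     # the document is already "read" at this point

    prompt = f"\n\nQuestion: {question}\nAnswer:"
    llm.eval(llm.tokenize(prompt.encode("utf-8"), add_bos=False))

    # Greedy-ish decode until EOS or the token budget runs out
    out_tokens = []
    for _ in range(max_new_tokens):
        tok = llm.sample()
        if tok == llm.token_eos():
            break
        out_tokens.append(tok)
        llm.eval([tok])
    return llm.detokenize(out_tokens).decode("utf-8", errors="ignore")
```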
While the primary focus is on CAG, the system is designed to work alongside traditional RAG when needed:
  • CAG provides deep understanding of specific documents
  • RAG can be used for broader knowledge when appropriate
  • An intelligent agent can choose the best approach per query (a minimal routing sketch follows below)
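For the agent part, the routing logic can be as simple as a keyword check against the documents that have snapshots, with everything else falling through to RAG. The mapping and the rag_answer() fallback below are hypothetical placeholders:

```python
# Sketch: route a query to CAG when it targets a cached document, else to RAG.
CACHED_DOCS = {
    "handbook": "caches/company_handbook.kv",   # hypothetical snapshot paths
    "manual":   "caches/machine_manual.kv",
}

def rag_answer(question: str) -> str:
    # Placeholder for a vector-store retrieval + generation chain
    raise NotImplementedError("plug in your RAG pipeline here")

def answer(question: str) -> str:
    q = question.lower()
    for keyword, cache_path in CACHED_DOCS.items():
        if keyword in q:
            return query_with_snapshot(question, cache_path)  # CAG path
    return rag_answer(question)                               # broader-knowledge path
```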
The template is not optimised, and there are various ways to take this to production.
Why This Matters
In my opinion, the 128K context window is a game-changer for document processing with small local LLMs. Instead of trying to understand fragmented chunks, a model can comprehend an entire document at once, maintaining awareness across sections and providing more coherent answers.
All the code is available in my GitHub repo with step-by-step setup instructions. There are surely smaller mistakes here and there, so I am still debugging.
Please understand this is exploratory in nature, so there are bound to be mistakes and oversights.
What started as a purely CAG-focused project became more of an exploration and a template for selecting and combining appropriate memory management techniques from the variations that today's technology enables.
I would appreciate any feedback, particularly on the n8n workflows. I have included detailed explanations.
EDIT:
If anyone is interested, I am studying scenarios and applications for this: as a reliable chatbot interface; as a loop that checks RAG output for hallucinations using metadata (local RAG is very much imperfect), a RAG-enhancement system of sorts if you will; or as a standalone solution for, like I said, company handbooks, manuals, and/or very precise and rigid datasets that can be made compact enough to fit into the cache. I understand there are industrial-grade solutions out there, but those are out of reach, and I found no actionable explanations ready for local deployment today.