CAG: A Fresh Approach to LLM Document Memory Using 128K Context Windows
Hey everyone,
I am a complete novice at this, so be gentle.
In this project I explore various memory management possibilities for local AI, focusing on CAG.
Note: I focus on production-ready, tangible results rather than useless buzzwords like AI agents designed to remind you to poop.
I'm sharing my project llama-cag-n8n - a complete implementation of Cache-Augmented Generation (CAG) for document processing using large context window models.
The purpose is to have a reliable LLM that can query a fixed dataset with precision. My personal use case is interacting with company handbooks and manuals to get precise answers to queries.
My work is inspired by the "Don't Do RAG" paper, which discusses CAG.
What is CAG? (I have an explanation on my GitHub.)
The TL;DR:
  • Instead of chunking documents into tiny pieces like traditional RAG, this system lets models process entire documents at once (up to 128K tokens)
  • It creates "memory snapshots" (KV caches) of documents that can be instantly loaded for future queries
  • Much faster responses with deeper document understanding
  • Works offline, Mac-compatible, and integrates easily via n8n
The Document Processing Sauce
The core is how it handles document preprocessing. I'm using Claude 3.7 Sonnet's 200K input window and its ability to output up to 128K tokens to create optimized documents. This sidesteps the need for complex chunking strategies - Claude handles redundancy removal and condensation while preserving all critical information.
It is always possible to replace this with a more involved OCR/chunking workflow, but that is not a priority for me if I can get away with a simpler solution for now.
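To make the preprocessing step concrete, here is a minimal sketch of how the condensation call could look with the Anthropic Python SDK. The model ID, the 128K-output beta header, and the prompt wording are my assumptions for illustration, not the exact code from the repo.

```python
# Sketch: condense a raw document with Claude 3.7 Sonnet so it fits the
# 128K-token budget used for the KV cache. Model ID and beta header are assumed.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def condense_document(raw_text: str) -> str:
    # Streaming is used because very long outputs can exceed non-streaming time limits.
    with client.messages.stream(
        model="claude-3-7-sonnet-20250219",                           # assumed model ID
        max_tokens=128000,                                            # target = KV cache budget
        extra_headers={"anthropic-beta": "output-128k-2025-02-19"},   # assumed long-output beta
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following document to fit within 128K tokens. "
                "Remove redundancy but preserve every fact, figure, and procedure:\n\n"
                + raw_text
            ),
        }],
    ) as stream:
        return stream.get_final_message().content[0].text
```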
The workflow:
  1. Claude processes the document to fit within the 128K output window
  2. The optimized document is sent directly to the KV cache creation process
  3. The model's internal state after reading the document is saved as a KV cache
This simple design is only possible because Claude can now produce outputs large enough to match the token budget of the KV cache target.
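As a rough illustration of steps 2-3, here is how the KV cache snapshot could be created with llama-cpp-python. The actual project drives llama.cpp through scripts and n8n nodes, so treat the model path and context size here as placeholders.

```python
# Sketch: ingest the optimized document once and persist the model's KV state.
# Assumes llama-cpp-python; model path and context size are placeholders.
import pickle
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-128k-context-model.gguf",  # hypothetical GGUF model
    n_ctx=131072,                                      # room for the condensed document
)

def create_kv_snapshot(optimized_doc: str, cache_path: str) -> None:
    llm.reset()
    # eval() fills the KV cache with the document tokens without generating anything
    llm.eval(llm.tokenize(optimized_doc.encode("utf-8")))
    state = llm.save_state()                           # snapshot of the KV cache
    with open(cache_path, "wb") as f:
        pickle.dump(state, f)
```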
Streamlined Retrieval
For document querying, I've implemented a direct approach using the CAG bridge component, which loads the pre-computed KV caches. This gives responses in seconds rather than the much longer time needed to reprocess the document for every query.
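Continuing the llama-cpp-python assumption from the sketch above, the query side would restore the snapshot, append the question, and decode, something along these lines:

```python
# Sketch: answer a question against a pre-computed KV snapshot.
def query_with_snapshot(question: str, cache_path: str, max_new_tokens: int = 512) -> str:
    with open(cache_path, "rb") as f:
        llm.load_state(pickle.load(f))     # the document is already "read" at this point

    prompt = f"\n\nQuestion: {question}\nAnswer:"
    llm.eval(llm.tokenize(prompt.encode("utf-8"), add_bos=False))

    # Greedy-ish decode until EOS or the token budget runs out
    out_tokens = []
    for _ in range(max_new_tokens):
        tok = llm.sample()
        if tok == llm.token_eos():
            break
        out_tokens.append(tok)
        llm.eval([tok])
    return llm.detokenize(out_tokens).decode("utf-8", errors="ignore")
```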
While the primary focus is on CAG, the system is designed to work alongside traditional RAG when needed:
  • CAG provides deep understanding of specific documents
  • RAG can be used for broader knowledge when appropriate
  • An intelligent agent can choose the best approach per query (a minimal routing sketch follows below)
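For the agent part, the routing logic can be as simple as a keyword check against the documents that have snapshots, with everything else falling through to RAG. The mapping and the rag_answer() fallback below are hypothetical placeholders:

```python
# Sketch: route a query to CAG when it targets a cached document, else to RAG.
CACHED_DOCS = {
    "handbook": "caches/company_handbook.kv",   # hypothetical snapshot paths
    "manual":   "caches/machine_manual.kv",
}

def rag_answer(question: str) -> str:
    # Placeholder for a vector-store retrieval + generation chain
    raise NotImplementedError("plug in your RAG pipeline here")

def answer(question: str) -> str:
    q = question.lower()
    for keyword, cache_path in CACHED_DOCS.items():
        if keyword in q:
            return query_with_snapshot(question, cache_path)  # CAG path
    return rag_answer(question)                               # broader-knowledge path
```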
The template is not optimised, and there are various ways to take this to production.
Why This Matters
In my opinion, the 128K context window is a game-changer for document processing with small local LLMs. Instead of trying to understand fragmented chunks, a model can comprehend an entire document at once, maintaining awareness across sections and providing more coherent answers.
All the code is available in my GitHub repo with step-by-step setup instructions. There are surely smaller mistakes here and there, so I am still debugging.
Please understand this is exploratory in nature, so there are bound to be mistakes and oversights.
What started as a purely CAG-focused project became more of an exploration and a template for selecting and combining appropriate memory management techniques from the variations that today's technology enables.
I would appreciate any feedback, particularly on the n8n workflows. I have included detailed explanations.
EDIT:
If anyone is interested, I am studying scenarios and applications for this: as a reliable chatbot interface; as a loop that checks RAG output for hallucinations using metadata (local RAG is very much imperfect), a RAG-enhancement system of sorts if you will; or as a standalone solution for, like I said, company handbooks, manuals, and/or very precise and rigid datasets that can be made compact enough to fit into the cache. I understand there are industrial-grade solutions out there, but those are out of reach, and I found no actionable explanations ready for local deployment today.