Feature Release: Chat History Token Optimization
So, when using your own OpenAI key (and even for us as a business), you notice that with an agent stack (tools, prompt, conversation history, RAG, etc.) tokens start to stack up quick, especially if you have a really involved process. We implemented a token optimization layer that runs before our chat completions to make sure you get the cost savings, and I'll share some data at the end :)

First, we are now truncating and summarizing conversation history. We noticed large chat completions coming through with 300-400+ message histories. That gets expensive over time if it's a lead you've been working or following up with for a while, so we reduce the message count and summarize the older history. The intelligence stays the same, but token consumption goes way down (a 98% decrease on larger runs). There's a minimal sketch of the idea at the end of this post.

Second, we are truncating large tool call outputs within the window that are not relevant to the current task. Meaning, if there are tool calls with large outputs (like get_availability) that aren't relevant to the task at hand, we truncate the response so the agent still sees that the action happened, but the context is much shorter. This saw a huge reduction in token consumption as well (a 96% decrease on larger runs). There's a sketch of this below too.

Here is the before and after. This is the exact same conversation history, assistant ID, tools, custom fields, knowledge base, etc., and the output was the exact same message, but look at the speed and cost difference:

Differences:
- 35 seconds faster
- 95.95% cheaper

----

Before:

    "error_type": null,
    "usage_cost": {
      "notes": null,
      "tokens": {
        "output": 211,
        "input_total": 175948,
        "input_cached": 0,
        "input_noncached": 175948
      },
      "total_cost": 0.353584,
      "model_normalized": "gpt-4o",
      "models_encountered": ["gpt-4o"],
      "price_used_per_million": {
        "input": 2.5,
        "cached_input": 1.25,
        "output": 10
      }
    },
    "error_message": null,
    "run_time_seconds": 32.692,
    "returned_an_error": false

After:

    "run_time_seconds": 2.618,
    "returned_an_error": false
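For anyone curious how the history compression works conceptually, here is a minimal sketch in Python against the OpenAI chat completions API. The names and numbers in it (`KEEP_RECENT`, `summarize_history`, using gpt-4o-mini for the summarization pass) are illustrative assumptions, not the exact production implementation:

```python
# Minimal sketch: compress an oversized chat history before a completion.
# KEEP_RECENT and the summarizer model are illustrative, not production values.
from openai import OpenAI

client = OpenAI()
KEEP_RECENT = 20  # most recent messages kept verbatim

def summarize_history(messages: list[dict]) -> list[dict]:
    """Collapse everything except the newest turns into one summary message."""
    if len(messages) <= KEEP_RECENT:
        return messages
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    transcript = "\n".join(f"{m['role']}: {m.get('content') or ''}" for m in older)
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model for the summarization pass
        messages=[{
            "role": "user",
            "content": "Summarize this conversation, preserving names, dates, "
                       "and any commitments made:\n" + transcript,
        }],
    ).choices[0].message.content
    # One short system message replaces potentially hundreds of old turns.
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```

In practice you would cache the summary so the summarization pass runs once per conversation, not on every completion.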
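And here's the same kind of sketch for the tool output truncation. The relevance rule here (anything before the latest user turn counts as stale) and the `MAX_TOOL_CHARS` threshold are assumptions for illustration; the point is that the agent still sees the call happened, just without the full payload:

```python
# Minimal sketch: stub out large, stale tool outputs while keeping the record
# that the call happened. The threshold and the "stale = before the latest
# user turn" relevance rule are illustrative assumptions.
MAX_TOOL_CHARS = 500

def truncate_tool_outputs(messages: list[dict]) -> list[dict]:
    # Tool outputs after the most recent user message are presumed relevant
    # to the current task and left intact.
    last_user = max((i for i, m in enumerate(messages) if m["role"] == "user"),
                    default=-1)
    trimmed = []
    for i, m in enumerate(messages):
        content = m.get("content") or ""
        if m["role"] == "tool" and i < last_user and len(content) > MAX_TOOL_CHARS:
            m = {**m, "content": content[:MAX_TOOL_CHARS] +
                 "...[output truncated: not needed for the current task]"}
        trimmed.append(m)
    return trimmed
```

So a get_availability call that returned thousands of tokens of slot data earlier in the conversation shrinks to a short stub, while the agent can still see that it ran.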