How I measure my Session Health
my Live Health Stream is a real-time telemetry dashboard designed to monitor the "cognitive load" and operational stability of autonomous AI agents across different sessions and models (e.g., Gemini 3.1 Pro, Claude Sonnet 4.6). I am using both AntiGravity and Claude Code. They write to my obsidian vault, and my dashboard is a HTML/CSS version of that data along with python scripts that aggregate my json data for metrics I want to measure. Tool Error Rate % (Red Line) - What it is: The frequency at which the agent is failing to execute its tools properly (e.g., syntax errors, bad file paths). - How it’s calculated: (Failed Tool Calls / Total Tool Calls in the session) * 100. Notice on the graph how the red line spikes almost immediately after the green saturation line peaks. Context Saturation % (Green Line) - What it is: A measure of how "full" the agent's brain is getting. - How it’s calculated: (Total Input Tokens / Model's Maximum Context Window) * 100. If this hits 100%, the agent starts forgetting earlier instructions (context eviction). Lessons Learned (Yellow Bars) - What it is: The number of times an agent gets caught in a recursive failure loop (trying the exact same failed action over and over). - How it’s calculated: Programmatically tracked by analyzing the execution log for identical, sequential failed tool calls within a single session. I have a "Lessons Learned" log that I have my agents read on boot, and then if they **** up they have to record a lesson there, and update the .json file for that session where i build this graph from. Input/Output Tokens (Blue/Purple bars) Since this is not exposed I have to do an estimate: "input_tokens": input_bytes // 4, "output_tokens": output_bytes // 4, One of my core philosophies is No Vanity Metrics, and another is Things that are Measured Improve, and Ultimately, The Most Important Things Cannot Be Measured.