Looking for advice on a memory benchmark for AI coding agents

Hi everyone, I played with Claude Code and created memory software for AI coding agents (I don't know anything about code or software development, but something came out of the oven). Does anybody know if there is a good benchmark to validate it? Claude suggested LongMemEval and LoCoMo as benchmarks. We ran LongMemEval and got somewhere around 80-90% on it, depending on how you read the results (does "three" = "3" count as a valid match!? Idk). But after running for 8 hours and spending around $100 on API calls, Claude realized it's the wrong benchmark for the project and couldn't recommend an existing one. HELP🤷‍♂️
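To show why the score can swing depending on whether "three" = "3" counts, here is a minimal sketch that scores the same answers two ways: strict string equality versus a lenient comparison after normalization. This is only a toy illustration and not how LongMemEval or LoCoMo score answers; the names `normalize`, `accuracy`, `NUMBER_WORDS` and the sample data are all made up for the example.

```python
import re

# Hypothetical, deliberately tiny mapping; a real scorer would need a fuller list.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and map number words to digits."""
    tokens = re.findall(r"[a-z0-9]+", answer.lower())
    return " ".join(NUMBER_WORDS.get(token, token) for token in tokens)


def accuracy(predictions, references, lenient):
    """Fraction of predictions matching their reference answer."""
    matches = 0
    for pred, ref in zip(predictions, references):
        if lenient:
            matches += normalize(pred) == normalize(ref)
        else:
            matches += pred.strip() == ref.strip()
    return matches / len(references)


# Made-up sample data, for illustration only.
predictions = ["three", "Paris", "blue", "7"]
references = ["3", "paris", "green", "7"]

strict_score = accuracy(predictions, references, lenient=False)
lenient_score = accuracy(predictions, references, lenient=True)
print(f"strict: {strict_score:.2f}, lenient: {lenient_score:.2f}")
```

On this made-up data only the last answer matches exactly, while the lenient comparison also accepts "three" for "3" and "Paris" for "paris". Reporting both numbers side by side makes it clear how much of the score depends on the matching rule.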
