First things first: this is an open-source project. No funnel, no sales pitch, nothing to buy. Just to make that clear up front.
Every AI agent framework I've seen does the same thing: take a screenshot, send it to a vision model, guess where to click. It works — sometimes. The best benchmarks show 42-72% success rates and 10-20 minutes per task.
I asked a different question: why is AI looking at screens at all?
Every Windows application already describes itself as structured text through the Accessibility Layer (built for screen readers since 1997). Every button says its name. Every field shows its value. Structured. Queryable. Free.
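To make that concrete, here's a minimal sketch of reading that tree, assuming the pywinauto library with its UI Automation backend (the window title "Calculator" is just an example; any UIA-enabled app works):

```python
# List every button a window exposes through the accessibility tree --
# no screenshots, just structured text.
from pywinauto import Desktop

window = Desktop(backend="uia").window(title="Calculator")  # example target
for button in window.descendants(control_type="Button"):
    print(button.window_text())
```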
So I built a small tool that reads this tree and stores it in a local database. Instead of "here's a screenshot, guess where Save is", the AI asks "which buttons exist?" and gets a text answer in milliseconds.
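A rough sketch of that "tree into a local database" idea, again assuming pywinauto plus SQLite; the table and column names here are made up for illustration, not the tool's actual schema:

```python
# Dump the accessibility elements into SQLite so an LLM (or anything else)
# can ask "which buttons exist?" as a plain text query.
import sqlite3
from pywinauto import Desktop

conn = sqlite3.connect("ui_tree.db")
conn.execute("""CREATE TABLE IF NOT EXISTS elements (
    control_type TEXT, name TEXT, enabled INTEGER)""")

window = Desktop(backend="uia").window(title="Calculator")  # example target
for el in window.descendants():
    conn.execute(
        "INSERT INTO elements VALUES (?, ?, ?)",
        (el.element_info.control_type, el.window_text(), int(el.is_enabled())),
    )
conn.commit()

# "Which buttons exist?" becomes a millisecond lookup instead of a screenshot.
for (name,) in conn.execute(
        "SELECT name FROM elements WHERE control_type = 'Button'"):
    print(name)
```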
Results on day 1:
- 360 Google Sheets cells filled in 90 seconds
- Token cost per step: 50-200 instead of 2,000-5,000
- Works with any LLM — including local models
- Works with any Windows app, not just browsers
Curious what you think — is this useful for your automation workflows, or is browser-only good enough for most use cases?