How to improve output accuracy on analytical tasks with STAR analysis
Thought this may be insightful to those running analytical tasks:

"The car wash problem asks a simple question: 'I want to wash my car. The car wash is 100 meters away. Should I walk or drive?' Every major LLM tested (Claude, GPT-4, Gemini) recommended walking. The correct answer is to drive, because the car itself must be at the car wash.

We ran a variable isolation study to determine which layer of the prompt architecture resolves this failure. Six conditions were tested, 20 trials each, on Claude Sonnet 4.5:

- Bare prompt, no system instructions: 0%
- Role definition alone: 0%
- STAR reasoning framework (Situation, Task, Action, Result): 85%
- User profile injection with physical context (car model, location, parking status): 30%
- STAR combined with profile injection: 95%
- Full stack combining all layers, including RAG: 100%

The central finding is that structured reasoning outperformed direct context injection by a factor of 2.83× (Fisher's exact test, p = 0.001). STAR forces the model to articulate the task goal before generating a conclusion, which surfaces the implicit physical constraint that context injection leaves buried. The addition of a sixth condition resolved a confound in the original five-condition design by isolating per-layer contributions: STAR accounts for +85pp, profile injection adds +10pp, and RAG provides the final +5pp to reach perfect reliability."

Read the whole paper below...
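Before diving in, here is a minimal sketch of what the STAR and profile-injection layers might look like in practice. The excerpt doesn't reproduce the study's actual prompts, so the instruction wording, the `USER_PROFILE` details, and the `build_messages` helper below are all my own assumptions, not the paper's:

```python
# Minimal sketch of the prompt layers described above. The study's exact
# wording is not shown in the excerpt, so every string here is an assumption.

# STAR layer: force the model to articulate the goal before concluding.
STAR_SYSTEM = """Before answering, reason through four explicit steps:
Situation: restate the facts of the scenario, including physical constraints.
Task: state the user's actual goal, not just the surface question.
Action: list the candidate actions and check each against the constraints.
Result: state which action achieves the goal, and why.
Only then give your recommendation."""

# Profile-injection layer: physical context about the user (assumed details).
USER_PROFILE = (
    "User context: owns a 2019 Honda Civic, currently parked at home; "
    "the nearest car wash is 100 meters away."
)

QUESTION = ("I want to wash my car. The car wash is 100 meters away. "
            "Should I walk or drive?")

def build_messages(use_star: bool, use_profile: bool) -> list[dict]:
    """Assemble the message stack for one experimental condition."""
    system_parts = []
    if use_star:
        system_parts.append(STAR_SYSTEM)
    if use_profile:
        system_parts.append(USER_PROFILE)
    messages = []
    if system_parts:
        messages.append({"role": "system", "content": "\n\n".join(system_parts)})
    messages.append({"role": "user", "content": QUESTION})
    return messages

if __name__ == "__main__":
    # Compare the bare-prompt condition (0% in the study) with
    # STAR + profile injection (95% in the study).
    for use_star, use_profile in [(False, False), (True, True)]:
        print(f"--- STAR={use_star}, profile={use_profile} ---")
        for m in build_messages(use_star, use_profile):
            print(f"[{m['role']}] {m['content']}\n")
```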
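On the statistics: the 2.83× figure is just the ratio of the two success rates (0.85 / 0.30 ≈ 2.83). Assuming the 85% and 30% conditions correspond to 17/20 and 6/20 successes (inferred from the percentages and trial counts, not stated as raw counts in the excerpt), the reported significance can be checked with SciPy:

```python
from scipy.stats import fisher_exact

# Contingency table inferred from the reported rates over 20 trials each:
# STAR alone: 17 successes, 3 failures; profile alone: 6 successes, 14 failures.
table = [[17, 3],
         [6, 14]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"rate ratio = {(17 / 20) / (6 / 20):.2f}")  # ~2.83, the headline figure
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4g}")
```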