Thought this may be insightful to those running analytical tasks:
"The car wash problem asks a simple question: “I want to wash my car. The car wash is 100
meters away. Should I walk or drive?” Every major LLM tested—Claude, GPT-4, Gemini—
recommended walking. The correct answer is to drive, because the car itself must be at the car
wash.
We ran a variable isolation study to determine which prompt architectural layer resolves
this failure. Six conditions were tested, 20 trials each, on Claude Sonnet 4.5. A bare prompt
with no system instructions scored 0%. Adding a role definition alone also scored 0%. A STAR
reasoning framework (Situation, Task, Action, Result) reached 85%. User profile injection with
physical context—car model, location, parking status—reached only 30%. STAR combined with
profile injection reached 95%. The full stack combining all layers scored 100%.
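As a quick sanity check on the numbers quoted above, the per-condition accuracies and the lift reported in the next paragraph can be reproduced in a few lines (the condition names here are my shorthand, not the paper's):

```python
# Per-condition accuracy from the quoted excerpt (20 trials each).
conditions = {
    "bare prompt": 0.00,
    "role definition only": 0.00,
    "STAR framework": 0.85,
    "profile injection": 0.30,
    "STAR + profile": 0.95,
    "full stack": 1.00,
}

# Lift of structured reasoning (STAR) over direct context injection (profile).
lift = conditions["STAR framework"] / conditions["profile injection"]
print(f"{lift:.2f}x")  # 0.85 / 0.30 rounds to 2.83x
```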
The central finding is that structured reasoning outperformed direct context injection by a factor of 2.83 (85% vs. 30%; Fisher's exact test, p = 0.001). STAR forces the model to articulate the task
goal before generating a conclusion, which surfaces the implicit physical constraint that context
injection leaves buried. The addition of a sixth condition resolved a confound in the original
five-condition design by isolating per-layer contributions: STAR accounts for +85pp, profile
adds +10pp, and RAG provides the final +5pp to reach perfect reliability."
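The Fisher's exact figure can be checked from the raw counts implied by the excerpt (17/20 correct for STAR vs. 6/20 for profile injection) with a small stdlib-only implementation. This is my sketch, not code from the paper:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probability of every table with the same
    margins whose probability is <= that of the observed table.
    """
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    # Integer numerators keep the "as extreme" comparison exact.
    def num(x):
        return comb(r1, x) * comb(r2, c1 - x)

    lo, hi = max(0, c1 - r2), min(r1, c1)
    obs = num(a)
    total = sum(num(x) for x in range(lo, hi + 1) if num(x) <= obs)
    return total / comb(n, c1)

# STAR: 17/20 correct; profile injection: 6/20 correct.
p = fisher_exact_two_sided(17, 3, 6, 14)
print(f"p = {p:.4f}")  # well below 0.01, consistent with the reported p = 0.001
```

The same counts fed to `scipy.stats.fisher_exact` should agree; the hand-rolled version is shown only so the check runs with no dependencies.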
Read the whole paper below...