(Live build in the Hidden State Drift Mastermind)
Poetiq has achieved state-of-the-art (SOTA) performance on ARC-AGI-2 with 54% accuracy at $30.57 per problem, breaking the 50% barrier on the semi-private set for the first time and exceeding the average human baseline (~60%) on the public eval set. This represents a 9-point improvement over the previous SOTA (45% by Gemini 3 Deep Think) at less than half the cost ($77.16 → $30.57).
Key Achievement Date: December 5, 2025 (officially verified by ARC Prize)
1. THE CORE INNOVATION: THE META-SYSTEM
What It Is
Poetiq's breakthrough is NOT a new foundation model. Instead, it's a meta-system that orchestrates existing frontier LLMs through:
- Intelligent Multi-Agent Coordination - Multiple LLM "experts" that propose solutions, evaluate feedback, and self-audit
- Test-Time Compute - Iterative reasoning and self-verification at inference time (not training time)
- Adaptive Problem-Solving - Automatically selects which models, prompting strategies, and approaches (including code generation) to use for each specific problem
- Cost Optimization - Achieves efficiency through intelligent early stopping and resource allocation
Fundamental Design Principles
"The prompt is an interface, not the intelligence"
- Doesn't ask a single question; uses iterative loops
- LLM generates proposed solution → receives feedback → analyzes → refines → repeats
- Multi-step self-improving process builds and perfects answers incrementally
Self-Auditing
- System autonomously decides when it has sufficient information
- Monitors its own progress and terminates when solution is satisfactory
- Minimizes wasteful computation
Why This Works for ARC-AGI-2
ARC-AGI-2 tests:
- Abstract pattern recognition - "figure out the rule from 3 examples"
- Fluid intelligence - NOT knowledge-based, requires true generalization
- Spatial reasoning - Complex visual pattern relationships
The core problem: raw frontier models score below the human baseline because their stochasticity makes knowledge extraction unreliable. Poetiq's meta-system systematizes knowledge extraction for complex reasoning.
2. TECHNICAL ARCHITECTURE
The Reasoning Loop
Input Problem
↓
Generate Proposed Solution (often via code generation)
↓
Test Against Examples
↓
Receive Feedback/Error Output
↓
Analyze Failure Patterns
↓
Refine Hypothesis & Code
↓
Repeat Until Convergence OR Termination
↓
Output Final Solution
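A minimal sketch of this loop in Python; `call_llm`, the prompt text, and the `transform(grid)` convention are illustrative assumptions, not Poetiq's released code:

```python
# Sketch of the propose -> test -> refine loop (illustrative, not Poetiq's code).
# `call_llm` is any function taking a prompt string and returning Python source.

def solve_task(task, call_llm, max_iters=5):
    feedback = ""
    for _ in range(max_iters):
        prompt = build_prompt(task["train"], feedback)
        code = call_llm(prompt)                 # propose a rule as executable code
        ok, feedback = run_candidate(code, task["train"])
        if ok:                                  # self-audit: all training pairs pass
            return code
    return None                                 # budget exhausted without convergence

def build_prompt(examples, feedback):
    head = "Write a Python function transform(grid) implementing the rule.\n"
    shots = "\n".join(f"IN: {ex['input']}\nOUT: {ex['output']}" for ex in examples)
    tail = f"\nYour last attempt failed:\n{feedback}" if feedback else ""
    return head + shots + tail

def run_candidate(code, examples):
    try:
        ns = {}
        exec(code, ns)                          # unsafe shortcut; sandboxed variant below
        for ex in examples:
            got = ns["transform"](ex["input"])
            if got != ex["output"]:
                return False, f"expected {ex['output']}, got {got}"
        return True, ""
    except Exception as e:
        return False, repr(e)
```

The key design point is that feedback from failed executions flows into the next prompt, so each iteration conditions on concrete evidence instead of restarting from scratch.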
Multi-Model Coordination
The meta-system works across multiple LLMs:
- Gemini 3 (a, b, c variants) - High accuracy variants with different compute budgets
- GPT-5.1 - Latest OpenAI flagship
- GPT-OSS-120B - Open-weights model (sub-1-cent cost per problem)
- Claude Opus 4.5 - Anthropic's flagship
- Grok 4 Fast Reasoning - xAI's reasoning model
- 12+ model families tested - GPT, Claude, Gemini, Grok, GPT-OSS
Key insight: The same meta-system adaptation works across model families and sizes with minimal modification—demonstrating robust generalization.
Code Generation as Core Capability
The system doesn't just use language—it writes and executes Python code:
- Proposes a rule as executable code
- Runs the code against provided examples
- Detects failures via error feedback
- Debugs and suggests new rules iteratively
- Tests revised solutions in a self-contained loop
This is crucial for ARC-AGI because abstract reasoning benefits from formal, testable logic.
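Because the model's code is untrusted, a practical harness executes it in a separate process with a timeout, so crashes and infinite loops surface as feedback instead of hanging the solver. A sketch under our own assumptions (file layout, timeout, and verdict format are not from Poetiq):

```python
import json
import subprocess
import sys
import tempfile

# Child-process script: load candidate code and examples, print a JSON verdict.
RUNNER = """
import json, sys
ns = {}
exec(open(sys.argv[1]).read(), ns)
for ex in json.load(open(sys.argv[2])):
    if ns["transform"](ex["input"]) != ex["output"]:
        print(json.dumps({"ok": False, "msg": "wrong output on a training pair"}))
        sys.exit(0)
print(json.dumps({"ok": True, "msg": ""}))
"""

def run_sandboxed(code: str, examples: list, timeout_s: float = 5.0):
    paths = {}
    for key, text, suffix in [("code", code, ".py"),
                              ("examples", json.dumps(examples), ".json"),
                              ("runner", RUNNER, ".py")]:
        with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
            f.write(text)
            paths[key] = f.name
    try:
        res = subprocess.run(
            [sys.executable, paths["runner"], paths["code"], paths["examples"]],
            capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    if res.returncode != 0:
        return False, res.stderr[-500:]      # traceback tail becomes feedback
    if not res.stdout.strip():
        return False, "no verdict produced"
    # Candidate code may print; the verdict is always the runner's last line.
    verdict = json.loads(res.stdout.strip().splitlines()[-1])
    return verdict["ok"], verdict["msg"]
```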
Efficiency Mechanism
Averages fewer than 2 attempts per problem (ARC-AGI scoring allows up to 2 answer attempts per task)
This efficiency means:
- Fewer API calls = lower cost
- Single, well-crafted solution
- Intelligent early termination
3. PERFORMANCE METRICS (VERIFIED)
ARC-AGI-2 Semi-Private (Official Verification - Dec 5, 2025)
System | Accuracy | Cost/Problem | Status
--- | --- | --- | ---
Poetiq (Gemini 3) | 54% | $30.57 | New SOTA
Gemini 3 Deep Think | 45% | $77.16 | Previous SOTA
Claude Opus 4.5 | ~50% | ~$60/task | Cost-inefficient
ARC-AGI-2 Public Eval Set
Poetiq systems establish an entirely new Pareto frontier (better performance at every cost level):
- Exceeds human average (60%) on public set
- Multiple configurations for different cost targets
- Gemini-3-b: Near-saturation on ARC-AGI-1; continued improvement on ARC-AGI-2
Cross-Model Performance
Same meta-system applied to 12+ models across families:
- All show improvement in accuracy
- All show reduction in cost vs baseline
- Generalization confirmed across:
  - Different model architectures
  - Different sizes (120B to frontier scale)
  - Open and closed weights
Notable Low-Cost Results
- GPT-OSS-120B (open weights): <1 cent per problem with competitive accuracy
- Grok 4 Fast: both cheaper and more accurate than the model's reported baseline numbers
- Demonstrates that better reasoning ≠ bigger models
4. HOW TO ACCESS & IMPLEMENT
Option 1: Use the Official Open-Source Code
What's Included:
- Complete solver framework for ARC-AGI problems
- Python-based implementation
- Model integration points for multiple LLMs
- Example configurations for different cost/performance targets
Limitations (Important):
- GitHub contains the framework/wrapper code for ARC-AGI specifically
- Does NOT include the proprietary meta-system optimization engine
- You get the scaffolding to run tests, but not the core "how to automatically select approaches" logic
- This is intentionally limited to show reproducibility without giving away core IP
Setup Requirements:
- Python 3.9+
- API access to frontier models (Gemini 3, GPT-5.1, Claude, Grok)
- API keys configured in environment
- ARC-AGI problem files (available from arcprize.org)
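A quick pre-flight check that keys are configured before spending money; the environment-variable names below are common provider conventions and may not match the repo's exact configuration:

```python
import os

# Conventional env-var names for these providers; the repo's config may differ.
REQUIRED = {
    "GOOGLE_API_KEY": "Gemini 3",
    "OPENAI_API_KEY": "GPT-5.1",
    "ANTHROPIC_API_KEY": "Claude",
    "XAI_API_KEY": "Grok",
}

missing = [f"{key} ({model})" for key, model in REQUIRED.items()
           if not os.environ.get(key)]
if missing:
    raise SystemExit("Missing API keys: " + ", ".join(missing))
```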
Option 2: Partner with Poetiq Directly
What They Offer:
- Custom implementation of their full meta-system for your specific problems
- Integration into existing larger systems (mentioned as capability)
- Early partner status for deploying their technology
- Consulting on task-specific optimization
Target Use Cases:
- Organizations with complex reasoning tasks
- Enterprise AI systems needing cost-efficient reasoning
- R&D teams extending their capabilities
Option 3: Leverage Public Results
Poetiq's ARC-AGI-2 results on the public evaluation set are reproducible:
- Download the public ARC-AGI-2 problem set from arcprize.org
- Use the open-source Poetiq code with your choice of LLM API
- Adapt the prompt strategies and iterative loops for your use case
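For step 1, a loader sketch assuming the per-task JSON layout used by the public ARC repos (one file per task, with "train" and "test" lists of input/output grids):

```python
import json
from pathlib import Path

def load_tasks(eval_dir: str) -> dict:
    """Map task id -> task dict with 'train' and 'test' example lists."""
    return {p.stem: json.loads(p.read_text())
            for p in sorted(Path(eval_dir).glob("*.json"))}

tasks = load_tasks("data/evaluation")   # adjust to wherever you unpacked the set
first = next(iter(tasks.values()))
print(len(first["train"]), "training pairs;", len(first["test"]), "test inputs")
```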
5. IMPLEMENTATION CONSIDERATIONS FOR RESEARCHERS
Prerequisites
- LLM API Access (minimum 1, ideally 2+):
  - Gemini 3 Pro API (via Google AI/Vertex AI)
  - GPT-5.1 API (via OpenAI)
  - Claude API (via Anthropic)
- Compute Budget:
  - Per-problem cost ranges from <$0.01 to $30+
  - Plan for thousands of API calls during development
  - Optimize iteratively
- Development Environment:
  - Python 3.9+ recommended
  - Jupyter for experimentation
  - Version control (GitHub)
Architecture You'll Need
- Problem Parser - Convert ARC-AGI input format to internal representation
- Prompt Engine - Generate context-aware prompts for each problem
- Code Generator - LLM calls that output Python code solutions
- Test Harness - Run generated code against examples, capture failures
- Feedback Analyzer - Parse error messages and determine next steps
- Termination Logic - Decide when solution is good enough
- Model Router - Select which LLM to use for each task (or multi-model voting)
- Cost Tracking - Monitor API spend and adjust strategies
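One plausible way to wire these components together; every interface and name below is our own sketch, since Poetiq's actual architecture is not public:

```python
from dataclasses import dataclass

@dataclass
class SolverConfig:
    max_iters: int = 5          # termination logic: hard cap on refinement rounds
    budget_usd: float = 1.00    # cost tracking: per-problem spend ceiling

class Solver:
    """Glue object over the components listed above (interfaces assumed)."""
    def __init__(self, router, prompt_engine, harness, cfg: SolverConfig):
        self.router = router            # model router: picks an LLM per call
        self.prompts = prompt_engine    # prompt engine: context-aware prompts
        self.harness = harness          # test harness + feedback analyzer
        self.cfg = cfg

    def solve(self, task):
        spent, feedback = 0.0, ""
        for _ in range(self.cfg.max_iters):
            model = self.router.pick(task, feedback)
            prompt = self.prompts.build(task, feedback)
            code, usd = model.complete(prompt)         # code generator step
            spent += usd
            ok, feedback = self.harness.run(code, task)
            if ok:
                return code                            # termination: good enough
            if spent >= self.cfg.budget_usd:
                break                                  # termination: over budget
        return None
```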
Prompt Strategy Elements
Based on Poetiq's approach:
System Prompt:
- You are an expert at solving visual pattern recognition problems
- Generate Python code that defines the transformation rule
- Your code will be tested against examples
- If it fails, fix it based on error feedback
Few-shot Examples:
- 2-3 ARC problems with solutions
Chain-of-Thought:
- Describe what you observe in the examples
- Propose a hypothesis
- Write code to test it
- Explain expected output vs input
Error Handling:
- If code fails, analyze error
- Suggest alternative rules
- Try again (iteration loop)
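A sketch assembling those four elements into a single prompt builder; the template wording is illustrative, since Poetiq's exact templates have not been disclosed:

```python
# Prompt assembly from the elements above (wording is illustrative).
SYSTEM = (
    "You are an expert at solving visual pattern recognition problems.\n"
    "Generate Python code defining transform(grid); it will be tested against examples.\n"
    "If it fails, fix it based on the error feedback."
)

def build_prompt(task, feedback="", few_shot=()):
    parts = [SYSTEM]
    for shot in few_shot:                      # 2-3 solved ARC problems
        parts.append(f"Solved example:\n{shot}")
    parts.append("Describe what you observe, propose a hypothesis, "
                 "then write code to test it.")          # chain-of-thought element
    for ex in task["train"]:
        parts.append(f"IN: {ex['input']}\nOUT: {ex['output']}")
    if feedback:                               # error-handling element of the loop
        parts.append(f"Your previous code failed:\n{feedback}\nSuggest a new rule.")
    return "\n\n".join(parts)
```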
Cost Optimization Strategies
- Model Selection by Difficulty (see the sketch after this list)
  - Simple problems → GPT-OSS-120B (~$0.001)
  - Medium problems → Grok 4 Fast (~$0.005)
  - Hard problems → Gemini 3 / GPT-5.1 (~$0.01-0.10)
- Early Termination
  - If solution passes all examples on first try → stop
  - If code fails multiple times → escalate to a better model
- Batch Processing
  - Process similar-difficulty problems with the same model
  - Reuse successful prompts across related problems
- Single-Attempt Focus
  - Aim for <1.5 API calls per problem on average
  - Better planning upfront = fewer retries
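A sketch combining difficulty-based selection with early termination and escalation; the model names and rough per-problem costs come from the list above, while the ladder order and retry threshold are assumptions:

```python
# Difficulty ladder: cheapest model first, escalate on repeated failure.
LADDER = [
    ("gpt-oss-120b", 0.001),    # simple problems
    ("grok-4-fast",  0.005),    # medium problems
    ("gemini-3-pro", 0.05),     # hard problems
]

def solve_with_escalation(task, solve_fn, max_fails_per_tier=2):
    """solve_fn(task, model_name) -> solution or None (hypothetical interface)."""
    for model, est_cost in LADDER:
        for _ in range(max_fails_per_tier):
            result = solve_fn(task, model)
            if result is not None:           # early termination on first success
                return result, model, est_cost
        # repeated failure at this tier -> escalate to a stronger model
    return None, None, None
```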
6. CRITICAL CAVEATS & LIMITATIONS
Public vs Semi-Private Performance Drop
The headline 54% accuracy is on the Semi-Private eval (officially verified). However:
- ARC-AGI-1 shows significant drops from public to semi-private (varies by model)
- ARC-AGI-2 semi-private and public are more closely calibrated (smaller drops)
- When evaluating your implementation, expect 1-5% variance between test sets
What Hasn't Been Released
- The meta-system's optimization logic - How it decides which approach to use
- Exact prompt templates - Only general principles disclosed
- Model selection heuristics - The "router" that picks between models
- Training data for adaptation - They adapted on open-source models only, then transferred to frontier models
Generalization to Your Use Case
Poetiq states the system generalizes to "over a dozen different benchmarks" and "various reasoning and retrieval tasks," but:
- Detailed results on non-ARC tasks NOT yet published
- Different domains may require different prompt architectures
- Your specific task may need substantial customization
7. COMPETITIVE LANDSCAPE
Who Else Is Competing
- Gemini 3 Deep Think - Raw frontier model (45%, $77.16)
- Claude Opus 4.5 - Alternative frontier model
- Grok 4 Fast Reasoning - xAI's reasoning approach
- o1-pro (OpenAI) - Estimated SOTA on the public set, based on scoring
Why Poetiq's Approach Is Different
- Doesn't fine-tune or update the models themselves
- Works with whatever frontier model is released
- Cost-efficient through orchestration, not scale
- Demonstrated to transfer across model families
8. BUSINESS MODEL & AVAILABILITY
Current Status (December 2025)
- Open Source: Limited (solver framework)
- Commercial: Available through partnership
- Team: 6 researchers/engineers, 53 years combined Google DeepMind experience
Paths to Implementation
- DIY with open-source - Free, requires deep expertise
- Partner directly - Contact poetiq@poetiq.ai for enterprise deployment
- Replicate from papers - Wait for research papers (expected in 2026)
9. WHAT THIS MEANS FOR THE AI RESEARCH COMMUNITY
Key Insights
- Intelligence is extractable, not just scale - Better reasoning through orchestration vs bigger models
- Test-time compute matters - Iterative verification beats single-attempt generation
- LLM-agnostic systems are possible - Same logic works across model families
- Efficiency frontier is real - Can achieve better results for lower cost through smarter design
Implications for 2025-2026
- ARC-AGI-3 launching early 2026 (interactive reasoning, different format)
- Grand Prize remains unclaimed (>90% accuracy still possible)
- Benchmark-specific optimization vs general reasoning capability debate ongoing
- Frontier models continue releasing (Gemini 4 expected; GPT-6 rumors)
10. RECOMMENDED IMPLEMENTATION PATH FOR YOU
Phase 1: Reproduce (2-3 weeks)
- Clone the Poetiq GitHub repo
- Set up API access (Gemini 3 recommended; GPT-5.1 as backup)
- Download ARC-AGI-2 public eval set
- Implement basic iterative solver with code generation
- Test on subset of problems (100 of 400)
- Measure accuracy and cost per problem
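A minimal harness for the last two Phase 1 steps, assuming your solver returns predicted test-output grids together with its API spend (the interface is hypothetical):

```python
def evaluate(tasks: dict, solve_fn, limit: int = 100):
    """solve_fn(task) -> (list of predicted test-output grids, cost in USD)."""
    items = list(tasks.items())[:limit]
    solved, total_cost = 0, 0.0
    for task_id, task in items:
        preds, cost = solve_fn(task)
        total_cost += cost
        expected = [pair["output"] for pair in task["test"]]
        solved += int(preds == expected)     # exact match on every test grid
    n = len(items)
    print(f"accuracy {solved}/{n} = {solved/n:.1%} | mean cost ${total_cost/n:.2f}")
```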
Phase 2: Optimize (4-6 weeks)
- Implement multi-model routing
- Build early termination logic
- Refine prompt templates through experimentation
- Test all 400 public problems
- Target: 45-50% accuracy, <$20/problem average
Phase 3: Extend (ongoing)
- Apply learnings to your custom reasoning tasks
- Contact Poetiq if enterprise partnership makes sense
- Monitor for ARC-AGI-3 release and papers
- Contribute improvements back to open-source
QUICK REFERENCE: TECHNICAL SPECS
Aspect | Details
--- | ---
SOTA Accuracy (ARC-AGI-2) | 54% (Semi-Private, verified)
Cost per Problem | $30.57 (includes overhead)
API Calls per Problem | <2 on average
Models Supported | 12+ families (Gemini, GPT, Claude, Grok, open-source)
Code Base | GitHub (limited); full system via partnership
Team Size | 6 researchers (Google DeepMind alumni)
Key Innovation | Meta-system orchestration + test-time reasoning
Open Source Level | Framework wrapper (not core optimization engine)
Contact for Enterprise | poetiq@poetiq.ai
Publication Status | Blog posts & results verified; papers expected 2026
This is a genuinely significant breakthrough because it decouples intelligence from model scale. The practical implication: your organization doesn't need trillion-parameter models to solve complex reasoning tasks—you need smarter orchestration of existing ones. That's a paradigm shift for enterprise AI deployment.