Poetiq: Technical Analysis for Implementation
(Live build in the Hidden State Drift Mastermind)
Poetiq has achieved state-of-the-art (SOTA) performance on ARC-AGI-2 with 54% accuracy at $30.57 per problem, breaking the 50% barrier for the first time and approaching the average human baseline (~60%, which Poetiq exceeds on the public eval set). This represents a 9-point improvement over the previous SOTA (45% by Gemini 3 Deep Think) at less than half the cost ($77.16 → $30.57).
Key Achievement Date: December 5, 2025 (officially verified by ARC Prize)
1. THE CORE INNOVATION: THE META-SYSTEM
What It Is
Poetiq's breakthrough is NOT a new foundation model. Instead, it's a meta-system that orchestrates existing frontier LLMs through:
  1. Intelligent Multi-Agent Coordination - Multiple LLM "experts" that propose solutions, evaluate feedback, and self-audit
  2. Test-Time Compute - Iterative reasoning and self-verification at inference time (not training time)
  3. Adaptive Problem-Solving - Automatically selects which models, prompting strategies, and approaches (including code generation) to use for each specific problem
  4. Cost Optimization - Achieves efficiency through intelligent early stopping and resource allocation
Fundamental Design Principles
"The prompt is an interface, not the intelligence"
  • Doesn't ask a single question; uses iterative loops
  • LLM generates proposed solution → receives feedback → analyzes → refines → repeats
  • Multi-step self-improving process builds and perfects answers incrementally
Self-Auditing
  • System autonomously decides when it has sufficient information
  • Monitors its own progress and terminates when solution is satisfactory
  • Minimizes wasteful computation
Why This Works for ARC-AGI-2
ARC-AGI-2 tests:
  • Abstract pattern recognition - "figure out the rule from 3 examples"
  • Fluid intelligence - NOT knowledge-based, requires true generalization
  • Spatial reasoning - Complex visual pattern relationships
The core problem: Raw frontier models score below human baseline because their stochasticity makes knowledge extraction unreliable. Poetiq's meta-system systematizes knowledge extraction for complex reasoning.
2. TECHNICAL ARCHITECTURE
The Reasoning Loop
Input Problem
Generate Proposed Solution (often via code generation)
Test Against Examples
Receive Feedback/Error Output
Analyze Failure Patterns
Refine Hypothesis & Code
Repeat Until Convergence OR Termination
Output Final Solution
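The loop above can be sketched in Python. This is an illustrative skeleton only, not Poetiq's actual implementation: the `propose` callable stands in for an LLM call that returns a candidate transformation rule, and the convergence and termination logic is a plain "all examples pass or budget exhausted" check.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)

def reasoning_loop(
    propose: Callable[[List[Example], str], Callable[[Grid], Grid]],
    examples: List[Example],
    max_iters: int = 8,
) -> Optional[Callable[[Grid], Grid]]:
    """Iterate propose -> test -> feedback -> refine until convergence."""
    feedback = ""
    for _ in range(max_iters):
        # Generate a proposed solution (in practice, an LLM call).
        candidate = propose(examples, feedback)
        failures = []
        for inp, expected in examples:
            try:
                got = candidate(inp)
            except Exception as exc:  # a crash is also useful feedback
                failures.append(f"crashed on {inp}: {exc}")
                continue
            if got != expected:
                failures.append(f"{inp} -> {got}, expected {expected}")
        if not failures:
            return candidate  # converged: rule explains every example
        feedback = "\n".join(failures)  # feed errors into the next proposal
    return None  # budget exhausted without convergence
```

The key design point is that `feedback` flows back into the next proposal, so each iteration refines rather than restarts.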
Multi-Model Coordination
The meta-system works across multiple LLMs:
  • Gemini 3 (a, b, c variants) - High accuracy variants with different compute budgets
  • GPT-5.1 - Latest OpenAI flagship
  • GPT-OSS-120B - Open-weights model (sub-1-cent cost per problem)
  • Claude Opus 4.5 - Anthropic's flagship
  • Grok 4 Fast Reasoning - xAI's reasoning model
  • 12+ models tested across families - GPT, Claude, Gemini, Grok, GPT-OSS
Key insight: The same meta-system adaptation works across model families and sizes with minimal modification—demonstrating robust generalization.
Code Generation as Core Capability
The system doesn't just use language—it writes and executes Python code:
  • Proposes a rule as executable code
  • Runs the code against provided examples
  • Detects failures via error feedback
  • Debugs and suggests new rules iteratively
  • Tests revised solutions in a self-contained loop
This is crucial for ARC-AGI because abstract reasoning benefits from formal, testable logic.
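A minimal test harness for this pattern might look as follows. This is our sketch, not Poetiq's released code: it `exec`s a model-written snippet that is assumed to define a `transform(grid)` function, runs it against the examples, and returns pass/fail plus textual feedback for the next iteration. In production, executing untrusted model output this way is unsafe; a real harness should run candidates in a sandboxed subprocess with resource limits.

```python
import traceback
from typing import List, Tuple

Grid = List[List[int]]

def run_candidate(code: str, examples: List[Tuple[Grid, Grid]]) -> Tuple[bool, str]:
    """Exec model-written code defining transform(grid) and test it.

    Returns (passed, feedback). WARNING: exec'ing untrusted model output
    is only acceptable inside a sandbox; this sketch skips that for brevity.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception:
        return False, "code failed to load:\n" + traceback.format_exc()
    transform = namespace.get("transform")
    if not callable(transform):
        return False, "no transform(grid) function was defined"
    for i, (inp, expected) in enumerate(examples):
        try:
            got = transform(inp)
        except Exception:
            return False, f"example {i}: transform raised:\n" + traceback.format_exc()
        if got != expected:
            return False, f"example {i}: got {got}, expected {expected}"
    return True, "all examples passed"
```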
Efficiency Mechanism
Uses <2 requests on average per problem (ARC-AGI allows up to 2 attempts)
This efficiency means:
  • Fewer API calls = lower cost
  • Single, well-crafted solution
  • Intelligent early termination
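A toy illustration of why the average stays under the cap, assuming (as the text states) up to 2 scored attempts per problem: stop as soon as one attempt verifies, so the average call count across a problem set falls below 2 whenever most problems pass on the first try.

```python
from typing import Callable

def solve_with_budget(
    attempt: Callable[[int], bool],  # one API round; True means verified solution
    max_attempts: int = 2,           # ARC-AGI scores at most 2 attempts
) -> int:
    """Return the number of attempts used, stopping at the first success."""
    for n in range(1, max_attempts + 1):
        if attempt(n):
            return n  # early termination: no wasted calls
    return max_attempts
```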
3. PERFORMANCE METRICS (VERIFIED)
ARC-AGI-2 Semi-Private (Official Verification - Dec 5, 2025)
System                 Accuracy       Cost/Problem   Status
Poetiq (Gemini 3)      54%            $30.57         NEW SOTA
Gemini 3 Deep Think    45%            $77.16         Previous SOTA
Claude Opus 4.5        ~50% (approx)  ~$60/task      Cost-inefficient
ARC-AGI-2 Public Eval Set
Poetiq systems establish entirely new Pareto frontier (better performance at every cost level):
  • Exceeds human average (60%) on public set
  • Multiple configurations for different cost targets
  • Gemini-3-b: Near-saturation on ARC-AGI-1; continued improvement on ARC-AGI-2
Cross-Model Performance
Same meta-system applied to 12+ models across families:
  • All show improvement in accuracy
  • All show reduction in cost vs baseline
  • Generalization confirmed across:
    - Different model architectures
    - Different sizes (120B to frontier scale)
    - Open and closed weights
Notable Low-Cost Results
  • GPT-OSS-120B (open weights): <1 cent per problem with competitive accuracy
  • Grok 4 Fast: Both cheaper AND more accurate than baseline reported numbers
  • Demonstrates that better reasoning ≠ bigger models
4. HOW TO ACCESS & IMPLEMENT
Option 1: Use the Official Open-Source Code
What's Included:
  • Complete solver framework for ARC-AGI problems
  • Python-based implementation
  • Model integration points for multiple LLMs
  • Example configurations for different cost/performance targets
Limitations (Important):
  • GitHub contains the framework/wrapper code for ARC-AGI specifically
  • Does NOT include the proprietary meta-system optimization engine
  • You get the scaffolding to run tests, but not the core "how to automatically select approaches" logic
  • This is intentionally limited to show reproducibility without giving away core IP
Setup Requirements:
  • Python 3.9+
  • API access to frontier models (Gemini 3, GPT-5.1, Claude, Grok)
  • API keys configured in environment
  • ARC-AGI problem files (available from arcprize.org)
Option 2: Partner with Poetiq Directly
Contact: poetiq@poetiq.ai (alternative: hello@poetiq.ai)
What They Offer:
  • Custom implementation of their full meta-system for your specific problems
  • Integration into existing larger systems (mentioned as capability)
  • Early partner status for deploying their technology
  • Consulting on task-specific optimization
Target Use Cases:
  • Organizations with complex reasoning tasks
  • Enterprise AI systems needing cost-efficient reasoning
  • R&D teams extending their capabilities
Option 3: Leverage Public Results
All ARC-AGI-2 results are on the public evaluation set and reproducible:
  1. Download the public ARC-AGI-2 problem set from arcprize.org
  2. Use the open-source Poetiq code with your choice of LLM API
  3. Adapt the prompt strategies and iterative loops for your use case
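Before adapting the strategies, you need the problem files in memory. The loader below assumes the standard ARC task format (each task is a JSON object with "train" and "test" lists of input/output grid pairs, where grids are 2-D lists of ints 0-9); verify against the files you download from arcprize.org, as this sketch is ours, not Poetiq's.

```python
import json
from typing import Any, Dict, List, Tuple

Grid = List[List[int]]

def load_arc_task(path: str) -> Dict[str, Any]:
    """Load one ARC task file (assumed format: {'train': [...], 'test': [...]},
    each entry a dict with 'input' and usually 'output' grids)."""
    with open(path) as f:
        task = json.load(f)
    assert "train" in task and "test" in task, "unexpected task format"
    return task

def demo_pairs(task: Dict[str, Any]) -> List[Tuple[Grid, Grid]]:
    """Extract the demonstration (input, output) pairs used for rule induction."""
    return [(ex["input"], ex["output"]) for ex in task["train"]]
```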
5. IMPLEMENTATION CONSIDERATIONS FOR RESEARCHERS
Prerequisites
  • LLM API Access (minimum 1, ideally 2+):
    - Gemini 3 Pro API (via Google AI/Vertex AI)
    - GPT-5.1 API (via OpenAI)
    - Claude API (via Anthropic)
  • Compute Budget:
    - Per-problem cost ranges from <$0.01 to $30+
    - Plan for thousands of API calls during development
    - Optimize iteratively
  • Development Environment:
    - Python 3.9+ recommended
    - Jupyter for experimentation
    - Version control (GitHub)
Architecture You'll Need
  1. Problem Parser - Convert ARC-AGI input format to internal representation
  2. Prompt Engine - Generate context-aware prompts for each problem
  3. Code Generator - LLM calls that output Python code solutions
  4. Test Harness - Run generated code against examples, capture failures
  5. Feedback Analyzer - Parse error messages and determine next steps
  6. Termination Logic - Decide when solution is good enough
  7. Model Router - Select which LLM to use for each task (or multi-model voting)
  8. Cost Tracking - Monitor API spend and adjust strategies
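A skeleton wiring a few of these components together might look like this. All names here (`Model`, `Solver`, `route`, `call`) are ours, chosen for illustration; the difficulty-to-model mapping is a naive placeholder for the router, and cost tracking is a simple running total.

```python
from dataclasses import dataclass, field
from typing import List, Protocol

class Model(Protocol):
    """Structural interface any LLM backend must satisfy (hypothetical)."""
    name: str
    cost_per_call: float
    def complete(self, prompt: str) -> str: ...

@dataclass
class Solver:
    """Minimal skeleton wiring Model Router + Cost Tracking components."""
    models: List[Model]
    max_iters: int = 6
    spend: float = field(default=0.0)

    def route(self, difficulty: float) -> Model:
        # Model Router: cheapest model for easy problems, escalate with difficulty.
        ranked = sorted(self.models, key=lambda m: m.cost_per_call)
        idx = min(int(difficulty * len(ranked)), len(ranked) - 1)
        return ranked[idx]

    def call(self, model: Model, prompt: str) -> str:
        # Cost Tracking: accumulate estimated spend per API call.
        self.spend += model.cost_per_call
        return model.complete(prompt)
```

The remaining components (parser, prompt engine, test harness, feedback analyzer, termination logic) would plug into this class as additional collaborators.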
Prompt Strategy Elements
Based on Poetiq's approach:
System Prompt:
- You are an expert at solving visual pattern recognition problems
- Generate Python code that defines the transformation rule
- Your code will be tested against examples
- If it fails, fix it based on error feedback
Few-shot Examples:
- 2-3 ARC problems with solutions
Chain-of-Thought:
- Describe what you observe in the examples
- Propose a hypothesis
- Write code to test it
- Explain expected output vs input
Error Handling:
- If code fails, analyze error
- Suggest alternative rules
- Try again (iteration loop)
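These elements can be assembled into a single prompt builder. This is a sketch based on the general principles above; Poetiq's exact templates are not public, so the wording and structure here are assumptions.

```python
from typing import List, Tuple

Grid = List[List[int]]

# System prompt text paraphrased from the principles above (not Poetiq's exact template).
SYSTEM = (
    "You are an expert at solving visual pattern recognition problems.\n"
    "Generate Python code defining transform(grid); it will be tested "
    "against the examples. If it fails, fix it based on the error feedback."
)

def build_prompt(examples: List[Tuple[Grid, Grid]], feedback: str = "") -> str:
    """Combine system prompt, few-shot examples, CoT instructions, and feedback."""
    parts = [SYSTEM, "", "Examples:"]
    for i, (inp, out) in enumerate(examples, 1):
        parts.append(f"Example {i}: input={inp} output={out}")
    parts += [
        "",
        "First describe what you observe in the examples, propose a "
        "hypothesis, then write code to test it.",
    ]
    if feedback:  # error-handling branch of the iteration loop
        parts += ["", "Previous attempt failed:", feedback,
                  "Analyze the error, suggest an alternative rule, and try again."]
    return "\n".join(parts)
```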
Cost Optimization Strategies
  1. Model Selection by Difficulty
     - Simple problems → GPT-OSS-120B (~$0.001)
     - Medium problems → Grok 4 Fast (~$0.005)
     - Hard problems → Gemini 3 / GPT-5.1 (~$0.01-0.10)
  2. Early Termination
     - If solution passes all examples on first try → stop
     - If code fails multiple times → escalate to better model
  3. Batch Processing
     - Process similar-difficulty problems with same model
     - Reuse successful prompts across related problems
  4. Single-Attempt Focus
     - Aim for <1.5 API calls per problem average
     - Better planning upfront = fewer retries
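Strategies 1 and 2 combine naturally into an escalation ladder: start with the cheapest model and only move up after repeated failures. The tier names and per-call prices below are taken from the figures quoted above but are illustrative placeholders, not real API pricing.

```python
from typing import Callable, Optional, Tuple

# Hypothetical (model, cost-per-call) tiers from cheapest to strongest,
# using the approximate figures quoted in the strategies above.
TIERS = [
    ("gpt-oss-120b", 0.001),
    ("grok-4-fast", 0.005),
    ("gemini-3", 0.05),
]

def escalating_solve(
    try_model: Callable[[str], bool],   # one attempt; True means solution verified
    max_failures_per_tier: int = 2,
) -> Tuple[Optional[str], float]:
    """Start cheap; escalate to a stronger model after repeated failures.

    Returns (winning model name or None, total estimated spend)."""
    total_cost = 0.0
    for name, cost in TIERS:
        for _ in range(max_failures_per_tier):
            total_cost += cost
            if try_model(name):
                return name, total_cost
    return None, total_cost
```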
6. CRITICAL CAVEATS & LIMITATIONS
Public vs Semi-Private Performance Drop
The verified result is 54% accuracy on the Semi-Private eval (official verification). However:
  • ARC-AGI-1 shows significant drops from public to semi-private (varies by model)
  • ARC-AGI-2 semi-private and public are more closely calibrated (smaller drops)
  • When evaluating your implementation, expect 1-5% variance between test sets
What Hasn't Been Released
  1. The meta-system's optimization logic - How it decides which approach to use
  2. Exact prompt templates - Only general principles disclosed
  3. Model selection heuristics - The "router" that picks between models
  4. Training data for adaptation - They adapted on open-source models only, then transferred to frontier models
Generalization to Your Use Case
Poetiq states the system generalizes to "over a dozen different benchmarks" and "various reasoning and retrieval tasks," but:
  • Detailed results on non-ARC tasks NOT yet published
  • Different domains may require different prompt architectures
  • Your specific task may need substantial customization
7. COMPETITIVE LANDSCAPE
Who Else Is Competing
  • Gemini 3 Deep Think - Raw frontier model (45%, $77.16)
  • Claude Opus 4.5 - Alternative frontier model
  • Grok 4 Fast Reasoning - xAI's reasoning approach
  • o1-pro (OpenAI) - Estimated SOTA on public set via scoring
Why Poetiq's Approach Is Different
  • Doesn't fine-tune or update the models themselves
  • Works with whatever frontier model is released
  • Cost-efficient through orchestration, not scale
  • Provably transfers across model families
8. BUSINESS MODEL & AVAILABILITY
Current Status (December 2025)
Open Source: Limited (solver framework)
Commercial: Available through partnership
Team: 6 researchers/engineers, 53 years combined Google DeepMind experience
Paths to Implementation
  1. DIY with open-source - Free, requires deep expertise
  2. Partner directly - Contact poetiq@poetiq.ai for enterprise deployment
  3. Replicate from papers - Wait for research papers (expected in 2026)
9. WHAT THIS MEANS FOR THE AI RESEARCH COMMUNITY
Key Insights
  1. Intelligence is extractable, not just scale - Better reasoning through orchestration vs bigger models
  2. Test-time compute matters - Iterative verification beats single-attempt generation
  3. LLM-agnostic systems are possible - Same logic works across model families
  4. Efficiency frontier is real - Can achieve better results for lower cost through smarter design
Implications for 2025-2026
  • ARC-AGI-3 launching early 2026 (interactive reasoning, different format)
  • Grand Prize remains unclaimed (>90% accuracy still possible)
  • Benchmark-specific optimization vs general reasoning capability debate ongoing
  • Frontier models continue releasing (Gemini 4 expected; GPT-6 rumors)
10. RECOMMENDED IMPLEMENTATION PATH FOR YOU
Phase 1: Reproduce (2-3 weeks)
  1. Clone the Poetiq GitHub repo
  2. Set up API access (Gemini 3 recommended; GPT-5.1 as backup)
  3. Download ARC-AGI-2 public eval set
  4. Implement basic iterative solver with code generation
  5. Test on subset of problems (100 of 400)
  6. Measure accuracy and cost per problem
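For step 6, a small scoring helper suffices. This is our sketch: it takes a per-problem list of (solved?, dollars spent) records and reports the accuracy and cost-per-problem numbers used throughout this analysis.

```python
from typing import Dict, List, Tuple

def score_run(results: List[Tuple[bool, float]]) -> Dict[str, float]:
    """Summarize a run: results is (solved?, dollars spent) per problem."""
    n = len(results)
    solved = sum(1 for ok, _ in results if ok)
    spent = sum(cost for _, cost in results)
    return {
        "accuracy": solved / n,       # fraction of problems solved
        "avg_cost": spent / n,        # dollars per problem
        "total_cost": spent,          # total dollars for the run
    }
```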
Phase 2: Optimize (4-6 weeks)
  1. Implement multi-model routing
  2. Build early termination logic
  3. Refine prompt templates through experimentation
  4. Test all 400 public problems
  5. Target: 45-50% accuracy, <$20/problem average
Phase 3: Extend (ongoing)
  1. Apply learnings to your custom reasoning tasks
  2. Contact Poetiq if enterprise partnership makes sense
  3. Monitor for ARC-AGI-3 release and papers
  4. Contribute improvements back to open-source
QUICK REFERENCE: TECHNICAL SPECS
SOTA Accuracy (ARC-AGI-2): 54% (Semi-Private, verified)
Cost per Problem: $30.57 (includes overhead)
API Calls per Problem: <2 on average
Models Supported: 12+ families (Gemini, GPT, Claude, Grok, open-source)
Code Base: GitHub (limited); full system via partnership
Team Size: 6 researchers (Google DeepMind alumni)
Key Innovation: Meta-system orchestration + test-time reasoning
Open Source Level: Framework wrapper (not core optimization engine)
Contact for Enterprise: poetiq@poetiq.ai
Blog posts & results verified; papers expected 2026
This is a genuinely significant breakthrough because it decouples intelligence from model scale. The practical implication: your organization doesn't need trillion-parameter models to solve complex reasoning tasks—you need smarter orchestration of existing ones. That's a paradigm shift for enterprise AI deployment.
Guerin Green
Burstiness and Perplexity
skool.com/burstiness-and-perplexity