(Live build in the Hidden State Drift Mastermind)
Poetiq has achieved state-of-the-art (SOTA) performance on ARC-AGI-2 with 54% accuracy at $30.57 per problem, breaking the 50% barrier on the semi-private set for the first time and exceeding the average human baseline (~60%) on the public eval set. This represents a 9-point improvement over the previous SOTA (45% by Gemini 3 Deep Think) at less than half the cost ($77.16 → $30.57).
Key Achievement Date: December 5, 2025 (officially verified by ARC Prize)
1. THE CORE INNOVATION: THE META-SYSTEM
What It Is
Poetiq's breakthrough is NOT a new foundation model. Instead, it's a meta-system that orchestrates existing frontier LLMs through:
- Intelligent Multi-Agent Coordination - Multiple LLM "experts" that propose solutions, evaluate feedback, and self-audit
- Test-Time Compute - Iterative reasoning and self-verification at inference time (not training time)
- Adaptive Problem-Solving - Automatically selects which models, prompting strategies, and approaches (including code generation) to use for each specific problem
- Cost Optimization - Achieves efficiency through intelligent early stopping and resource allocation
Fundamental Design Principles
"The prompt is an interface, not the intelligence"
- Doesn't ask a single question; uses iterative loops
- LLM generates proposed solution → receives feedback → analyzes → refines → repeats
- Multi-step self-improving process builds and perfects answers incrementally
Self-Auditing
- System autonomously decides when it has sufficient information
- Monitors its own progress and terminates when solution is satisfactory
- Minimizes wasteful computation
Why This Works for ARC-AGI-2
ARC-AGI-2 tests:
- Abstract pattern recognition - "figure out the rule from 3 examples"
- Fluid intelligence - NOT knowledge-based, requires true generalization
- Spatial reasoning - Complex visual pattern relationships
The core problem: raw frontier models score below the human baseline because their stochasticity makes knowledge extraction unreliable. Poetiq's meta-system systematizes knowledge extraction for complex reasoning.
2. TECHNICAL ARCHITECTURE
The Reasoning Loop
Input Problem
↓
Generate Proposed Solution (often via code generation)
↓
Test Against Examples
↓
Receive Feedback/Error Output
↓
Analyze Failure Patterns
↓
Refine Hypothesis & Code
↓
Repeat Until Convergence OR Termination
↓
Output Final Solution
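A minimal sketch of this loop in Python; `call_llm`, the prompt text, and the `transform(grid)` convention are illustrative assumptions, not Poetiq's released code:

```python
# Sketch of the propose -> test -> refine loop (illustrative, not Poetiq's code).
# `call_llm` is any function taking a prompt string and returning Python source.

def solve_task(task, call_llm, max_iters=5):
    feedback = ""
    for _ in range(max_iters):
        prompt = build_prompt(task["train"], feedback)
        code = call_llm(prompt)                 # propose a rule as executable code
        ok, feedback = run_candidate(code, task["train"])
        if ok:                                  # self-audit: all training pairs pass
            return code
    return None                                 # budget exhausted without convergence

def build_prompt(examples, feedback):
    head = "Write a Python function transform(grid) implementing the rule.\n"
    shots = "\n".join(f"IN: {ex['input']}\nOUT: {ex['output']}" for ex in examples)
    tail = f"\nYour last attempt failed:\n{feedback}" if feedback else ""
    return head + shots + tail

def run_candidate(code, examples):
    try:
        ns = {}
        exec(code, ns)                          # unsafe shortcut; sandboxed variant below
        for ex in examples:
            got = ns["transform"](ex["input"])
            if got != ex["output"]:
                return False, f"expected {ex['output']}, got {got}"
        return True, ""
    except Exception as e:
        return False, repr(e)
```

The key design point is that feedback from failed executions flows into the next prompt, so each iteration conditions on concrete evidence instead of restarting from scratch.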
Multi-Model Coordination
The meta-system works across multiple LLMs:
- Gemini 3 (a, b, c variants) - High accuracy variants with different compute budgets
- GPT-5.1 - Latest OpenAI flagship
- GPT-OSS-120B - Open-weights model (sub-1-cent cost per problem)
- Claude Opus 4.5 - Anthropic's flagship
- Grok 4 Fast Reasoning - xAI's reasoning model
- 12+ model families tested - GPT, Claude, Gemini, Grok, GPT-OSS
Key insight: The same meta-system adaptation works across model families and sizes with minimal modification—demonstrating robust generalization.
Code Generation as Core Capability
The system doesn't just use language—it writes and executes Python code:
- Proposes a rule as executable code
- Runs the code against provided examples
- Detects failures via error feedback
- Debugs and suggests new rules iteratively
- Tests revised solutions in a self-contained loop
This is crucial for ARC-AGI because abstract reasoning benefits from formal, testable logic.
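Because the model's code is untrusted, a practical harness executes it in a separate process with a timeout, so crashes and infinite loops surface as feedback instead of hanging the solver. A sketch under our own assumptions (file layout, timeout, and verdict format are not from Poetiq):

```python
import json
import subprocess
import sys
import tempfile

# Child-process script: load candidate code and examples, print a JSON verdict.
RUNNER = """
import json, sys
ns = {}
exec(open(sys.argv[1]).read(), ns)
for ex in json.load(open(sys.argv[2])):
    if ns["transform"](ex["input"]) != ex["output"]:
        print(json.dumps({"ok": False, "msg": "wrong output on a training pair"}))
        sys.exit(0)
print(json.dumps({"ok": True, "msg": ""}))
"""

def run_sandboxed(code: str, examples: list, timeout_s: float = 5.0):
    paths = {}
    for key, text, suffix in [("code", code, ".py"),
                              ("examples", json.dumps(examples), ".json"),
                              ("runner", RUNNER, ".py")]:
        with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
            f.write(text)
            paths[key] = f.name
    try:
        res = subprocess.run(
            [sys.executable, paths["runner"], paths["code"], paths["examples"]],
            capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    if res.returncode != 0:
        return False, res.stderr[-500:]      # traceback tail becomes feedback
    if not res.stdout.strip():
        return False, "no verdict produced"
    # Candidate code may print; the verdict is always the runner's last line.
    verdict = json.loads(res.stdout.strip().splitlines()[-1])
    return verdict["ok"], verdict["msg"]
```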
Efficiency Mechanism
Averages fewer than 2 attempts per problem (ARC-AGI scoring allows up to 2 answer attempts per task)
This efficiency means:
- Fewer API calls = lower cost
- Single, well-crafted solution
- Intelligent early termination
3. PERFORMANCE METRICS (VERIFIED)
ARC-AGI-2 Semi-Private (Official Verification - Dec 5, 2025)
System | Accuracy | Cost/Problem | Status
--- | --- | --- | ---
Poetiq (Gemini 3) | 54% | $30.57 | New SOTA
Gemini 3 Deep Think | 45% | $77.16 | Previous SOTA
Claude Opus 4.5 | ~50% | ~$60/task | Cost-inefficient
ARC-AGI-2 Public Eval Set
Poetiq systems establish an entirely new Pareto frontier (better performance at every cost level):
- Exceeds human average (60%) on public set
- Multiple configurations for different cost targets
- Gemini-3-b: Near-saturation on ARC-AGI-1; continued improvement on ARC-AGI-2
Cross-Model Performance
Same meta-system applied to 12+ models across families:
- All show improvement in accuracy
- All show reduction in cost vs baseline
- Generalization confirmed across:
  - Different model architectures
  - Different sizes (120B to frontier scale)
  - Open and closed weights
Notable Low-Cost Results
- GPT-OSS-120B (open weights): <1 cent per problem with competitive accuracy
- Grok 4 Fast: both cheaper and more accurate than the model's reported baseline numbers
- Demonstrates that better reasoning ≠ bigger models
4. HOW TO ACCESS & IMPLEMENT
Option 1: Use the Official Open-Source Code
What's Included:
- Complete solver framework for ARC-AGI problems
- Python-based implementation
- Model integration points for multiple LLMs
- Example configurations for different cost/performance targets
Limitations (Important):
- GitHub contains the framework/wrapper code for ARC-AGI specifically
- Does NOT include the proprietary meta-system optimization engine
- You get the scaffolding to run tests, but not the core "how to automatically select approaches" logic
- This is intentionally limited to show reproducibility without giving away core IP
Setup Requirements:
- Python 3.9+
- API access to frontier models (Gemini 3, GPT-5.1, Claude, Grok)
- API keys configured in environment
- ARC-AGI problem files (available from arcprize.org)
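A quick pre-flight check that keys are configured before spending money; the environment-variable names below are common provider conventions and may not match the repo's exact configuration:

```python
import os

# Conventional env-var names for these providers; the repo's config may differ.
REQUIRED = {
    "GOOGLE_API_KEY": "Gemini 3",
    "OPENAI_API_KEY": "GPT-5.1",
    "ANTHROPIC_API_KEY": "Claude",
    "XAI_API_KEY": "Grok",
}

missing = [f"{key} ({model})" for key, model in REQUIRED.items()
           if not os.environ.get(key)]
if missing:
    raise SystemExit("Missing API keys: " + ", ".join(missing))
```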
Option 2: Partner with Poetiq Directly
What They Offer:
- Custom implementation of their full meta-system for your specific problems
- Integration into existing larger systems (mentioned as capability)
- Early partner status for deploying their technology
- Consulting on task-specific optimization
Target Use Cases:
- Organizations with complex reasoning tasks
- Enterprise AI systems needing cost-efficient reasoning
- R&D teams extending their capabilities
Option 3: Leverage Public Results
Poetiq's ARC-AGI-2 results on the public evaluation set are reproducible:
- Download the public ARC-AGI-2 problem set from arcprize.org
- Use the open-source Poetiq code with your choice of LLM API
- Adapt the prompt strategies and iterative loops for your use case
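For step 1, a loader sketch assuming the per-task JSON layout used by the public ARC repos (one file per task, with "train" and "test" lists of input/output grids):

```python
import json
from pathlib import Path

def load_tasks(eval_dir: str) -> dict:
    """Map task id -> task dict with 'train' and 'test' example lists."""
    return {p.stem: json.loads(p.read_text())
            for p in sorted(Path(eval_dir).glob("*.json"))}

tasks = load_tasks("data/evaluation")   # adjust to wherever you unpacked the set
first = next(iter(tasks.values()))
print(len(first["train"]), "training pairs;", len(first["test"]), "test inputs")
```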
5. IMPLEMENTATION CONSIDERATIONS FOR RESEARCHERS
Prerequisites
- LLM API Access (minimum 1, ideally 2+):
  - Gemini 3 Pro API (via Google AI/Vertex AI)
  - GPT-5.1 API (via OpenAI)
  - Claude API (via Anthropic)
- Compute Budget:
  - Per-problem cost ranges from <$0.01 to $30+
  - Plan for thousands of API calls during development
  - Optimize iteratively
- Development Environment:
  - Python 3.9+ recommended
  - Jupyter for experimentation
  - Version control (GitHub)
Architecture You'll Need
- Problem Parser - Convert ARC-AGI input format to internal representation
- Prompt Engine - Generate context-aware prompts for each problem
- Code Generator - LLM calls that output Python code solutions
- Test Harness - Run generated code against examples, capture failures
- Feedback Analyzer - Parse error messages and determine next steps
- Termination Logic - Decide when solution is good enough
- Model Router - Select which LLM to use for each task (or multi-model voting)
- Cost Tracking - Monitor API spend and adjust strategies
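One plausible way to wire these components together; every interface and name below is our own sketch, since Poetiq's actual architecture is not public:

```python
from dataclasses import dataclass

@dataclass
class SolverConfig:
    max_iters: int = 5          # termination logic: hard cap on refinement rounds
    budget_usd: float = 1.00    # cost tracking: per-problem spend ceiling

class Solver:
    """Glue object over the components listed above (interfaces assumed)."""
    def __init__(self, router, prompt_engine, harness, cfg: SolverConfig):
        self.router = router            # model router: picks an LLM per call
        self.prompts = prompt_engine    # prompt engine: context-aware prompts
        self.harness = harness          # test harness + feedback analyzer
        self.cfg = cfg

    def solve(self, task):
        spent, feedback = 0.0, ""
        for _ in range(self.cfg.max_iters):
            model = self.router.pick(task, feedback)
            prompt = self.prompts.build(task, feedback)
            code, usd = model.complete(prompt)         # code generator step
            spent += usd
            ok, feedback = self.harness.run(code, task)
            if ok:
                return code                            # termination: good enough
            if spent >= self.cfg.budget_usd:
                break                                  # termination: over budget
        return None
```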
Prompt Strategy Elements
Based on Poetiq's approach:
System Prompt:
- You are an expert at solving visual pattern recognition problems
- Generate Python code that defines the transformation rule
- Your code will be tested against examples
- If it fails, fix it based on error feedback
Few-shot Examples:
- 2-3 ARC problems with solutions
Chain-of-Thought:
- Describe what you observe in the examples
- Propose a hypothesis
- Write code to test it
- Explain expected output vs input
Error Handling:
- If code fails, analyze error
- Suggest alternative rules
- Try again (iteration loop)
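A sketch assembling those four elements into a single prompt builder; the template wording is illustrative, since Poetiq's exact templates have not been disclosed:

```python
# Prompt assembly from the elements above (wording is illustrative).
SYSTEM = (
    "You are an expert at solving visual pattern recognition problems.\n"
    "Generate Python code defining transform(grid); it will be tested against examples.\n"
    "If it fails, fix it based on the error feedback."
)

def build_prompt(task, feedback="", few_shot=()):
    parts = [SYSTEM]
    for shot in few_shot:                      # 2-3 solved ARC problems
        parts.append(f"Solved example:\n{shot}")
    parts.append("Describe what you observe, propose a hypothesis, "
                 "then write code to test it.")          # chain-of-thought element
    for ex in task["train"]:
        parts.append(f"IN: {ex['input']}\nOUT: {ex['output']}")
    if feedback:                               # error-handling element of the loop
        parts.append(f"Your previous code failed:\n{feedback}\nSuggest a new rule.")
    return "\n\n".join(parts)
```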
Cost Optimization Strategies
- Model Selection by Difficulty (see the sketch after this list)
  - Simple problems → GPT-OSS-120B (~$0.001)
  - Medium problems → Grok 4 Fast (~$0.005)
  - Hard problems → Gemini 3 / GPT-5.1 (~$0.01-0.10)
- Early Termination
  - If solution passes all examples on first try → stop
  - If code fails multiple times → escalate to a better model
- Batch Processing
  - Process similar-difficulty problems with the same model
  - Reuse successful prompts across related problems
- Single-Attempt Focus
  - Aim for <1.5 API calls per problem on average
  - Better planning upfront = fewer retries
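A sketch combining difficulty-based selection with early termination and escalation; the model names and rough per-problem costs come from the list above, while the ladder order and retry threshold are assumptions:

```python
# Difficulty ladder: cheapest model first, escalate on repeated failure.
LADDER = [
    ("gpt-oss-120b", 0.001),    # simple problems
    ("grok-4-fast",  0.005),    # medium problems
    ("gemini-3-pro", 0.05),     # hard problems
]

def solve_with_escalation(task, solve_fn, max_fails_per_tier=2):
    """solve_fn(task, model_name) -> solution or None (hypothetical interface)."""
    for model, est_cost in LADDER:
        for _ in range(max_fails_per_tier):
            result = solve_fn(task, model)
            if result is not None:           # early termination on first success
                return result, model, est_cost
        # repeated failure at this tier -> escalate to a stronger model
    return None, None, None
```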
6. CRITICAL CAVEATS & LIMITATIONS
Public vs Semi-Private Performance Drop
The headline 54% accuracy is on the Semi-Private eval (officially verified). However:
- ARC-AGI-1 shows significant drops from public to semi-private (varies by model)
- ARC-AGI-2 semi-private and public are more closely calibrated (smaller drops)
- When evaluating your implementation, expect 1-5% variance between test sets
What Hasn't Been Released
- The meta-system's optimization logic - How it decides which approach to use
- Exact prompt templates - Only general principles disclosed
- Model selection heuristics - The "router" that picks between models
- Training data for adaptation - They adapted on open-source models only, then transferred to frontier models
Generalization to Your Use Case
Poetiq states the system generalizes to "over a dozen different benchmarks" and "various reasoning and retrieval tasks," but:
- Detailed results on non-ARC tasks NOT yet published
- Different domains may require different prompt architectures
- Your specific task may need substantial customization
7. COMPETITIVE LANDSCAPE
Who Else Is Competing
- Gemini 3 Deep Think - Raw frontier model (45%, $77.16)
- Claude Opus 4.5 - Alternative frontier model
- Grok 4 Fast Reasoning - xAI's reasoning approach
- o1-pro (OpenAI) - Estimated SOTA on the public set, based on scoring
Why Poetiq's Approach Is Different
- Doesn't fine-tune or update the models themselves
- Works with whatever frontier model is released
- Cost-efficient through orchestration, not scale
- Demonstrated to transfer across model families
8. BUSINESS MODEL & AVAILABILITY
Current Status (December 2025)
- Open Source: Limited (solver framework)
- Commercial: Available through partnership
- Team: 6 researchers/engineers, 53 years combined Google DeepMind experience
Paths to Implementation
- DIY with open-source - Free, requires deep expertise
- Partner directly - Contact poetiq@poetiq.ai for enterprise deployment
- Replicate from papers - Wait for research papers (expected in 2026)
9. WHAT THIS MEANS FOR THE AI RESEARCH COMMUNITY
Key Insights
- Intelligence is extractable, not just scale - Better reasoning through orchestration vs bigger models
- Test-time compute matters - Iterative verification beats single-attempt generation
- LLM-agnostic systems are possible - Same logic works across model families
- Efficiency frontier is real - Can achieve better results for lower cost through smarter design
Implications for 2025-2026
- ARC-AGI-3 launching early 2026 (interactive reasoning, different format)
- Grand Prize remains unclaimed (>90% accuracy still possible)
- Benchmark-specific optimization vs general reasoning capability debate ongoing
- Frontier models continue releasing (Gemini 4 expected; GPT-6 rumors)
10. RECOMMENDED IMPLEMENTATION PATH FOR YOU
Phase 1: Reproduce (2-3 weeks)
- Clone the Poetiq GitHub repo
- Set up API access (Gemini 3 recommended; GPT-5.1 as backup)
- Download ARC-AGI-2 public eval set
- Implement basic iterative solver with code generation
- Test on subset of problems (100 of 400)
- Measure accuracy and cost per problem
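A minimal harness for the last two Phase 1 steps, assuming your solver returns predicted test-output grids together with its API spend (the interface is hypothetical):

```python
def evaluate(tasks: dict, solve_fn, limit: int = 100):
    """solve_fn(task) -> (list of predicted test-output grids, cost in USD)."""
    items = list(tasks.items())[:limit]
    solved, total_cost = 0, 0.0
    for task_id, task in items:
        preds, cost = solve_fn(task)
        total_cost += cost
        expected = [pair["output"] for pair in task["test"]]
        solved += int(preds == expected)     # exact match on every test grid
    n = len(items)
    print(f"accuracy {solved}/{n} = {solved/n:.1%} | mean cost ${total_cost/n:.2f}")
```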
Phase 2: Optimize (4-6 weeks)
- Implement multi-model routing
- Build early termination logic
- Refine prompt templates through experimentation
- Test all 400 public problems
- Target: 45-50% accuracy, <$20/problem average
Phase 3: Extend (ongoing)
- Apply learnings to your custom reasoning tasks
- Contact Poetiq if enterprise partnership makes sense
- Monitor for ARC-AGI-3 release and papers
- Contribute improvements back to open-source
QUICK REFERENCE: TECHNICAL SPECS
Aspect | Details
--- | ---
SOTA Accuracy (ARC-AGI-2) | 54% (Semi-Private, verified)
Cost per Problem | $30.57 (includes overhead)
API Calls per Problem | <2 on average
Models Supported | 12+ families (Gemini, GPT, Claude, Grok, open-source)
Code Base | GitHub (limited); full system via partnership
Team Size | 6 researchers (Google DeepMind alumni)
Key Innovation | Meta-system orchestration + test-time reasoning
Open Source Level | Framework wrapper (not core optimization engine)
Contact for Enterprise | poetiq@poetiq.ai
Publication Status | Blog posts & results verified; papers expected 2026
This is a genuinely significant breakthrough because it decouples intelligence from model scale. The practical implication: your organization doesn't need trillion-parameter models to solve complex reasoning tasks—you need smarter orchestration of existing ones. That's a paradigm shift for enterprise AI deployment.