Kimi K2 - Drop
Here's a summary of the technical details of Kimi K2 from Moonshot AI.
📐 Architecture & Parameters
  • Mixture-of-Experts (MoE) architecture with 1 trillion total parameters, 32 billion active per token.
  • Features 384 experts, with 8 routed experts active per token plus 1 shared expert.
  • 61 layers and 64 attention heads.
  • Uses SwiGLU activation functions.
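The MoE routing described above (pick a few experts per token, plus an always-on shared expert) can be sketched at toy scale. This is a minimal illustration, not Moonshot's implementation; the 16-expert/top-4 sizes below are stand-ins for K2's 384 experts with top-8 routing.

```python
import math
import random

def top_k_route(hidden, gate_weights, k=8):
    """Score each expert for one token and keep the top-k.

    hidden: token representation (list of floats)
    gate_weights: one router weight vector per expert
    Returns the chosen expert indices and their softmax-normalized mixing weights.
    """
    # Router logits: dot product of the token with each expert's gate vector.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in gate_weights]
    # Keep only the k highest-scoring experts for this token.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits gives the mixing weights.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    return top, [e / z for e in exps]

# Toy scale: 16 experts, top-4 routing, 8-dim tokens.
random.seed(0)
dim, n_experts = 8, 16
gate = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
token = [random.gauss(0, 1) for _ in range(dim)]
experts, weights = top_k_route(token, gate, k=4)
# In K2, the shared expert would run on every token in addition to the routed ones.
```

Only the selected experts' FFNs run for a given token, which is how a 1T-parameter model keeps just 32B parameters active per token.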
🎯 Muon Optimizer / MuonClip
  • Trained with the Muon optimizer extended by MuonClip, whose "qk-clip" mechanism rescales query/key projections to stabilize attention logits at scale.
  • Pre-trained on 15.5 trillion tokens; Muon is reported to give roughly 2× the token efficiency of AdamW with about half the optimizer memory.
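The qk-clip idea can be sketched as follows: if the largest pre-softmax attention logit exceeds a threshold, rescale the query and key projections so the logit is pulled back under it. This is a hedged toy sketch; the threshold value and the per-head bookkeeping are assumptions, not Moonshot's exact implementation.

```python
def qk_clip(w_q, w_k, max_logit, tau=100.0):
    """Rescale query/key weight matrices if attention logits overflow tau.

    w_q, w_k: weight matrices as lists of rows (toy representation)
    max_logit: largest pre-softmax attention logit observed this step
    """
    if max_logit <= tau:
        return w_q, w_k  # logits in range: leave the weights untouched
    # Logits are bilinear in W_q and W_k, so scaling each matrix by
    # sqrt(tau / max_logit) rescales the logits by exactly tau / max_logit.
    gamma = (tau / max_logit) ** 0.5
    scale = lambda w: [[gamma * x for x in row] for row in w]
    return scale(w_q), scale(w_k)

# Example: a logit of 400 against tau=100 gives gamma = 0.5 per matrix.
q_clipped, k_clipped = qk_clip([[2.0]], [[3.0]], max_logit=400.0, tau=100.0)
```

Splitting the correction evenly between queries and keys keeps both projections at comparable scale instead of shrinking one side only.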
🧠 Context Window
  • Supports a 128,000-token context window (~150–200 pages).
📊 Performance & Benchmarks
  • SWE‑bench Verified (software engineering): 65.8% Pass@1.
  • LiveCodeBench: 53.7% Pass@1.
  • MATH‑500: 97.4% accuracy.
  • MMLU (language understanding): 89.5%.
  • Tau2 retail tasks: 70.6% Avg@4.
  • Also performed strongly in ZebraLogic, GPQA, etc.
🤖 Agentic Capabilities
  • Designed for agentic AI, with multi-step workflows and tool orchestration.
  • Trained using Large-Scale Agentic Data Synthesis, RL, and self-critique.
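The multi-step tool orchestration above can be sketched as a simple dispatch loop: the model plans a sequence of tool calls and each result is recorded for the next step. The tool registry and plan format here are illustrative assumptions; real agentic use would go through the Kimi K2 API's tool-calling schema.

```python
# Toy tool registry — stand-ins for real tools (search, code execution, etc.).
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def run_agent(plan):
    """Execute a list of planned tool calls, logging each result."""
    transcript = []
    for step in plan:
        name, args = step["tool"], step["args"]
        result = TOOLS[name](**args)  # dispatch to the named tool
        transcript.append({"tool": name, "result": result})
    return transcript

plan = [
    {"tool": "add", "args": {"a": 2, "b": 3}},
    {"tool": "upper", "args": {"s": "done"}},
]
log = run_agent(plan)
```

In a full agent loop, the model would be called again between steps to decide the next tool from the transcript so far, rather than following a fixed plan.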
⚙️ Deployment & APIs
  • Available as Kimi‑K2‑Base (pre-trained) and Kimi‑K2‑Instruct (fine-tuned for chat/general use).
  • Context limit: 128,000 tokens.
  • Estimated speed: ~200 tokens per second (Groq Cloud).
  • Accessible via Groq Cloud, SiliconFlow, OpenRouter, Together AI.
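Most of the providers above expose OpenAI-compatible chat endpoints, so a plain HTTP POST is enough to query the model. This is a hedged sketch: the OpenRouter endpoint and the `moonshotai/kimi-k2` model slug are assumptions — check your provider's model list and docs before use.

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"  # assumed endpoint

# OpenAI-compatible chat payload.
payload = {
    "model": "moonshotai/kimi-k2",  # assumed slug; naming varies by provider
    "messages": [
        {"role": "user", "content": "Summarize MoE routing in one sentence."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",  # placeholder key
        "Content-Type": "application/json",
    },
)
# response = urllib.request.urlopen(req)  # uncomment with a real API key
```

Because the payload shape matches the OpenAI chat format, swapping providers usually means changing only the base URL, model slug, and key.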
💾 Licensing & Open Source
  • Open weights and code under a modified MIT license (large commercial deployments must prominently display "Kimi K2").
  • Can run locally (~1 TB weights) or in the cloud.
🧩 Advanced Features
  • Mooncake service: disaggregated KV‑Cache architecture for high throughput on long contexts.
  • Compatible with VS Code, Cline, Roo Code.
  • No native multimodal support (no Kimi‑VL or Kimi‑Audio yet).
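The disaggregated KV-cache idea behind Mooncake can be sketched as separating prefill from decode: one worker fills the cache for the whole prompt in a single pass, another reuses that cache to generate tokens one at a time, so the two phases can be scheduled on different machines. A pure-Python toy, not the actual Mooncake design:

```python
class KVCache:
    """Minimal per-sequence key/value cache."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def prefill(cache, prompt_tokens):
    # Prefill worker: compute K/V entries for every prompt token in one pass.
    for t in prompt_tokens:
        cache.append(("k", t), ("v", t))

def decode_step(cache, new_token):
    # Decode worker: append one token's K/V, then attend over the full cache.
    cache.append(("k", new_token), ("v", new_token))
    return len(cache.keys)  # attention span now covers prompt + generated tokens

cache = KVCache()
prefill(cache, ["a", "b", "c"])
span = decode_step(cache, "d")
```

Keeping the cache transferable between workers is what lets long-context prefill (compute-bound) and decode (memory-bound) scale independently.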
AI Automation Society
skool.com/ai-automation-society