in both reasoning and non-reasoning models. See the attached benchmarks.
Andrej Karpathy, former Director of AI at Tesla, wrote a lengthy post on X about his experience with early access to Grok 3.
Here's a summary of the post:
- Thinking Capability: Grok 3's thinking model is on par with top OpenAI models, successfully handling complex tasks like creating a Settlers of Catan game webpage with a dynamic hex grid. However, it failed to solve an emoji mystery in which a message was hidden inside an emoji using Unicode variation selectors.
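For context on the emoji mystery: Unicode variation selectors are invisible codepoints (VS1-VS16 at U+FE00..U+FE0F, VS17-VS256 at U+E0100..U+E01EF) that can be appended to a character, which makes them usable as a steganographic channel. A minimal sketch of this kind of encoding (the exact scheme in Karpathy's puzzle is an assumption here):

```python
# Hiding arbitrary bytes after a carrier emoji using Unicode
# variation selectors. VS1-VS16 cover byte values 0-15 (U+FE00..U+FE0F);
# VS17-VS256 cover values 16-255 (U+E0100..U+E01EF).

def byte_to_vs(b: int) -> str:
    """Map one byte (0-255) to an invisible variation selector."""
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + (b - 16))

def vs_to_byte(ch: str):
    """Map a variation selector back to its byte; None for other chars."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return (cp - 0xE0100) + 16
    return None

def encode(carrier: str, message: str) -> str:
    """Append the message's bytes as variation selectors after the carrier."""
    return carrier + "".join(byte_to_vs(b) for b in message.encode("utf-8"))

def decode(text: str) -> str:
    """Recover hidden bytes, ignoring non-selector characters."""
    data = bytes(b for b in map(vs_to_byte, text) if b is not None)
    return data.decode("utf-8")

stego = encode("\N{GRINNING FACE}", "hello")
print(decode(stego))  # hello  (stego renders as a single emoji in most UIs)
```

Decoding such a message only requires knowing the selector-to-byte mapping, which is presumably what the puzzle asked the model to work out.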
- Tic Tac Toe: Grok 3 could solve simple Tic Tac Toe puzzles but struggled with generating "tricky" boards, similar to OpenAI's o1-pro.
- Knowledge Retrieval: When tested with knowledge-based questions about the GPT-2 paper, Grok 3 performed well, including estimating the computational cost of training GPT-2 without searching the internet, a task other models like o1-pro failed at.
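The kind of estimate involved is the standard back-of-envelope: training FLOPs roughly equal 6 x parameters x tokens. GPT-2's 1.5B parameter count is from the paper; the total token count used below is an assumption for illustration, not a figure from Karpathy's post:

```python
# Back-of-envelope training-cost estimate using the common
# FLOPs ~= 6 * parameters * tokens approximation.
# GPT-2's parameter count (1.5B) is from the paper; the token count
# (assumed ~100B tokens processed across all epochs) is illustrative.

params = 1.5e9    # GPT-2 parameter count
tokens = 100e9    # ASSUMED total tokens seen during training

flops = 6 * params * tokens
print(f"~{flops:.1e} FLOPs")  # ~9.0e+20 FLOPs, i.e. on the order of 1e21
```

A model answering without search has to reconstruct both the formula and plausible inputs like these, which is why the question is a good knowledge-plus-reasoning probe.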
- Riemann Hypothesis: Grok 3 attempted to solve the Riemann hypothesis, showing initiative by not giving up on challenging problems, unlike some other models.
- DeepSearch: This feature combines research capabilities with thinking, providing high-quality responses to various lookup questions. However, it had issues with referencing X as a source and occasionally hallucinated non-existent URLs.
- Humor: Grok 3's humor capability did not show improvement, failing to generate novel or sophisticated jokes, which is a common challenge for LLMs.
- Ethical Sensitivity: The model was overly sensitive to complex ethical issues, avoiding questions that might involve ethical dilemmas.
- SVG Generation: Grok 3 failed at generating an SVG of a pelican riding a bicycle, a test of spatial layout abilities, performing better than some models but not as well as Claude.
Karpathy concluded that Grok 3, with its thinking capabilities, is around the state-of-the-art level, slightly outperforming models like DeepSeek-R1 and Gemini 2.0 Flash Thinking. Considering that xAI started from scratch about a year ago, that is an unprecedented achievement.