There are so many parts to this that if I try to respond myself, I know I'll miss something, so I asked Cursor to write a response for you. Words in { } are mine.

I use a structured generation + validation pipeline, not freeform prompting.

1) "Why no Codex/Claude Code?"

- For the coding workflow, I do use an IDE agent setup (Cursor-style workflow) to build the system. { If you mean why I use Cursor over Claude, the main reason is that I have a lot of customizations in this harness, and I like the developer experience. My initial trials of vanilla Claude Code didn't feel like they gave me better outcomes, and I like Cursor's deep VS Code integration. The last time I tested Claude, I didn't get most of these. }
- For runtime question generation, I call models through a Python AI client with role-specific models (via LiteLLM + Instructor), so the model choice is swappable per role.
- The current default role split is:
  - generate: anthropic/claude-sonnet-4-6
  - validate: anthropic/claude-opus-4-6
  - vision QA: gpt-4.1 { I can't seem to get later OpenAI models working, and I didn't spend much time on it }
  - image generation: openai/dall-e-3
- So in practice this is not tied to one assistant UX; it is model-orchestrated infrastructure.

2) On DeepSeek for math

- Agreed, it is worth testing. The architecture supports this easily because the generation and validation models are config-driven.
- I can A/B it in two places:
  - generation model (creative drafting quality + schema reliability)
  - validation/judge model (answer consistency + pedagogical filtering)
- I would not swap blindly; I'd run controlled evals on the same dataset/bank slice and compare pass rates by scorer, not just anecdotal output quality.

3) "What evaluation/guardrails are you doing?"

This is where most of the work went. There are two layers:

A) Production validation gate (used before questions are accepted)

- Structural gate:
  - exactly 4 options (A/B/C/D)
  - exactly one correct answer key
  - duplicate/equivalent option detection
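
To make the structural gate concrete, here is a minimal Python sketch of the three checks listed above. It is an illustration under assumptions, not the actual pipeline code: the names `normalize` and `structural_errors` are hypothetical, and the equivalence rule (case and whitespace folding, plus numeric comparison via `fractions.Fraction`) is just one plausible way to detect equivalent options.

```python
from fractions import Fraction

EXPECTED_KEYS = ("A", "B", "C", "D")


def normalize(text):
    """Hypothetical canonical form: fold case/whitespace, and collapse
    plain numbers such as "0.5" and "1/2" to the same Fraction."""
    cleaned = " ".join(text.strip().lower().split())
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        # Not a plain number (or something like "1/0"): compare as text.
        return cleaned


def structural_errors(options, correct_keys):
    """Return a list of problems; an empty list means the gate passes.

    options: dict mapping option key -> option text
    correct_keys: list of keys marked as correct
    """
    errors = []
    # Check 1: exactly four options, keyed A/B/C/D.
    if tuple(sorted(options)) != EXPECTED_KEYS:
        errors.append("options must be keyed exactly A, B, C, D")
    # Check 2: exactly one correct answer key, and it must exist.
    if len(correct_keys) != 1 or correct_keys[0] not in options:
        errors.append("exactly one correct answer key is required")
    # Check 3: no two options may normalize to the same value.
    seen = {}
    for key in sorted(options):
        canon = normalize(options[key])
        if canon in seen:
            errors.append(f"options {seen[canon]} and {key} are equivalent")
        else:
            seen[canon] = key
    return errors


question = {"A": "1/2", "B": "0.5", "C": "3", "D": "2"}
print(structural_errors(question, ["A"]))
# prints ['options A and B are equivalent']
```

Returning a list of errors rather than a single boolean keeps rejections explainable, which helps when comparing pass rates per check. A real gate would likely need a stronger equivalence test (for example, symbolic comparison of expressions) than this text-and-number normalization.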