Hello All, this is a mixture of two AI outputs based on my personal projects' AI operational workflow. Sharing as food for thought and discussion. There is a lot to read here. Not sure how well the formatting will hold up.
Any feedback welcome. Lots more ideas to take this further are in the works, but as of now it's performing very well, even with weaker models (in fact, I've been testing and improving it by feeding weak local models' outputs to frontier models, adjusting the setup so the same issues don't occur again).
⭐ PART 1 - Operational
The approach is to make agent work explicit, bounded, and evidence-driven.
Agents are given:
- A map.
- A workflow.
- A small reading list.
- Known pitfalls.
- Specialist help.
- Verification gates.
- Review expectations.
- Examples of good output.
Humans get:
- More predictable agent behavior.
- Cleaner handoffs.
- Easier review.
- Less repeated explanation.
- A mechanism for turning mistakes into durable process improvements.
The most important shift is cultural: stop treating AI assistance as a chat transcript and start treating it as an operational system. The files, prompts, rubrics, checklists, and gotchas are the system. The agent is just the worker moving through it.
# Generic AI Agent Setup Overview
This document describes a reusable, high-level approach for setting up a repository so AI agents can work inside it reliably. It intentionally avoids product-specific, stack-specific, and implementation-specific details. The focus is the operating model: how agents discover context, choose the right workflow, avoid repeated mistakes, produce evidence, and hand work back to humans in a reviewable state.
The core idea is to treat AI agents like fast but context-limited contributors. They need a clear map of the repository, explicit rules for what they may and may not change, task-specific reading lists, examples of good finished work, and gates that force them to prove outcomes rather than merely assert confidence.
---
## 1. The Overall Philosophy
The setup is built around one principle:
> Do not rely on an agent remembering how the project works. Put the working agreement in files the agent can read, follow, and cite.
AI agents are strongest when they are given a small, precise context window and a concrete definition of done. They are weakest when every session starts with a vague prompt, scattered tribal knowledge, and no reliable way to distinguish "looks right" from "is verified."
This setup solves that by creating an explicit agent-readiness layer around the repository. That layer gives agents:
1. A root entry point that tells them where to start.
2. Local instructions near the code or documents they are changing.
3. Task-specific context packs so they do not read everything every time.
4. A workflow that moves from intent to acceptance criteria to implementation to verification.
5. Known gotchas and anti-patterns written in a format agents can quickly apply.
6. Specialist skills, prompts, and sub-agents for recurring task types.
7. A rubric that forces every non-trivial task to end with evidence.
8. Drift-prevention checks so instructions do not contradict each other over time.
The result is not "the AI can do anything." The result is "the AI has rails." It can move faster because it has boundaries.
---
## 2. The Documentation Layers
The setup is intentionally layered. Each layer has a distinct job, so instructions do not become one giant file that nobody can maintain.
### 2.1 Root Agent Router
At the root of the repository, there is a top-level agent instruction file. Its role is not to explain every detail. Its role is to route the agent.
A good root router answers:
- What kind of repository is this, at a very high level?
- What are the most important rules that apply everywhere?
- Where does an agent go for task-specific context?
- Which commands or checks define "done"?
- Which files or directories are safe to edit?
- Which files or directories should not be touched without explicit instruction?
- Where are decisions, gotchas, prompts, skills, and review gates documented?
The root router should be short enough that every agent can read it at the beginning of a session, but complete enough that it prevents dangerous assumptions.
The pattern is:
```text
Root agent file
-> points to area-specific instructions
-> points to context packs
-> points to gotchas
-> points to change-type gates
-> points to verification requirements
```
This gives agents a deterministic first step instead of making them search the repository blindly.
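To make this concrete, a minimal root router might look like the sketch below. The file name and paths match the generic layout suggested in section 21; the rules themselves are illustrative placeholders, not prescriptions.
```markdown
# AGENTS.md

Monorepo of isolated services. Read this file first, then follow the links.

## Always
- Read the local instruction file for any area you edit.
- Run that area's verification command before claiming "done".

## Never
- Touch generated files or another area's internals without explicit instruction.

## Where to go next
- Task-type reading lists: docs/agents/context-packs.md
- Known pitfalls: docs/agents/gotchas/INDEX.md
- Gates per change type: docs/agents/change-type-matrix.md
- Evidence requirements: docs/agents/workflow-rubric.md
```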
### 2.2 Area-Specific Agent Files
Each major area of the repository has its own local instruction file. These files contain rules that are too detailed for the root router but critical for that specific area.
Area-specific files typically include:
- The purpose of that area.
- Important boundaries.
- Local testing or verification commands.
- Naming conventions.
- Common failure modes.
- Links to relevant gotchas.
- Any local architectural or workflow constraints.
The useful pattern is proximity: put the instructions near the work. If an agent is editing a particular area, it should not need to infer local rules from unrelated files.
### 2.3 Agent Readiness Directory
There is a dedicated documentation area for agent operations. This is not normal product documentation. It is a playbook for how agents should orient themselves and execute work.
This directory contains things like:
- A quick start for new agent sessions.
- A map of major repository areas.
- Context packs for common task types.
- Boundary-change checklists.
- Handoff templates.
- Examples of completed work.
- Instruction ownership maps.
- Drift-prevention guidance.
This layer exists because the best prompt is not enough. Agents need reusable scaffolding. If every session requires restating the same rules, the setup is fragile.
### 2.4 Machine-Readable Maps
Some information is useful both to humans and scripts. For example:
- Area names.
- Paths.
- Relevant instruction files.
- Relevant gotchas.
- Local verification commands.
- Ownership boundaries.
That information can live in a machine-readable map, such as a YAML or JSON file. The key benefit is consistency. Scripts can validate it, agents can read it, and humans can update one source of truth instead of copying tables across documents.
The rule is simple:
> If the same list appears in more than one place, consider making it machine-readable and linking to it.
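For example, a small area map might look like the sketch below. Area names, fields, and commands are illustrative, not prescribed.
```yaml
# docs/agents/area-map.yml: single source of truth for area metadata
areas:
  - name: billing                # illustrative area name
    path: services/billing/
    instructions: services/billing/AGENTS.md
    gotchas:
      - docs/agents/gotchas/testing-gotchas.md
    verify: make -C services/billing verify
    owner: backend
```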
### 2.5 Instruction Ownership Map
As the agent setup grows, the same rule can accidentally appear in multiple files with slightly different wording. This creates instruction drift.
An instruction ownership map prevents that by declaring which file is the canonical source for each recurring rule.
For example, it can define:
- The root router owns global rules.
- The context-pack file owns task-type reading lists.
- The change-type matrix owns gate applicability.
- The rubric owns scoring.
- The gotchas index owns known pitfalls.
- The custom-agent directory owns sub-agent behavior.
- The pull request template owns final evidence requirements.
Secondary files should summarize or link to the canonical source rather than duplicating the rule in full.
This matters because agents are literal. If two files disagree, the agent may choose the wrong one, merge them badly, or waste time resolving contradictions.
---
## 3. Context Packs: Small Reading Lists for Specific Tasks
One of the most important parts of the setup is the use of context packs.
A context pack is a short, task-specific reading list. Instead of telling agents to read the whole repository, the pack says: "For this type of task, read these files first."
Examples of task types that can have context packs:
- New feature.
- Bug fix.
- User interface change.
- API or contract change.
- Data migration.
- Configuration change.
- Shared library change.
- Documentation-only change.
- Debugging a test failure.
- Security-sensitive change.
- Release or deployment preparation.
Each context pack should include:
- Required files to read.
- Optional files to read only if relevant.
- Skills or specialist agents to invoke.
- Commands or checks usually required.
- Known gotchas for that task type.
The purpose is to reduce token waste and improve focus. Agents do not need every document for every task. They need the right documents at the right time.
This also makes the setup teachable. A human assigning work can say, "Use the API-change context pack," and the agent knows what that means.
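A context pack can stay very short. A sketch for a hypothetical API-change pack:
```markdown
## Context pack: API or contract change
Read first:
- docs/agents/boundary-change-checklists.md (API section)
- docs/agents/gotchas/workflow-gotchas.md
Read only if relevant:
- docs/agents/decisions/INDEX.md (when changing an established pattern)
Invoke: API/interface design skill
Usual checks: contract tests, acceptance tests
Known gotchas: versioning, backward compatibility
```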
---
## 4. Gotchas as a Living Knowledge Base
The setup includes a gotchas library: a living record of mistakes that have happened before and how to avoid them.
The most useful format is:
```markdown
## Mistake name
**BAD:** What the wrong approach looks like.
**GOOD:** What the correct approach looks like.
**BECAUSE:** Why the difference matters.
```
This format works well for agents because it is explicit. It does not just say "be careful." It shows the failure mode, the desired behavior, and the reasoning.
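A filled-in entry, using a hypothetical testing mistake, might read:
```markdown
## Asserting on mocks instead of behavior
**BAD:** The test only verifies that a mocked method was called.
**GOOD:** The test asserts on the observable output or resulting state.
**BECAUSE:** Mock-call assertions keep passing when the real behavior is
wrong, so refactors break the test while genuine bugs slip through.
```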
Gotchas can cover:
- Testing mistakes.
- Workflow mistakes.
- Boundary mistakes.
- Security mistakes.
- Documentation mistakes.
- Tooling mistakes.
- Review mistakes.
- Repeated false assumptions.
The gotchas library should be indexed so agents can quickly find entries by area or task type.
The important practice is to add a gotcha when a real mistake is discovered. Do not rely on memory. If an agent or human hits the same issue twice, it belongs in the gotchas library.
---
## 5. Skills and Specialist Agents
The setup distinguishes between general agent behavior, skills, and specialist agents.
### 5.1 Skills
Skills are reusable operating modes or playbooks. They tell the current agent how to approach a class of work.
Examples of useful skills:
- Planning and task breakdown.
- Specification writing.
- Acceptance-test design.
- Test-driven development.
- Debugging and error recovery.
- Security review.
- API/interface design.
- Frontend or user-interface engineering.
- Documentation and decision records.
- Code simplification.
- Performance optimization.
- Release preparation.
The benefit of a skill is consistency. Instead of improvising a workflow every time, the agent invokes the relevant skill and follows its checklist.
### 5.2 Specialist Agents
Specialist agents are separate agent roles used for focused tasks. They should have narrow responsibilities and strong instructions.
Examples:
- A code reviewer that only reviews and does not edit.
- A security auditor that focuses on threats and vulnerabilities.
- A test engineer that evaluates test strategy and coverage.
- A documentation pruner that finds stale docs after changes.
Specialist agents are most useful when they produce independent review. They reduce the chance that the same agent that wrote the change also rubber-stamps it.
### 5.3 Skill Selection Guide
A skill selection guide maps task intent to the right skill or specialist agent.
For example:
| When the task is... | Use... |
|---|---|
| Planning work | Planning skill |
| Writing acceptance criteria | Acceptance-testing skill |
| Changing behavior | TDD skill |
| Reviewing before merge | Review agent |
| Security-sensitive | Security skill or auditor |
| Documentation or decisions | Documentation skill |
This guide prevents agents from guessing which process applies.
---
## 6. Chat Modes and Reusable Prompts
The setup includes reusable chat modes and prompt templates.
### 6.1 Chat Modes
Chat modes define the broad posture of the agent. Common modes include:
- Planning mode.
- Implementation mode.
- Review mode.
- Debugging mode.
The point is to make the agent's behavior match the phase of work.
Planning mode should not start editing code or documents unless asked. Implementation mode should persist until the task is working. Review mode should look for risks and evidence gaps. Debugging mode should follow symptoms to root cause instead of guessing.
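In tools that store chat modes as files (for example, VS Code's `*.chatmode.md` convention), a mode is a short instruction file with metadata frontmatter. A hedged sketch of a planning mode:
```markdown
---
description: Planning mode. Analyse and propose; do not edit files.
tools: ['codebase', 'search']
---
You are in planning mode. Produce a plan listing the affected areas,
acceptance criteria, and risks. Do not modify any files until the user
explicitly switches to implementation mode.
```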
### 6.2 Prompt Templates
Prompt templates are starter prompts for recurring work.
Useful templates include:
- Add a new feature.
- Debug a failing test.
- Add or change an endpoint.
- Add or change an event.
- Add or change configuration.
- Add or change a migration.
The template should ask for:
- Goal.
- Scope.
- Out of scope.
- Relevant area.
- Acceptance criteria.
- Expected tests or checks.
- Known gotchas.
- Required evidence.
Prompt templates make human-to-agent handoff much more reliable. They reduce ambiguity before work begins.
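An illustrative fill-in-the-blank template for one of these (the endpoint case):
```markdown
# Prompt: Add or change an endpoint
Goal: <what the endpoint should do>
Scope: <service or area>
Out of scope: <what must not change>
Acceptance criteria:
- <observable outcome>
- <failure case that must be handled>
Expected checks: contract tests, acceptance tests
Known gotchas: <links to relevant entries>
Required evidence: test output pasted into the handoff
```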
---
## 7. The Workflow Model
The workflow is designed to keep agents from jumping straight to implementation.
A generic version looks like this:
```text
Intent
-> Plan
-> Acceptance criteria
-> Failing test or proof of gap
-> Implementation
-> Passing tests
-> Refactor
-> Targeted quality checks
-> Full verification
-> Review
-> Handoff with evidence
```
The exact commands vary by repository. The important part is the sequence.
### 7.1 Intent
The agent starts by understanding the task:
- What is the user asking for?
- Is this a change, investigation, review, or plan?
- Which area is affected?
- Which context pack applies?
- Is clarification needed before editing?
### 7.2 Plan
For non-trivial work, the agent creates or follows a plan. The plan should identify:
- The problem.
- The proposed approach.
- The affected areas.
- The likely tests or checks.
- Any risk or boundary being touched.
The plan does not need to be long. It needs to be explicit enough that the agent can avoid wandering.
### 7.3 Acceptance Criteria
Before implementation, the expected behavior should be described in user-facing or outcome-facing terms.
This matters because agents can otherwise implement whatever is easiest to test, not necessarily what the user asked for.
Acceptance criteria should answer:
- What should be true when the task is complete?
- What failure or edge case must be handled?
- What behavior must not change?
- How will the result be verified?
### 7.4 Test-First or Evidence-First Development
For behavior changes, the setup requires proof that the desired behavior was failing before the change and passing after it.
This can be:
- A failing acceptance test.
- A failing unit test.
- A reproduced bug.
- A failing build or check.
- A documented gap in a docs-only task.
The agent should not merely say "this would have failed." It should capture real evidence where practical.
### 7.5 Implementation
Implementation should be incremental and scoped. Agents are instructed to:
- Make surgical changes.
- Avoid unrelated refactors.
- Preserve existing behavior.
- Respect local instructions.
- Reuse existing patterns before adding new ones.
- Avoid unsafe shortcuts.
- Avoid silently swallowing errors.
- Avoid broad assumptions.
The setup biases toward maintainability rather than the quickest patch.
### 7.6 Green Phase
After implementation, the relevant tests or checks must pass.
The important distinction is:
> Passing evidence beats confidence.
An agent should not claim completion based on code inspection alone when a check exists.
### 7.7 Refactor
Once behavior is green, cleanup is allowed. Refactoring should be behavior-preserving and should not blur the evidence trail.
The setup discourages mixing large formatting changes with behavior changes because it makes review harder.
### 7.8 Verification
Before handoff, the agent runs the required verification gates for the change type. These might include tests, linting, type checks, security scans, acceptance checks, documentation checks, or full repository verification.
The exact tools are repository-specific. The pattern is generic:
- Small checks while iterating.
- Broader checks before handoff.
- Full verification before merge or release.
### 7.9 Review
Non-trivial work should receive a review pass. This can be done by a specialist review agent or by a human.
Review should focus on:
- Correctness.
- Maintainability.
- Boundary violations.
- Security implications.
- Missing tests.
- Incomplete evidence.
- Documentation drift.
### 7.10 Handoff
The final response or pull request should include what changed and the evidence that proves it.
For agents, the handoff is part of the work. A change that cannot be reviewed is not really done.
---
## 8. Change-Type Gates
Different changes need different gates. A documentation-only change should not require the same checks as a security-sensitive behavior change.
The setup uses a change-type matrix to define which gates apply.
Generic change types might include:
- Documentation-only.
- User interface only.
- Domain or business logic.
- API or contract.
- Security-sensitive.
- Data migration.
- Infrastructure or adapter.
- Shared library.
- Tooling or CI.
- Configuration.
For each change type, the matrix answers:
- Are acceptance tests required?
- Are unit tests required?
- Is mutation or robustness testing required?
- Is an architectural decision record required?
- Are contract tests required?
- Is a specialist review required?
- Is full verification required?
The matrix prevents two common problems:
1. Over-testing trivial changes.
2. Under-testing risky changes.
It also gives reviewers a stable way to challenge missing evidence.
---
## 9. Evidence Tiers
The setup separates gates into enforcement tiers.
### 9.1 Automated Gates
These are checks that run in automation and can block a merge.
Examples:
- Formatting.
- Linting.
- Type checking.
- Unit tests.
- Acceptance tests.
- Required file existence.
- Basic documentation structure.
Automated gates are valuable because they do not depend on memory.
### 9.2 Process-Enforced Gates
These are required but not always easy to enforce automatically.
Examples:
- Full local verification.
- Mutation testing.
- Contract testing against dependent systems.
- Manual browser checks.
- Evidence pasted into a pull request.
The key is to require real output, not vague claims.
### 9.3 Honour-System Gates
Some important practices cannot be fully proven after the fact.
Examples:
- Acceptance criteria were written before implementation.
- A true red/green cycle happened.
- The agent chose the correct context pack.
- The ADR decision was considered at the right time.
- The work was reviewed thoughtfully.
These gates rely on discipline and review. They should still be documented because naming them makes them inspectable.
---
## 10. The Workflow Rubric
The workflow rubric is a scorecard appended to non-trivial task summaries. Its job is to force explicit evidence.
The rubric typically includes phases such as:
- Planning.
- Decision records.
- Acceptance criteria.
- Test-first proof.
- Green test proof.
- Refactor.
- Acceptance verification.
- Mutation or robustness checks.
- Contract checks.
- Review.
- Quality checks.
- Security checks.
- Git hygiene.
- Full verification.
- Documentation updates.
- Post-task reflection.
Each row has:
- An ID.
- A phase.
- A gate.
- A status.
- Evidence.
Statuses are typically:
- Met.
- Partial.
- Missed.
- Not applicable.
The rubric should define mandatory gates. If a mandatory gate is missed, the work is not ready, regardless of the overall score.
This is especially useful with agents because it prevents polished but unsupported final messages. The agent must show the difference between "done," "not applicable," and "not done."
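Concretely, a few rubric rows might look like this (the IDs, gates, and evidence are illustrative):

| ID | Phase | Gate | Status | Evidence |
|---|---|---|---|---|
| T1 | Test-first proof | Failing test captured before the change | Met | RED output pasted in summary |
| T2 | Green test proof | Relevant suite passes after the change | Met | `42 passed, 0 failed` |
| C1 | Contract checks | Contract tests run | Not applicable | No API surface changed |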
---
## 11. Done Examples
The setup includes examples of completed work.
These examples show what "good" looks like for different task types:
- A feature with acceptance criteria and tests.
- A user interface change.
- A bug fix with regression proof.
- A documentation-only change.
- A boundary or contract change.
Each done example should include:
- What was changed.
- What evidence was captured.
- Which gates were applicable.
- Which gates were not applicable and why.
- What the final handoff looked like.
This is calibration. Agents perform better when they can compare their output to a concrete example instead of abstract instructions.
---
## 12. Boundary-Change Checklists
Boundary changes are risky because they affect more than the file being edited.
Examples of boundaries:
- Public interfaces.
- Internal interfaces between modules.
- Events or messages.
- Data schemas.
- Authentication or authorization behavior.
- Configuration.
- Deployment or runtime assumptions.
- Shared libraries.
Boundary-change checklists remind agents to update all connected surfaces.
A checklist might ask:
- Does this need a decision record?
- Does documentation need to change?
- Are downstream callers affected?
- Are tests needed on both sides of the boundary?
- Is there a compatibility or migration path?
- Does the change need a new contract test?
- Does any generated schema or documentation need updating?
- Does a reviewer need to inspect security or backward compatibility?
This reduces the chance that an agent changes one side of an interface and forgets the other.
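As an ordered checklist for one hypothetical boundary type, an API contract change, this might look like:
```markdown
## Checklist: API contract change
1. Decide whether a decision record is needed; write it first if so.
2. Update the contract or schema definition.
3. Update the provider implementation and its tests.
4. Update every downstream caller and its tests.
5. Add or update the contract test.
6. Regenerate any derived schema or documentation.
7. Record the compatibility or migration path in the handoff.
```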
---
## 13. Decision Records
The setup uses decision records for meaningful architectural or workflow decisions.
A decision record should explain:
- Context.
- Decision.
- Alternatives considered.
- Consequences.
- Status.
- Date or sequence.
The point is not bureaucracy. The point is to prevent the same decision from being relitigated every few months by humans or agents who lack historical context.
Agents benefit from decision records because they can read why a pattern exists. Without that, they often "improve" code by undoing intentional trade-offs.
Decision records should be indexed. Old decision records should not be deleted when superseded; they should be marked as superseded and linked to the newer decision.
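A minimal decision record template covering those fields:
```markdown
# 0001: <decision title>
Status: Accepted (or: Superseded by <newer record>)
Date: <date or sequence>

## Context
The problem or constraint that forced a choice.

## Decision
What was chosen.

## Alternatives considered
What was rejected, and why.

## Consequences
What becomes easier, harder, or locked in.
```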
---
## 14. Human-to-Agent Handoff
The setup includes a task brief template for assigning work to agents.
A good task brief contains:
- The goal.
- The reason for the work.
- In-scope items.
- Out-of-scope items.
- Relevant files or areas.
- The context pack to use.
- Acceptance criteria.
- Required tests or checks.
- Known gotchas.
- Any constraints or preferences.
- Expected final evidence.
This avoids the common problem where a human gives a short prompt and the agent fills in the blanks incorrectly.
The brief does not need to be long for every task. It needs to remove ambiguity where ambiguity would change the implementation.
---
## 15. Agent-to-Agent Handoff
For larger work, one agent may plan, another may implement, another may review, and another may prune documentation.
This only works if handoffs are explicit.
An agent-to-agent handoff should include:
- Current state.
- Completed steps.
- Remaining steps.
- Files changed.
- Tests run.
- Known failures.
- Decisions made.
- Open questions.
- What not to touch.
The receiving agent should not have to reconstruct the full history from chat logs.
---
## 16. Review Model
The review model treats agent output as useful but not inherently trusted.
Review asks:
- Did the agent follow the right context pack?
- Did it read local instructions?
- Did it avoid unrelated changes?
- Did it update every affected surface?
- Did it produce real evidence?
- Did it mark non-applicable gates honestly?
- Did it preserve boundaries?
- Did it avoid inventing facts?
- Did it document new gotchas?
Specialist review agents can help, but humans remain responsible for accepting risk.
The most important review habit is to inspect evidence before implementation details. If the workflow evidence is missing, the implementation may be premature no matter how good it looks.
---
## 17. Drift Prevention
An agent setup can decay. Rules get copied, files move, new task types appear, old gotchas become stale, and prompts stop matching reality.
This setup counters drift with:
- A canonical instruction map.
- Machine-readable area metadata.
- A gotchas index.
- A skill selection guide.
- A change-type matrix.
- A workflow rubric.
- Consistency checks for agent documentation.
- A rule that secondary docs link to canonical sources instead of duplicating them.
Drift prevention should be part of normal maintenance. Whenever agent instructions change, run whatever validation exists for the agent docs.
Useful checks include (a validation sketch follows this list):
- Every area listed in the map has an instruction file.
- Every gotcha file is indexed.
- Every context pack links to existing files.
- Every custom agent has required metadata.
- Every chat mode has required metadata.
- The pull request template and rubric agree.
- Repeated rules point to the same canonical source.
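A minimal validation sketch in Python, assuming the generic layout from section 21 (the file names, map fields, and link conventions are illustrative, not a fixed API):
```python
#!/usr/bin/env python3
"""Drift check: verify the agent docs stay internally consistent."""
import re
import sys
from pathlib import Path

import yaml  # pip install pyyaml

DOCS = Path("docs/agents")
errors = []

# 1. Every area in the machine-readable map points at a real instruction file.
area_map = yaml.safe_load((DOCS / "area-map.yml").read_text())
for area in area_map.get("areas", []):
    if not Path(area["instructions"]).is_file():
        errors.append(f"{area['name']}: missing instruction file {area['instructions']}")

# 2. Every gotcha file is listed in the index.
index_text = (DOCS / "gotchas" / "INDEX.md").read_text()
for gotcha in (DOCS / "gotchas").glob("*-gotchas.md"):
    if gotcha.name not in index_text:
        errors.append(f"gotcha file not indexed: {gotcha.name}")

# 3. Every relative markdown link in the context packs resolves.
packs = (DOCS / "context-packs.md").read_text()
for link in re.findall(r"\]\(([^)#]+)\)", packs):
    if not link.startswith("http") and not Path(link).exists():
        errors.append(f"context pack links to missing file: {link}")

if errors:
    print("\n".join(errors))
    sys.exit(1)
print("agent docs look consistent")
```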
---
## 18. How a New Agent Session Starts
A generic new-session flow looks like this:
1. Read the root agent router.
2. Identify the task type.
3. Choose the relevant context pack.
4. Read the area-specific instructions.
5. Read the relevant gotchas.
6. Check the change-type matrix.
7. Decide whether clarification is needed.
8. Create or follow a plan.
9. Execute the workflow.
10. Fill the rubric and hand off evidence.
This sequence is simple, but it matters. It makes agent work repeatable.
Without this sequence, each session becomes a fresh negotiation between the user's prompt, the agent's assumptions, and whatever files the agent happens to read.
---
## 19. How Humans Use the Setup
Humans use the setup in three ways.
### 19.1 Assigning Work
When assigning work, the human points the agent to:
- The task brief.
- The context pack.
- The affected area.
- The required evidence.
This gives the agent a strong start.
### 19.2 Reviewing Work
When reviewing, the human checks:
- The rubric.
- The test or verification evidence.
- The changed files.
- The relevant gotchas.
- The change-type gates.
- Any decision records.
This turns review from "does this look okay?" into "does this satisfy the agreed workflow?"
### 19.3 Improving the System
When something goes wrong, the human updates the agent setup:
- Add a gotcha.
- Update a context pack.
- Clarify a local instruction.
- Add a checklist item.
- Add a done example.
- Improve the prompt template.
- Add a validation check.
The goal is that every repeated mistake becomes harder to repeat.
---
## 20. How to Recreate This Pattern in Another Repository
To apply this approach elsewhere, start small. Do not create a huge agent framework on day one.
### Step 1: Add a Root Agent Router
Create a root file that tells agents:
- What the repository is.
- Where to start.
- What rules always apply.
- How to verify work.
- What not to touch.
- Where deeper instructions live.
### Step 2: Add Local Instruction Files
Add local instruction files for major areas. Keep them close to the files they govern.
Each local file should answer:
- What is special about this area?
- What are the common commands?
- What patterns should be followed?
- What mistakes should be avoided?
### Step 3: Add Context Packs
Create task-type reading lists. Start with the top five task types your agents perform most often.
For each task type, list:
- Must-read files.
- Optional files.
- Required checks.
- Recommended skills or specialist agents.
### Step 4: Add a Gotchas Library
Create a gotchas directory and index. Add entries only for real mistakes or high-risk patterns.
Use the BAD / GOOD / BECAUSE format.
### Step 5: Add a Change-Type Matrix
Define which gates apply to which kinds of changes.
This removes ambiguity about whether a task needs tests, decision records, contract checks, or full verification.
### Step 6: Add a Workflow Rubric
Create a scorecard that agents must fill for non-trivial work.
Make sure it includes mandatory gates and evidence fields.
### Step 7: Add Prompt Templates
Add reusable prompts for recurring work. Keep them practical and fill-in-the-blank.
### Step 8: Add Specialist Agents
Only add specialist agents when a recurring review or task type benefits from a separate role.
Start with:
- Code reviewer.
- Security reviewer.
- Test reviewer.
- Documentation reviewer.
### Step 9: Add Drift Checks
As the setup grows, add scripts or checks that verify the docs stay aligned.
At minimum, validate links, indexes, required metadata, and repeated-rule ownership.
### Step 10: Iterate From Real Failures
Do not try to predict every possible rule. Let the system evolve from actual mistakes.
When a mistake happens:
- Fix the work.
- Add or update the relevant instruction.
- Add a gotcha if it is likely to recur.
- Add a check if it can be automated.
---
## 21. Suggested Generic Folder Structure
A generic version of this setup might look like:
```text
repo-root/
  AGENTS.md
  docs/
    agents/
      README.md
      context-packs.md
      instruction-map.md
      area-map.yml
      boundary-change-checklists.md
      task-brief-template.md
      done-examples/
        feature.md
        bug-fix.md
        docs-only.md
        ui-change.md
      gotchas/
        INDEX.md
        workflow-gotchas.md
        testing-gotchas.md
        security-gotchas.md
      decisions/
        INDEX.md
        0001-example-decision.md
      workflow-rubric.md
      change-type-matrix.md
  .github/
    agents/
      code-reviewer.md
      security-auditor.md
      test-engineer.md
      doc-pruner.md
    chatmodes/
      plan.chatmode.md
      implement.chatmode.md
      review.chatmode.md
      debug.chatmode.md
    prompts/
      new-feature.prompt.md
      debug-failure.prompt.md
      add-endpoint.prompt.md
    instructions/
      language-or-area-specific.instructions.md
  SKILLS.md
```
The exact paths do not matter. The separation of concerns does.
---
## 22. What This Setup Optimizes For
This setup optimizes for:
- Fast orientation.
- Consistent behavior across sessions.
- Reduced context waste.
- Safer changes.
- Reviewable evidence.
- Lower instruction drift.
- Better human-agent collaboration.
- Capturing lessons learned.
- Preventing repeated mistakes.
- Making "done" explicit.
It does not optimize for:
- Minimal documentation.
- Maximum agent autonomy without checks.
- One-shot prompting.
- Blind trust in generated changes.
- Hiding workflow complexity.
The point is not to make agents feel magical. The point is to make them dependable.
---
## 23. Common Anti-Patterns This Avoids
### "Just read the whole repo"
This wastes context and makes the agent less focused. Context packs are better.
### "The agent will remember"
It will not reliably remember across sessions. Put rules in files.
### "The final answer says it passed"
Claims are not evidence. Require command output or equivalent proof.
### "We only need one instruction file"
One huge instruction file becomes stale and hard to navigate. Layered instructions are easier to maintain.
### "Every rule belongs everywhere"
Duplicated rules drift. Use an instruction ownership map.
### "All changes need the same gates"
This either overburdens small changes or under-protects risky ones. Use a change-type matrix.
### "Gotchas are obvious"
They are obvious after they bite you. Write them down.
### "Review the diff first"
Review the evidence first. If the workflow is missing, the diff may be premature.
---
## 24. The Most Important Practices
If you only copy a few parts of this setup, copy these:
1. A root agent router that tells agents where to start.
2. Context packs that keep sessions focused.
3. A gotchas library written as BAD / GOOD / BECAUSE.
4. A change-type matrix that defines required gates.
5. A workflow rubric with evidence fields.
6. Local instruction files near the work.
7. A skill or specialist-agent guide.
8. Done examples that show the expected standard.
9. Drift-prevention rules so instructions stay aligned.
Together, these create an operating system for AI-assisted development.
---
⭐ PART 2 - Workflow (things have progressed since this was written)
# AI Agent Workflow Setup
---
## Executive Overview
Most codebases treat AI agents as smart autocomplete. This project treats them as **disciplined engineers with a full onboarding programme, a quality system, and a performance scorecard.**
The operational setup answers three questions every agent session must resolve:
1. **What am I allowed to do?** — Layered instruction files load automatically based on what is being touched. Rules carry explicit reasons, not just mandates.
2. **How do I do it correctly?** — A prescribed workflow with evidence requirements at every phase. Skills and agents are assigned to specific tasks, not left to improvisation.
3. **How do I prove I did it right?** — A scored rubric is appended to every task. CI validates what machines can check. Humans validate the rest. The gap between the two is honestly documented.
The result: agents produce production-ready work consistently, follow the same standards as human contributors, and improve the system every time they make a mistake.
---
## Contents
1. [Structural Foundation](#1-structural-foundation)
2. [AI Instruction System](#2-ai-instruction-system)
3. [Knowledge Management](#3-knowledge-management)
4. [The Workflow](#4-the-workflow)
5. [Quality & Enforcement System](#5-quality--enforcement-system)
6. [Skills & Agent Library](#6-skills--agent-library)
7. [Step-by-Step: PRD to Merged PR](#7-step-by-step-prd-to-merged-pr)
8. [CI Pipeline](#8-ci-pipeline)
9. [Meta-Design Principles](#9-meta-design-principles)
---
## 1. Structural Foundation
The codebase is a monorepo of isolated services. Each service owns its own code, its own database, its own tests, and its own acceptance tests. Services communicate only over defined interfaces — no direct imports between them. Shared code lives in a single shared library.
This isolation is not just an architectural choice — it is an AI operational choice. An agent working on one service cannot accidentally break another. Boundaries are enforced by automated tooling, not convention.
Each service follows a consistent internal structure:
```
service/
├── domain/ # Pure business logic — no framework imports
├── infrastructure/ # External adapters (DB, APIs, messaging)
├── api/ # Route handlers and request/response schemas
├── tests/ # Unit tests
└── features/ # Acceptance tests (Gherkin .feature files)
```
This predictability means an agent can orient in any service without needing bespoke guidance for each one.
---
## 2. AI Instruction System
Instructions load in layers — from broad project-wide rules down to file-pattern-specific conventions. An agent working on a Python backend file gets different active instructions than one working on a frontend component.
| Layer | What it covers |
|-------|----------------|
| **Project-wide system prompt** | Architecture rules, Always/Never mandates, workflow cycle, key commands |
| **Pattern instructions** | Language- and file-type-specific conventions, activated by glob pattern |
| **Root agent doc (`AGENTS.md`)** | Full architecture, service boundaries, domain model, JWT schema, environment variables |
| **Per-service agent docs** | Service-specific domain detail, simulation mechanics, app state, gotcha references |
| **Chatmodes** | Pre-configured interaction modes: `plan`, `implement`, `review`, `debug` — each loads the right mental model |
### Always/Never Rules with BECAUSE Clauses
Every rule carries an explicit reason:
```
Always use X, never Y BECAUSE without X, Z breaks in production
Never do A BECAUSE A causes B which is silent and hard to debug
```
This prevents agents from treating rules as arbitrary and working around them when they seem inconvenient.
### Task-Specific Starter Prompts
Pre-built prompts for common task types pre-load the right context before the agent begins:
- Adding a new endpoint → triggers gateway proxy checklist
- Adding a messaging event → triggers event envelope and topic conventions
- Adding a database migration → triggers migration patterns and rollback requirements
- Adding a config flag → triggers naming rules and fail-closed behaviour
- Debugging a test failure → triggers gotcha scan first
---
## 3. Knowledge Management
### Architecture Decision Records
Every significant architectural decision is recorded with context, decision, and consequences. There are 40+ ADRs covering infrastructure choices, security model, testing strategy, enforcement taxonomy, and agent operational design.
Agents read the ADR index before starting work. They know *why* decisions were made, not just *what* they are — which prevents well-intentioned agents from reversing good decisions.
### Gotchas Library
A living library of real bugs and process mistakes, each documented in BAD/GOOD/BECAUSE format. Organised by service and topic. Indexed for fast lookup.
Coverage includes: test setup patterns, migration pitfalls, authentication quirks, messaging edge cases, frontend state bugs, CI infrastructure issues, and workflow process anti-patterns.
> New mistakes are filed as gotcha entries, so no agent should make the same mistake twice.
### Agent Readiness Layer
A dedicated `docs/agents/` directory built specifically to reduce per-session orientation overhead:
| File | Purpose |
|------|---------|
| `context-packs.md` | Pre-built reading lists per task type — agents load only what they need |
| `service-map.yml` | Machine-readable index of all services: paths, ports, commands, gotcha references |
| `boundary-change-checklists.md` | Step-by-step ordered checklists for the six most error-prone change types |
| `task-brief-template.md` | Structured handoff template for human-to-agent and agent-to-agent work |
| `done-examples/` | Reference examples of properly completed tasks — calibrates what "done" looks like |
---
## 4. The Workflow
### The Canonical Cycle
```
Vision → Plan → ATDD → TDD → Acceptance Green → Mutate → Verify → Ship → Doc-Prune
```
Every step is defined, gated, and produces evidence.
### Incremental by Design
Work is delivered in small, verifiable increments:
```
spec → acceptance test → failing unit test → implement → green → refactor → mutate → commit
```
Each increment is complete before the next begins. Formatting changes never mix with behaviour changes.
### Change-Type Matrix
A matrix maps every change type to exactly which quality gates apply. This prevents both over-gating (wasting effort) and under-gating (missing required checks):
| Change Type | Acceptance Tests | Mutation | ADR | Contract Tests |
|-------------|:---:|:---:|:---:|:---:|
| Docs only | — | — | — | — |
| Frontend UI | — | — | Rare | — |
| Backend domain logic | ✅ | ✅ | Maybe | Maybe |
| API contract change | ✅ | ✅ | ✅ | ✅ |
| Security change | ✅ | ✅ | ✅ | ✅ |
| DB migration | ✅ | Optional | Rare | — |
| Shared library | ✅ | ✅ | Maybe | Maybe |
---
## 5. Quality & Enforcement System
### Three-Tier Enforcement Taxonomy
The project explicitly documents what each layer of enforcement can and cannot catch:
| Tier | Symbol | Meaning | Examples |
|------|--------|---------|---------|
| **CI-Automated** | 🤖 | Runs on every PR, blocks merge | Lint, typecheck, unit tests, acceptance tests, `.feature` file existence |
| **Process-Enforced** | 👤 | Required but manually triggered — results pasted into PR | Mutation kill rates, contract tests, local full-verify |
| **Honour System** | ✋ | Unenforceable by machine — agent discipline + human review | Acceptance tests written before code, RED/GREEN proof ordering |
> This honest taxonomy prevents "trust drift" — the dangerous failure mode where agents assume CI will catch what it won't, and take shortcuts accordingly.
### The Workflow Rubric
A scorecard appended to every non-trivial task summary, with 18 gates across 8 phases:
| Phase | Gates cover |
|-------|-------------|
| Planning | Plan existed before coding · ADR written |
| ATDD | `.feature` files before implementation · happy path + failure scenarios |
| Strict TDD | RED proof · GREEN proof · refactor + mutation per unit |
| Acceptance | `make test-acceptance` passes |
| Mutation | Kill rate thresholds met per layer |
| Contracts | Contract tests run if API/auth changed |
| Quality | Lint/typecheck/security/arch · coverage threshold |
| Git & Verify | Pre-commit passes · `make verify` passes |
| Docs | Module docstrings · gotchas filed · doc-pruner run |
**Scoring:** ✅ Met (2pts) · ⚠️ Partial (1pt) · ❌ Missed (0pts) · N/A excluded
**Traffic light:** 🟢 All mandatory gates pass, score ≥80% · 🟡 No mandatory failures but score <80% · 🔴 Any mandatory gate fails — PR must not open
**Six mandatory gates (🔴):** Acceptance tests written before code · RED phase proven · GREEN phase proven · `make test-acceptance` passes · pre-commit passes · `make verify` passes
### Mutation Testing
Tests are validated, not just counted. Mutation testing confirms the test suite actually catches real bugs — not just that it runs and passes. Thresholds are set per layer:
- Domain (core business logic): ≥80% kill rate
- Infrastructure and API layers: ≥60% kill rate
Targeted mutation runs after each TDD unit. A full sweep runs before the PR opens.
---
## 6. Skills & Agent Library
### Skills (23 available)
Reusable behavioural patterns invokable by the agent for specific task types:
| When | Skill |
|------|-------|
| Writing acceptance tests | `atdd-specialist` *(mandatory — pre-loaded with project patterns)* |
| TDD with proof output | `strict-tdd` *(mandatory — standard TDD without proof is not accepted)* |
| Designing APIs or module interfaces | `api-and-interface-design` |
| Breaking work into ordered tasks | `planning-and-task-breakdown` |
| Implementing incrementally | `incremental-implementation` |
| Debugging test failures | `debugging-and-error-recovery` |
| Building UI components | `frontend-ui-engineering` |
| Hardening against vulnerabilities | `security-and-hardening` |
| Recording architectural decisions | `documentation-and-adrs` |
| Reviewing code before merge | `code-review-and-quality` |
| Capturing decisions after a task | `post-task-reflection` |
| Preparing for production release | `shipping-and-launch` |
### Custom Agents
Specialist autonomous agents dispatched for specific phases:
| Agent | Role |
|-------|------|
| `code-reviewer` | Multi-axis review: correctness, readability, architecture, security, performance |
| `doc-pruner` | Scans all documentation for stale references after a feature ships |
| `security-auditor` | Threat modelling, vulnerability detection, secure coding review |
| `test-engineer` | Test strategy design and coverage analysis |
### Rubber-Duck Validation
Before implementing any non-trivial plan, the agent submits it to an independent **rubber-duck agent** for critique. This catches design flaws while course corrections are cheap — before code is written. Used reactively too, when test failures or repeated errors suggest the current approach is wrong.
---
## 7. Step-by-Step: PRD to Merged PR
### Phase 1 — Orient
1. Read project-wide agent doc and target service agent doc
2. Run `make explain-gates` → confirms which quality gates apply to this specific diff
3. Load the matching context pack → reads only relevant gotchas and prior decisions
4. Check the ADR index → confirms no prior decision contradicts the plan
### Phase 2 — Plan
5. Break the PRD into ordered, verifiable tasks
6. If architectural → write the ADR **now**, before any code
7. Submit the plan to the rubber-duck agent → resolve design flaws before building
### Phase 3 — ATDD First
8. Invoke `atdd-specialist` skill
9. Write `.feature` files in Gherkin — happy path and at least one failure scenario
10. Commit `.feature` files before any implementation code ✋
### Phase 4 — TDD Cycle *(repeats per unit)*
11. Invoke `strict-tdd` skill
12. Write a failing unit test → run suite → capture **RED output** (proof of failure)
13. Write minimum code to pass → run suite → capture **GREEN output** (zero regressions)
14. Refactor → run targeted mutation against that module → fix surviving mutants
15. Repeat for each unit in the plan
### Phase 5 — Acceptance Green
16. Run `make test-acceptance` → all `.feature` scenarios must pass
17. Any failures → loop back to Phase 4
18. All green → the PRD requirement is formally satisfied
### Phase 6 — Quality & Review
19. Run `make check` → lint, typecheck, security scan, import boundary validation
20. Run `pre-commit run --all-files` → local git gate
21. Dispatch `code-reviewer` agent → resolve all Critical findings
22. If API/auth/messaging headers changed → run contract tests 👤
### Phase 7 — Full Verify
23. Run `make verify` → the complete pre-PR gate (quality + tests + acceptance + mutation)
24. Fill in the 18-gate Workflow Rubric scorecard with evidence for every gate
25. Traffic light must be 🟢 before the PR opens
### Phase 8 — Pull Request
26. Open PR using the project template
27. Paste into PR description: rubric scorecard, mutation results, contract test results, ADR reference
28. CI pipeline triggers automatically
### Phase 9 — Post-Ship
29. Run `doc-pruner` agent → remove stale references caused by anything renamed or removed
30. File any new mistakes discovered as gotcha entries
31. Run `post-task-reflection` skill → capture decisions, edge cases, and lessons learned
---
## 8. CI Pipeline
On every pull request, workflows run in parallel:
```
PR opened
│
├── Quality checks ──────────── lint · typecheck · security · architecture boundaries
│
├── Core verify ─────────────── .feature files exist for changed services?
│ unit tests pass with coverage threshold?
│ make test-acceptance passes?
│
├── Per-service tests ───────── each service's test suite runs independently and in parallel
│
├── Architecture boundaries ─── no cross-service imports
│
└── PR completeness ─────────── PR description structure check
```
**What CI enforces (🤖):** `.feature` file existence · lint · typecheck · security · unit test coverage · acceptance test pass · architecture boundaries
**What is process-enforced (👤):** Mutation kill rates · contract tests · `make verify` results — all pasted into the PR description by the agent before opening
**What relies on human review (✋):** Was `.feature` written before code? Is the RED/GREEN proof genuine? Is the ADR substantive?
CI passing is necessary but not sufficient. The PR description carries the process-enforced evidence. The human reviewer checks both.
---
## 9. Meta-Design Principles
### Honest Enforcement Claims
The enforcement taxonomy explicitly acknowledges what machines can and cannot catch. This prevents the most dangerous failure mode in AI-assisted development: an agent believing the safety net is wider than it is, and taking shortcuts on that basis.
### Context Budget Awareness
Every orientation shortcut — context packs, service map, done examples, boundary checklists — exists because agents re-deriving the same context every session wastes tokens and introduces inconsistency. The readiness layer is an investment in reliable, fast agent orientation.
### Workflow as First-Class Citizen
The workflow is not just documented — it has a scored rubric, a CI gate, a matrix, dedicated skills, chatmodes, and concrete reference examples. "Done" has a precise definition with evidence requirements that apply equally to agents and humans.
### Institutional Memory
Gotchas are permanent records of real bugs. ADRs are permanent records of real decisions. Both are written in a format designed for agent consumption (indexed, structured, BECAUSE-claused). Every cycle leaves the codebase smarter than it found it.
### Layered Trust
Not everything can be automated. The project explicitly designates what requires CI (🤖), what requires a manual step (👤), and what requires agent discipline (✋). This layered trust model prevents both over-reliance on automation and the paralysis of trying to automate the unautomatable.
---
*This document describes the AI operational setup. It is intentionally generic — the patterns apply to any sufficiently complex software project operating with AI agent contributors.*