A little while ago I shared that I built my own cyber security team, but I did not share how. I've come to correct that.
**I'm NOT a cyber security professional or software developer.**
-----------------------------------------------------
First, why I did this:
A bit of philosophy on where I believe we are going, which I think aligns with LESSON 3.3: The Real Cost of Knowledge, about the hyper-accessibility of AI and its low cost at a decentralized level. Reducing the cost of AI lowers the barrier to entry, which is great. And just as cost is being lowered, the knowledge required to attempt to build software is being lowered too (note I said attempt).
As it gets easier to try to build things, people will seek tools and the path of least friction. Ask your AI where it can get tools to be better, and two sites might come up near the top.
Now, we all think this is great; more access is amazing, and IT IS. However, as Uncle Ben told Peter Parker (Spider-Man), "With Great Power Comes Great Responsibility."
So in a world where everything you need to build anything is à la carte and easy to get and deploy, I foresee the user bases of both GitHub and Hugging Face growing like TikTok's or Instagram's.
Again, that's great, except for one thing: we are not sharing reels, we are sharing code that can be used for GOOD or BAD. As laymen like myself enter these deeply technical waters, the folks with ill intent who wish to make fast money off a malicious download will be increasing their efforts.
Why would malicious code and cyber attacks increase on GitHub or Hugging Face? Because they will be "target-rich environments" due to the low knowledge barrier to entry. Think of scams: as tech has evolved, they have too, and they always will be there.
Now, none of this is meant to fear-monger or discourage us from pushing forward and building all the things that can and will make our communities and lives better.
It is to say we need to become more educated, which is why I am here learning from this community, Jake, and the other thought leaders.
They generously give their time and more.
So here is my small contribution.
-----------------------------------------------------------------
With Claude Code and some help from a friend in the cyber security space, I built Harpocrates (named for the Greek god of silence, secrets, and confidentiality).
I've also attached a PowerPoint and a zip file with all the markdown files (they are sanitized so they don't contain my folder paths).
----------------------------------------------------------------
Implementation Guide: The Harpocrates Security Layer
1. Executive Summary
The system detailed in this guide is an adapted Harpocrates-style security architecture designed to wrap an AI coding tool (specifically Claude Code) in multiple layers of preventative and containment-based controls. It is a local developer safety net that aims to prevent autonomous AI agents from performing destructive actions, leaking sensitive credentials, or compromising the integrity of the local workstation during an AI-assisted coding session.
The primary objective of the system is to establish a fail-closed environment where every tool call made by an AI agent is scrutinized. This is achieved through a multi-layered stack: synchronous PreToolUse hooks for immediate pattern matching, an asynchronous multi-agent review pipeline for contextual analysis, an immutable supervisor for quality control, container hardening to reduce the blast radius of any potential escape, and a credential sandbox to keep high-value secrets entirely out of the agent's reach.
Scope of Protection: The system protects a single, local AI coding session on the specific machine where it is deployed. It effectively blocks tool calls that violate defined rules, catches complex risks via review pipelines, and limits environmental damage through kernel-level isolation.
What the System Does Not Protect:
- Remote Environments: It has no reach into CI/CD pipelines (GitHub Actions, GitLab Runners, etc.).
- Cloud Infrastructure: It cannot prevent the use of an already-obtained API key to call cloud services directly; the cloud service remains unaware of these local hooks.
- Collaborative Sessions: It does not protect the sessions of teammates or the security of shared source code on other machines.
- Compliance Posture: It is a personal safety tool, not a substitute for organizational security controls required by auditors (SOC 2, HIPAA, etc.).
- Input-based Attacks: It does not protect against social engineering or prompt injection into the session context; it governs the agent's actions, not its thinking.
Capability Requirements: To successfully implement this system, a developer must possess a high level of comfort with Linux systems and standard development utilities. Required skills include:
- Proficiency in Bash scripting (specifically the use of set -euo pipefail and error-handling traps).
- The ability to read, write, and validate nested JSON configuration files.
- Familiarity with Docker Engine basics (running containers, mounting volumes, and inspecting state).
- A conceptual understanding of "hooks" as event-driven functions.
- Basic knowledge of Linux kernel capabilities and the decomposition of root privileges.
- Access to a Linux workstation (ideally Ubuntu 24.04 LTS) with root or sudo permissions.
2. Reading-Order Recommendation
The build guide is structured into ten sections (00 through 99), which should be consumed and implemented in a specific sequence to ensure the foundation of each layer is solid before adding subsequent complexity.
| Order | Section | Rationale for Sequence |
| --- | --- | --- |
| 1 | 00-Prerequisites | Establishes the environment and ensures the developer has the necessary tools (Docker Engine, Python stdlib) and skills. Everything follows this environmental baseline. |
| 2 | 01-Minimum Viable System | Introduces the most basic enforcement mechanism: the synchronous PreToolUse hook. This is the irreducible core of the Harpocrates pattern. |
| 3 | 02-Agent Review Pipeline | Builds the "slow path" for the hooks created in Section 01. It introduces the dispatch-to-review concept for handling ambiguous cases. |
| 4 | 03-Supervisor and Immutable Values | Adds the grading layer. Once you have reviewers (Section 02), you need a way to ensure they do not drift or fail. This adds meta-oversight. |
| 5 | 04-Container Hardening | Shifts from prevention to containment. It assumes the previous layers (01-03) might miss a novel threat and limits the "blast radius" within the runtime. |
| 6 | 05-Credential Sandbox | Focuses on the highest-value targets: secrets. It reuses isolation principles from Section 04 to keep credentials away from the agent's process. |
| 7 | 06-Verification and Ground Truth | Provides the testing harness for all previous layers. You cannot verify the system works until you have built the system (01-05). |
| 8 | 08-Containerized Security Tooling | Isolates the watchers from the watched. By moving the security tools into their own container, you protect the security logic from an agent that might compromise the host. |
| 9 | 07-Safe-Install Pipeline | Addresses supply chain risk. This is the most "fingerprint-sensitive" layer and is implemented last so the developer has mastered all other defensive disciplines first. |
| 10 | 99-Resource Inventory | A final audit of all dependencies used in the build to ensure no unapproved third-party tools have entered the environment. |
Dependencies: Section 01 depends on the environment from 00. Section 02 depends on the dispatch hooks from 01. Section 03 depends on the findings generated in 02. Sections 04 and 08 share kernel-level hardening concepts. Section 07 leverages the multi-agent review logic from 02 and 03.
3. Section-by-Section Walkthrough
Section 00: Prerequisites
- Principle: A security system is only as reliable as the environment it is built upon and the builder's understanding of its limitations.
- Design Questions:
  - Am I using native Docker Engine to avoid the unpredictable isolation layers of Docker Desktop?
  - Is my environment strictly using the Python standard library to minimize supply chain audit surface?
- Failure Mode: "Fighting the environment." Using incorrect runtimes (like WSL 2 without careful configuration) leads to false positives in capability enforcement and network isolation.
- Self-Test Items (a quick shell version of these checks follows this section):
  - [ ] Test 1: Verify version strings for shell and core utilities (bash, grep, openssl, etc.).
  - [ ] Test 2: Verify Python 3.10+ and standard library availability.
  - [ ] Test 3: Confirm native Docker Engine 27.x+ is running and functional.
  - [ ] Test 4: Launch Claude Code and verify it is operational.
  - [ ] Test 5: Confirm root or sudo access is correctly configured.
- Reflection Question: Can I describe exactly which assets on my machine this system will not protect once I finish the build?
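Most of that checklist can be run straight from a shell. Here is one quick sketch; the version thresholds are the guide's, and the exact commands are just one way to check them:

```bash
# Quick environment self-test. Each line should succeed and print a
# version at or above the guide's baseline.
bash --version | head -n 1
grep --version | head -n 1
openssl version
python3 -c 'import sys; assert sys.version_info >= (3, 10), sys.version'
docker version --format 'Docker Engine {{.Server.Version}}'   # expect 27.x+
sudo -v    # exits non-zero if sudo access is not configured
```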
Section 01: The Minimum Viable System
- Principle: Every security enforcement mechanism must fail closed, treating any internal error or crash as a "block" decision.
- Design Questions (see the sketch below):
  - Does my hook script begin with set -euo pipefail to ensure strict error handling?
  - Do I initialize my decision variable to "block" by default before running any checks?
- Failure Mode: "Silent failure open." A hook that continues execution after a regex error or a missing variable could allow a dangerous command to proceed.
- Self-Test Items:
  - [ ] Confirm allow-all.sh logs tool calls successfully.
  - [ ] Verify block-rm-root.sh passes five shell unit tests (blocking recursive root deletes).
  - [ ] Observe the hook successfully blocking a test command in a live Claude Code session.
  - [ ] Verify timestamps and reasons are present in the hook log file.
- Reflection Question: Why is it safer to test my hooks with piped JSON in the shell rather than testing them through the AI agent first?
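To make the fail-closed pattern concrete, here is a minimal sketch of a hook in this style. It is illustrative only: the stdin-JSON input, the log path, and the non-zero-exit-means-block convention are assumptions, so consult the official Claude Code documentation for the exact hook contract on your version.

```bash
#!/usr/bin/env bash
# Minimal fail-closed PreToolUse hook (illustrative sketch).
# Assumed contract: the proposed tool call arrives as JSON on stdin,
# and a non-zero exit code tells the agent runtime to block the call.
set -euo pipefail

LOG="$HOME/.harpocrates/hook.log"   # hypothetical log location
mkdir -p "$(dirname "$LOG")"

DECISION="block"                    # fail closed: start from "block"
REASON="default-deny"
INPUT="$(cat)"                      # the proposed tool call

# Hard block: recursive deletes at or near the filesystem root.
if printf '%s' "$INPUT" | grep -Eq 'rm +-[a-zA-Z]*r[a-zA-Z]* +/'; then
  REASON="recursive delete against /"
# Hard block: obvious credential file reads.
elif printf '%s' "$INPUT" | grep -Eq '\.aws/credentials|\.ssh/id_'; then
  REASON="credential file access"
else
  DECISION="allow"
  REASON="no rule matched"
fi

printf '%s decision=%s reason=%s\n' "$(date -Is)" "$DECISION" "$REASON" >> "$LOG"

# The exit status is the verdict; any crash above also exits non-zero (block).
[ "$DECISION" = "allow" ]
```

Because the script aborts on the first error and the verdict starts at "block", a regex typo or an unwritable log directory produces a block, never a silent allow.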
Section 02: The Agent Review Pipeline
- Principle: Complex or ambiguous actions require specialized, multi-lens review rather than simple pattern matching.
- Design Questions (see the sketch below):
  - Is my review queue using atomic filesystem operations (tempfile + rename) to avoid race conditions?
  - Does my hook implement a mandatory timeout that defaults to a "block" verdict if the review agents take too long?
- Failure Mode: "Specialist blind spots." Relying on a single general-purpose reviewer allows risks to slip through that a specialized auditor (e.g., a Data Sensitivity Reviewer) would catch.
- Self-Test Items:
  - [ ] Verify the review queue directory layout exists and is writable.
  - [ ] Manually drop a request JSON and confirm a verdict JSON appears (testing filesystem plumbing).
  - [ ] Confirm the hook's third decision branch (dispatch) successfully polls for verdicts.
  - [ ] Verify that a timed-out review results in a BLOCK.
- Reflection Question: In the context of the Calendar API example, why should the "Endpoint Surface Analyst" be forbidden from looking at the "User Intent" data?
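As one way to picture the dispatch branch, here is a sketch assuming a directory-based queue; the layout, file naming, and 30-second budget are all illustrative assumptions:

```bash
#!/usr/bin/env bash
# Dispatch sketch: enqueue a review request atomically, then poll for a
# verdict with a hard timeout that fails closed.
set -euo pipefail

QUEUE="$HOME/.harpocrates/queue"    # hypothetical queue root
REQ_ID="$(date +%s)-$$"
mkdir -p "$QUEUE/requests" "$QUEUE/verdicts"

# Atomic enqueue: write to a temp file, then rename. rename(2) is atomic
# within a filesystem, so the worker never sees a partial request.
TMP="$(mktemp "$QUEUE/requests/.tmp.XXXXXX")"
cat > "$TMP"                        # request JSON arrives on stdin
mv "$TMP" "$QUEUE/requests/$REQ_ID.json"

# Poll for a verdict; if the reviewers are too slow, block.
DEADLINE=$(( $(date +%s) + 30 ))
while [ "$(date +%s)" -lt "$DEADLINE" ]; do
  VERDICT_FILE="$QUEUE/verdicts/$REQ_ID.json"
  if [ -f "$VERDICT_FILE" ]; then
    # Only an explicit allow passes; anything else is a block.
    grep -q '"verdict"[[:space:]]*:[[:space:]]*"allow"' "$VERDICT_FILE" && exit 0
    exit 1
  fi
  sleep 1
done
exit 1                              # timeout => fail closed
```

The mv is the load-bearing step: because the rename is atomic, the review worker can never read a half-written request, which is what closes the race-condition design question.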
Section 03: The Supervisor and Immutable Values
- Principle: The grading standard for security reviewers must be structurally isolated and immutable to prevent learning drift and adversarial corruption.
- Design Questions (see the sketch below):
  - Is the Supervisor isolated from raw tool call data to prevent prompt injection from reaching the grading layer?
  - Are the Supervisor's core values stored in a read-only file that cannot be modified by any agent process?
- Failure Mode: "Value drift." Without a supervisor, a review pipeline's judgment will slowly shift toward over-confidence or generic hedging as it encounters new, uncalibrated cases.
- Self-Test Items:
  - [ ] Confirm the values file is read-only in the Supervisor's runtime context.
  - [ ] Verify the Supervisor prompt includes an anti-instruction-injection clause.
  - [ ] Run the grading harness and confirm it produces a structured report card from a batch of verdicts.
  - [ ] Verify that no path exists for raw tool call data to reach the Supervisor.
- Reflection Question: What is the difference between "learning what to look at" and "learning how to judge," and why is only the former allowed?
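A minimal way to satisfy the read-only requirement, assuming the values file lives on the host and the Supervisor runs in a container (paths and image name are hypothetical):

```bash
# Host side: nobody, including root-owned agent processes, can modify the
# values file without first removing the immutable attribute (which
# requires CAP_LINUX_IMMUTABLE).
chmod 444 ~/.harpocrates/supervisor/values.md
sudo chattr +i ~/.harpocrates/supervisor/values.md

# Supervisor runtime: a read-only bind mount, so even a compromised
# Supervisor process cannot rewrite its own grading standard.
docker run --rm \
  -v "$HOME/.harpocrates/supervisor/values.md:/values.md:ro" \
  supervisor-image cat /values.md
```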
Section 04: Container Hardening
- Principle: Hardening assumes that preventative layers will eventually fail, necessitating a reduced blast radius for any compromised workload.
- Design Questions (see the sketch below):
  - Have I dropped all kernel capabilities and added back only those with a specific, written justification?
  - Is my network egress enforcement configured as "deny-by-default" at a layer the container cannot control?
- Failure Mode: "Privilege escalation." A compromised container with default capabilities or the ability to spawn setuid binaries allows an attacker to pivot from the agent session to the host system.
- Self-Test Items:
  - [ ] List every added kernel capability alongside its justification.
  - [ ] Verify that the container cannot reach a non-allowlisted network destination.
  - [ ] Confirm security logs are marked with the append-only attribute from the host side.
  - [ ] Test the "no-new-privileges" flag by attempting to run a setuid binary inside the container.
- Reflection Question: Why must the append-only attribute on log files be set by the host rather than the container itself?
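A hardened launch in this spirit might look like the following sketch. The image name, the user-defined network, and the single re-added capability are assumptions; the point is that every --cap-add you keep should carry a written justification.

```bash
# Host side first: make the security log append-only so neither the
# container nor the agent can truncate or rewrite past entries.
sudo chattr +a ~/.harpocrates/logs/security.log

# Drop everything, re-add only what has a written justification, forbid
# privilege escalation, and keep the root filesystem read-only.
# "harpocrates-egress" stands in for a user-defined network sitting
# behind a deny-by-default egress allowlist.
docker run --rm \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --security-opt no-new-privileges \
  --network harpocrates-egress \
  --read-only \
  -v "$HOME/.harpocrates/logs:/logs" \
  agent-image

# Trust but verify: inspect what was actually applied.
docker inspect --format '{{.HostConfig.CapAdd}} {{.HostConfig.SecurityOpt}}' <container-id>
```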
Section 05: The Credential Sandbox
- Principle: AI agents should never see raw credentials; instead, they should interact with local proxies that inject secrets at the boundary.
- Design Questions (see the sketch below):
  - Are my credential categories defined by purpose (e.g., Cloud Provider, Git SSH) rather than just specific file paths?
  - Does my proxy correctly strip potentially echoed secrets from the remote API's response before returning it to the agent?
- Failure Mode: "Context exfiltration." If an agent sees a secret, that secret is now in its context window and may be sent to a remote inference service, expanding the credential's threat model.
- Self-Test Items:
  - [ ] Confirm the hook blocks access to at least five credential categories.
  - [ ] Verify the local proxy can successfully inject a token into a test request.
  - [ ] Confirm the agent's process environment does not contain high-value API keys.
  - [ ] Verify that the proxy fails closed if it cannot reach the credential store.
- Reflection Question: Why is "instruction-level" protection (telling the agent not to read secrets) insufficient compared to "hook-level" enforcement?
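Here is a sketch of the hook-side check for the five credential categories named in the self-test. The categories and regex patterns are illustrative, and they deliberately match purposes rather than single hard-coded paths:

```bash
#!/usr/bin/env bash
# Credential-category check a hook might run before any file read or
# command execution. Five illustrative categories, defined by purpose.
set -euo pipefail
INPUT="$(cat)"                       # proposed tool call as JSON on stdin

BLOCK_PATTERNS=(
  '\.aws/(credentials|config)'       # cloud provider profiles
  '\.ssh/id_[a-zA-Z0-9]+'            # Git SSH private keys
  '\.npmrc|_authToken'               # package registry tokens
  '\.env(\.|[^a-zA-Z]|$)'            # project environment files
  '(api|secret)_?key[[:space:]]*='   # inline key assignments
)

for pat in "${BLOCK_PATTERNS[@]}"; do
  if printf '%s' "$INPUT" | grep -Eiq "$pat"; then
    echo "BLOCK: credential category matched: $pat" >&2
    exit 1
  fi
done
exit 0
```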
Section 06: Verification and Ground Truth
- Principle: Security claims must be substantiated by positive evidence from isolated tests rather than the mere absence of detected incidents.
- Design Questions (see the sketch below):
  - Am I testing components like hooks and proxies in isolation, or am I erroneously testing them through the AI agent?
  - Is my ground-truth dataset curated by humans and documented with reasoning for every verdict?
- Failure Mode: "False confidence." Assuming the system works because no alerts have fired, when in reality the agent may simply not have attempted a dangerous action yet.
- Self-Test Items:
  - [ ] Run the direct hook tests and confirm pass/fail results for specific regex patterns.
  - [ ] Perform a proxy round-trip test using a fake upstream server.
  - [ ] Execute a calibration run for the Supervisor using ten human-graded cases.
  - [ ] Log the results of an adversarial probing session.
- Reflection Question: If my verification harness marks every case as "Pass," what should I do to ensure the harness itself hasn't failed open?
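A direct-test harness in that spirit, assuming one curated request JSON per case with a human-written .expected verdict beside it (the layout and file names are assumptions):

```bash
#!/usr/bin/env bash
# Direct hook tests: pipe curated JSON cases into the hook and compare the
# exit status against a human-written expected verdict.
set -euo pipefail
PASS=0; FAIL=0

for case_file in tests/cases/*.json; do
  expected="$(cat "${case_file%.json}.expected")"   # "allow" or "block"
  if ./hooks/pretooluse.sh < "$case_file" >/dev/null 2>&1; then
    actual="allow"
  else
    actual="block"
  fi
  if [ "$actual" = "$expected" ]; then
    PASS=$((PASS + 1))
  else
    FAIL=$((FAIL + 1))
    echo "FAIL: $case_file expected=$expected actual=$actual"
  fi
done

echo "passed=$PASS failed=$FAIL"
[ "$FAIL" -eq 0 ]    # the harness itself fails closed on any mismatch
```

Seeding the case set with known-bad inputs that must produce "block" is the easiest way to prove the harness itself has not failed open.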
Section 08: Containerizing the Security Tooling
- Principle: The watchers (security tools) must be isolated from the watched (agent environment) to prevent an attacker from modifying their own surveillance.
- Design Questions (see the sketch below):
  - Is the interface between the agent environment and the security container as narrow as possible?
  - Does the security container treat every inbound message as untrusted, potentially adversarial input?
- Failure Mode: "Watchman compromise." An attacker who compromises the agent host can read the hook logic and review prompts, crafting "invisible" tool calls that the system cannot detect.
- Self-Test Items:
  - [ ] Verify the security container is reachable only through a specific, validated port/protocol.
  - [ ] Confirm the Supervisor's values file is not accessible from the agent's host filesystem.
  - [ ] Test the fail-closed behavior by stopping the security container and attempting a tool call.
  - [ ] Verify that outbound verdicts do not leak internal reasoning (Reviewer findings).
- Reflection Question: Why is the narrow interface the most critical attack surface in this containerized design?
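The fail-closed half of that narrow interface might be sketched as follows, assuming the security container exposes a single HTTP endpoint (the port, path, timeout, and use of curl are all illustrative):

```bash
#!/usr/bin/env bash
# Forward the proposed tool call to the security container over the one
# validated channel; everything else is blocked, including "can't reach it".
set -euo pipefail

VERDICT="$(curl --silent --show-error --max-time 5 \
  --header 'Content-Type: application/json' \
  --data @- \
  http://127.0.0.1:8473/review || true)"   # request JSON read from stdin

case "$VERDICT" in
  *'"verdict":"allow"'*) exit 0 ;;   # explicit allow only
  *)                     exit 1 ;;   # block on deny, error, or silence
esac
```

Stopping the security container makes curl fail, VERDICT stays empty, and the case statement falls through to a block, which is exactly the behavior the self-test above demands.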
Section 07: The Safe-Install Pipeline
- Principle: Every new tool or dependency is a supply chain decision that must be made visible, deliberate, and auditable.
- Design Questions (see the sketch below):
  - Does stage 1 of my pipeline fetch the package without executing any post-install scripts?
  - Is the authorized override mechanism restricted to a specific human authority via a separate channel?
- Failure Mode: "Invisible supply chain poisoning." Allowing install commands to run unchecked permits malicious post-install scripts to execute with user privileges.
- Self-Test Items:
  - [ ] Successfully fetch a package into a quarantine area without executing scripts.
  - [ ] Confirm the multi-agent review (from Section 02) flags a package with suspicious permissions.
  - [ ] Verify an install is blocked if the verdict is "NO."
  - [ ] Test the expiration of a time-bounded authorized override.
- Reflection Question: Why must the implementation details of the override mechanism remain private even if its properties are public?
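For stage 1, the mainstream package managers can already fetch without executing lifecycle scripts. The package names and quarantine path below are placeholders:

```bash
#!/usr/bin/env bash
# Stage-1 fetch sketch: pull the artifact into quarantine without
# executing anything. Both commands only download; neither runs
# post-install scripts.
set -euo pipefail
QUARANTINE="$HOME/.harpocrates/quarantine"
mkdir -p "$QUARANTINE"

# Python: download the wheel/sdist only; nothing is installed or executed.
pip download --no-deps --dest "$QUARANTINE" some-package==1.2.3

# Node: fetch the published tarball; lifecycle scripts are suppressed.
npm pack some-package@1.2.3 --pack-destination "$QUARANTINE" --ignore-scripts
```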
Section 99: Resource Inventory
- Principle: The system's integrity depends on an audited, restricted list of dependencies with verifiable provenance.
- Design Questions:
  - Are there any tools in my build that are not explicitly listed in this inventory?
  - Can I justify the provenance of every Docker base image and OS utility used?
- Failure Mode: "Inventory drift." Installing "clever" third-party utilities that haven't been audited introduces new, unmonitored supply chain risks.
- Self-Test Items:
  - [ ] Cross-reference every shell script against the "Operating system utilities" list.
  - [ ] Confirm no pip install commands were used in the build.
  - [ ] Verify all documentation references used are from first-party vendors.
- Reflection Question: How does the "stdlib-only" rule for Python lead to a more secure system than using feature-rich third-party packages?
--------------------------------------------------------------------------------
4. Consolidated Test for Authorized Override
The authorized override mechanism (from Section 07) is the "break-glass" path for allowing installations that the automated pipeline would otherwise block. To be considered secure, an override must pass the following four-property test (a sketch of enforcing these properties follows the list):
- Time-Bounded: The override must expire automatically after a short, defined window.
  - Why it is necessary: Prevents an override from becoming a permanent backdoor in the system.
  - Failure if absent: An attacker could reuse an old authorization to install malicious software weeks later.
- Scoped: The override must apply only to a specific package name and version.
  - Why it is necessary: Ensures that permission for one tool does not grant permission for another.
  - Failure if absent: A single "ALLOW" could be used to smuggle in an entirely different, unreviewed malicious package.
- Logged: Every issuance, use, and expiration of an override must be recorded in the append-only log.
  - Why it is necessary: Provides an audit trail for incident response and accountability.
  - Failure if absent: If a compromise occurs, there is no way to determine who authorized the dangerous tool or when.
- Authority-Required: The override must be issued by a human with the specific permission to do so, verified through a channel the AI agent cannot access.
  - Why it is necessary: Distinguishes between the user (who might be the agent) and an authorized security arbiter.
  - Failure if absent: An AI agent could "authorize" its own dangerous installations by simply typing the override command.
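As a sketch of enforcing all four properties at install time, suppose each override is a small JSON file signed out-of-band by the human authority; the file format, paths, and key location are all assumptions for illustration:

```bash
#!/usr/bin/env bash
# Break-glass override check (illustrative). Blocks unless all four
# properties hold; any error along the way also blocks (fail closed).
set -euo pipefail
OVERRIDE="$HOME/.harpocrates/overrides/active.json"   # hypothetical path
PKG="$1"                                              # e.g. "some-package==1.2.3"

[ -f "$OVERRIDE" ] || { echo "BLOCK: no override on file"; exit 1; }

# Scoped: must name this exact package and version.
grep -Fq "\"package\": \"$PKG\"" "$OVERRIDE" || { echo "BLOCK: wrong scope"; exit 1; }

# Time-bounded: the expiry epoch must still be in the future.
EXPIRES="$(grep -Eo '"expires": *[0-9]+' "$OVERRIDE" | grep -Eo '[0-9]+')"
[ "$(date +%s)" -lt "$EXPIRES" ] || { echo "BLOCK: override expired"; exit 1; }

# Authority-required: verify a detached signature made with a key the
# agent cannot read (it lives outside the agent's filesystem view).
openssl dgst -sha256 -verify /security/authority.pub \
  -signature "$OVERRIDE.sig" "$OVERRIDE" || { echo "BLOCK: bad signature"; exit 1; }

# Logged: record the use in the append-only audit log.
echo "$(date -Is) override-used package=$PKG" >> /logs/override-audit.log
```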
--------------------------------------------------------------------------------
5. The Five Composing Layers
The system achieves defense in depth by composing five distinct layers. Each layer addresses a different stage of the threat lifecycle:
- PreToolUse Hook: This is the synchronous front door. It uses fast pattern matching (regex) to block obviously dangerous commands like recursive root deletes or credential file reads. It provides the first, cheapest filter against accidental damage.
- Review Pipeline: When a hook sees something ambiguous (e.g., a complex calendar API call), it dispatches to this asynchronous layer. Five specialized role agents (Endpoint, Data, Permission, Intent, and Consolidator) analyze the context to find risks that simple regex cannot catch.
- Supervisor: This layer oversees the reviewers. It uses immutable values and blind grading to ensure the agents in the review pipeline do not drift in their judgment. It catches systematic failures or "hallucinations" in the review logic over time.
- Container Hardening: This layer operates on the assumption that an attacker might bypass the first three layers. By dropping kernel capabilities, allowlisting egress, and enforcing append-only logs, it ensures that even a compromised agent session cannot easily pivot to the host or exfiltrate data.
- Credential Sandbox: This layer removes the primary motive for many attacks. By using a local proxy to inject secrets, the agent never actually "sees" the API keys. This means there is nothing to leak in the context window, even if the agent is perfectly social-engineered or compromised.
--------------------------------------------------------------------------------
6. What to Do if You Get Stuck
First-time builders are likely to encounter these specific failure modes. The guide provides the following resolutions:
- The Hook Isn't Firing: If you issue a tool call and the log file remains empty, your registration with Claude Code is likely incorrect. Action: Do not guess the configuration; consult the official Claude Code documentation for the exact PreToolUse registration keys for your version.
- The System Hangs During Review: This usually means the hook is waiting for a verdict that is never arriving. Action: Ensure your hook has a mandatory timeout. If the timeout expires, the hook must return a non-zero exit code to "fail closed." Check the review worker process to see if it is stuck.
- "Permission Denied" During Hardening: This often happens when a container tries to use a capability you dropped or reach a network destination not in the allowlist. Action: Use docker inspect to verify the applied capabilities. If a feature is broken, write a formal justification before adding the capability back; do not just add "root" privileges to solve the problem.
- Credential Injection Fails: If the AI agent receives a 401 Unauthorized from the proxy, the proxy may be failing to read the secret. Action: Check the proxy’s logs. Ensure the proxy has the filesystem permissions to read the credential store—permissions that the agent’s user should explicitly not have.
- The Supervisor's Grades Look Random: If the Supervisor's calibration curve is unstable, the prompt likely contains ambiguous language. Action: Grep your Supervisor prompt for qualifiers like "usually" or "if appropriate." Remove them. Values must be unconditional.
- Validation Errors in the Queue: A mismatch in JSON schemas between the hook (the writer) and the review worker (the reader) will crash the pipeline. Action: Use python3 -m json.tool to validate your JSON files manually. Ensure both the hook and worker agree on the exact field names.
--------------------------------------------------------------------------------
7. Glossary
- PreToolUse hook: A program registered with the AI coding tool that executes after the agent proposes a tool call but before the call is actually run, allowing the system to block the action.
- Fail-closed: A design principle where any failure, crash, or timeout in a security component results in the most restrictive outcome (a block) rather than allowing the action to proceed.
- Dispatch decision: The logic within a hook that decides to hand off a tool call to the slower, asynchronous review pipeline because it is too complex for simple regex judgment.
- Immutable values: A set of core principles for the Supervisor that are stored in a read-only file, ensuring the grading standard cannot be modified or corrupted by agentic drift or adversarial input.
- Blind grading: A verification process where the Supervisor is asked to grade a "calibration case" (where the correct answer is already known) without being told that it is being tested.
- Append-only logs: A filesystem-level enforcement that allows a file to be added to but never modified or truncated, protecting security evidence from being erased by an attacker.
- Narrow interface: A highly restricted communication channel between isolated environments (e.g., the agent host and the security container) that allows only specific, validated data types to cross the boundary.
- Ground truth: A curated, human-verified dataset of "correct" outcomes used to verify that the security system's components (hooks, reviewers, supervisors) are performing accurately.
- Adversarial probing: The practice of deliberately crafting unusual or "tricky" tool calls to find gaps in the security layer's coverage that standard tests might miss.
- Time-bounded authorized override: A temporary, scoped bypass of a security block that is granted by a human authority and expires automatically after a short duration.