harb/scripts/harb-evaluator/AGENTS.md

Agent Brief: harb-evaluator

The evaluator runtime executes formula-defined pipelines. Scripts in this directory handle stack lifecycle, scenario execution, evidence collection, and the adversarial agent harness.

Directory Layout

| File | Purpose |
| --- | --- |
| evaluate.sh | Holdout gate: worktree checkout → Docker stack → Playwright scenarios → teardown |
| red-team.sh | Adversarial agent runner: Anvil bootstrap → attack suite → Claude agent → evidence |
| run-protocol.sh | On-chain health snapshot (TVL, fees, positions, rebalances) via cast/forge |
| run-resources.sh | Infrastructure snapshot (disk, RAM, API budget, CI queue) via shell commands |
| bootstrap-light.sh | Lightweight Anvil bootstrap with contract deployment (used by red-team.sh) |
| promote-attacks.sh | Deduplicate and PR novel attack vectors discovered by the red-team agent |
| export-attacks.py | Extract cast send commands from agent stream log into .jsonl attack files |
| red-team-program.md | System prompt for the adversarial Claude agent |
| holdout.config.ts | Playwright config for holdout scenario execution |
| helpers/ | TypeScript helpers: RPC, assertions, swap, stake, floor, market, reporting, wallet |
| scenarios/ | Holdout scenario scripts and the passive-confidence suite |

Exit Code Convention

All evaluator scripts follow the same three-code contract:

| Code | Meaning |
| --- | --- |
| 0 | Success / gate passed |
| 1 | Gate failed (scenario or attack found a problem) |
| 2 | Infrastructure error (stack down, missing dependency, RPC unreachable) |

Formulas and the orchestrator rely on these codes for routing — do not introduce additional exit codes without updating the formula TOML.
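
As a minimal sketch of how a caller might route on this contract, the helper below maps an exit status to a label. The function names describe_exit_code and run_and_classify, and the label strings, are illustrative assumptions and not part of the evaluator codebase.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map an exit status from the three-code contract to a routing label.
# Anything outside 0/1/2 is reported as "unknown" so it is never silently
# treated as a pass.
describe_exit_code() {
  case "$1" in
    0) echo "passed" ;;
    1) echo "gate-failed" ;;
    2) echo "infra-error" ;;
    *) echo "unknown" ;;
  esac
}

# Run any command, capture its status without tripping `set -e`,
# and print the corresponding label.
run_and_classify() {
  local status=0
  "$@" || status=$?
  describe_exit_code "$status"
}

run_and_classify true                 # prints "passed"
run_and_classify false                # prints "gate-failed"
run_and_classify bash -c 'exit 2'     # prints "infra-error"
```

Treating unrecognised codes as a distinct case keeps a stray exit status from being routed as either a pass or a gate failure.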

Stack Lifecycle

Heavy formulas (run-holdout, run-red-team, run-evolution) need a running Anvil or full Docker stack. Port 8545 is shared — these formulas are mutually exclusive and must not run concurrently.

  • evaluate.sh manages a Docker Compose project named harb-eval-{pr} and registers a shell trap that tears it down fully.
  • red-team.sh uses bootstrap-light.sh for a lightweight Anvil-only stack (no Docker). Its cleanup is also trap-registered.
  • run-protocol.sh and run-resources.sh are lightweight — no Anvil, no Docker.
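
The trap-registered cleanup described above can be sketched roughly as follows. This is an illustration of the general pattern, not the code in evaluate.sh or red-team.sh: run_with_teardown and its stop_cmd argument are hypothetical, and echo stands in for a real teardown command.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run a workload in a subshell whose EXIT trap always executes the teardown
# command, whether the workload succeeds or fails.
#   $1      teardown command (hypothetical placeholder for the real stack teardown)
#   $2...   workload to run while the stack is up
run_with_teardown() {
  local stop_cmd="$1"
  shift
  (
    trap "$stop_cmd" EXIT
    "$@"
  )
}

# A failing workload still triggers teardown, and its status is preserved.
run_with_teardown 'echo "teardown: stack removed"' \
  bash -c 'echo "scenarios running"; exit 1' || echo "exit status: $?"
```

Registering the trap before the workload starts means the teardown runs even when the workload aborts early, and the workload's exit status still reaches the caller.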

Evidence Output

Every script writes its evidence file to evidence/{category}/{date}.json, conforming to the schema in evidence/README.md. The deliver step in each formula then commits the evidence and posts an issue comment.
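
A minimal sketch of the path convention, under stated assumptions: the {date} component is taken to be a UTC date in YYYY-MM-DD form, EVIDENCE_ROOT is a hypothetical override for the evidence directory, and the JSON payload is a placeholder. The real schema is defined in evidence/README.md.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build evidence/{category}/{date}.json. The date defaults to today's UTC date
# (assumed format: YYYY-MM-DD). EVIDENCE_ROOT is a hypothetical override.
evidence_path() {
  local category="$1"
  local day="${2:-$(date -u +%F)}"
  echo "${EVIDENCE_ROOT:-evidence}/${category}/${day}.json"
}

# Write a JSON payload to the evidence path, creating the category directory
# if needed, and print the path that was written.
write_evidence() {
  local category="$1" payload="$2" path
  path="$(evidence_path "$category" "${3:-}")"
  mkdir -p "$(dirname "$path")"
  printf '%s\n' "$payload" > "$path"
  echo "$path"
}

# Demo against a throwaway directory; "example" is a made-up category and the
# payload is a placeholder, not the real evidence schema.
EVIDENCE_ROOT="$(mktemp -d)"
write_evidence example '{"placeholder": true}' 2026-03-27
```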

Wallet Connection Helper

helpers/wallet.ts: connectWallet(page) handles the Playwright wallet connection flow. Key behaviours:

  • Detects auto-reconnect: if wagmi already reconnected from storage (.connect-button--connected visible within 1 s), returns immediately.
  • Opens the connectors panel via .connect-button--disconnected (10 s timeout — wagmi needs time to settle into disconnected state after page load).
  • Falls back to the mobile hamburger menu if the desktop button is not found.

Adding a New Evaluator Script

  1. Place the script in this directory. Use #!/usr/bin/env bash and set -euo pipefail.
  2. Follow the exit code convention (0 / 1 / 2).
  3. Accept configuration via environment variables, not positional args (the exception is evaluate.sh, which takes a PR number).
  4. Write evidence to evidence/{category}/{date}.json.
  5. Wire it into a formula TOML in formulas/ — see formulas/AGENTS.md for the full walkthrough.
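
Steps 1 to 4 can be sketched as a skeleton like the one below. It is a starting point under assumptions, not a template taken from the repository: run_evaluator and the environment variables CATEGORY, EVIDENCE_ROOT, REQUIRED_TOOL, and CHECK_CMD are hypothetical, and the JSON payload is a placeholder for the schema in evidence/README.md.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skeleton evaluator. All configuration comes from environment variables
# (hypothetical names), read at call time:
#   CATEGORY       evidence category directory
#   EVIDENCE_ROOT  root of the evidence tree
#   REQUIRED_TOOL  a dependency that must be on PATH
#   CHECK_CMD      stand-in for the real gate check
run_evaluator() {
  local category="${CATEGORY:-example}"
  local root="${EVIDENCE_ROOT:-evidence}"
  local tool="${REQUIRED_TOOL:-bash}"
  local check="${CHECK_CMD:-true}"

  # Exit code 2: infrastructure error (missing dependency).
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "missing dependency: $tool" >&2
    return 2
  fi

  # Exit code 1 if the gate check fails, 0 otherwise.
  local result="pass" status=0
  if ! bash -c "$check"; then
    result="fail"
    status=1
  fi

  # Write evidence on both pass and fail (placeholder payload, assumed date format).
  local out
  out="${root}/${category}/$(date -u +%F).json"
  mkdir -p "$(dirname "$out")"
  printf '{"result": "%s"}\n' "$result" > "$out"

  return "$status"
}

# A real script would end by calling: run_evaluator
```

Because set -e is active, ending the script with a bare run_evaluator call makes the script exit with whatever status the function returns.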