Merge pull request 'fix: Evidence directory structure for process results (#973)' (#994) from fix/issue-973 into master

Reviewed-on: https://codeberg.org/johba/harb/pulls/994
2026-03-19 09:55:27 +01:00
commit adc20733ce
parent bbf1d8f6e6 7dbee803fb
5 changed files with 215 additions and 0 deletions
--- a/evidence/README.md
+++ b/evidence/README.md
@@ -0,0 +1,215 @@
+# Evidence Directory
+
+Machine-readable process results for the KRAIKEN optimizer pipeline. All formulas
+(evolution, red-team, holdout, user-test) write structured JSON here.
+
+## Purpose
+
+- **Planner input** — the planner reads these files to decide next actions
+  (e.g. "last red-team showed IL vulnerability → trigger evolution").
+- **Diffable history** — `git log evidence/` shows how metrics change over time.
+- **Permanent record** — separate from `tmp/` which is ephemeral.
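
The planner example above can be sketched as a small decision function. This is only an illustration: `next_action` and the `"trigger_evolution"` label are hypothetical names, and the only thing taken from this README is the red-team `verdict` value `"floor_broken"`.

```python
from typing import Optional


def next_action(red_team_record: dict) -> Optional[str]:
    """Map a red-team evidence record to a (hypothetical) planner action."""
    # "floor_broken" is one of the two verdict values in the red-team schema.
    if red_team_record.get("verdict") == "floor_broken":
        return "trigger_evolution"  # hypothetical action label, not a real API
    return None  # nothing to do based on this record alone
```

A real planner would presumably weigh several evidence files at once; this sketch looks at a single record.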
+
+## Directory Layout
+
+```
+evidence/
+  evolution/
+    YYYY-MM-DD.json       # run params, generation stats, best fitness, champion file
+  red-team/
+    YYYY-MM-DD.json       # per-attack results, floor held/broken, ETH extracted
+  holdout/
+    YYYY-MM-DD-prNNN.json # per-scenario pass/fail, gate decision
+  user-test/
+    YYYY-MM-DD.json       # per-persona reports, screenshot refs, friction points
+```
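
Because every file name in the layout above starts with `YYYY-MM-DD`, the newest record in a directory can be found by sorting names. A minimal sketch, assuming a hypothetical helper `latest_evidence` and a caller that passes the path of the `evidence/` directory:

```python
from pathlib import Path
from typing import Optional


def latest_evidence(root: Path, formula: str) -> Optional[Path]:
    """Return the newest JSON file under root/<formula>/, or None if there is none."""
    # The fixed-width date prefix makes lexicographic order match date order.
    # Only *.json is matched, so placeholder files such as .gitkeep are ignored.
    files = sorted((root / formula).glob("*.json"))
    return files[-1] if files else None
```

For `holdout/`, several files can share a date; the order among those depends on the `prNNN` suffix as text, so `pr99` would sort after `pr100`.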
+
+## Delivery Pattern
+
+Every formula follows the same three-step pattern:
+
+1. **Evidence file** → committed to `evidence/` on `master`
+2. **Git artifacts** (new code, attack vectors, evolved programs) → PR
+3. **Human summary** → issue comment with key metrics + link to evidence file
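
Step 1 of the pattern above could look roughly like this. It is a sketch, not the pipeline's actual writer: `write_evidence` is a hypothetical name, and committing the file, opening the PR, and posting the issue comment are left out.

```python
import json
from pathlib import Path


def write_evidence(root: Path, formula: str, stem: str, record: dict) -> Path:
    """Write one record to root/<formula>/<stem>.json and return the path."""
    target = root / formula / f"{stem}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Fixed indentation and a trailing newline keep `git diff` output small,
    # which is what makes the evidence history diffable.
    text = json.dumps(record, indent=2, ensure_ascii=False) + "\n"
    target.write_text(text, encoding="utf-8")
    return target
```

Here `stem` would be `YYYY-MM-DD`, or `YYYY-MM-DD-prNNN` for holdout records.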
+
+---
+
+## Schema: `evolution/YYYY-MM-DD.json`
+
+Records one optimizer evolution run.
+
+```json
+{
+  "date": "YYYY-MM-DD",
+  "run_params": {
+    "generations": 50,
+    "population_size": 20,
+    "seed": 42,
+    "base_optimizer": "OptimizerV3"
+  },
+  "generation_stats": [
+    {
+      "generation": 1,
+      "best_fitness": -12.4,
+      "mean_fitness": -34.1,
+      "worst_fitness": -91.2
+    }
+  ],
+  "best_fitness": -8.7,
+  "champion_file": "onchain/src/OptimizerV4.sol",
+  "champion_commit": "abc1234",
+  "verdict": "improved"
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `date` | string (ISO) | Date of the run |
+| `run_params` | object | Input parameters used |
+| `generation_stats` | array | Per-generation fitness summary |
+| `best_fitness` | number | Best fitness score achieved (higher is better; a less negative value means a smaller loss for the LM) |
+| `champion_file` | string | Repo-relative path to winning optimizer |
+| `champion_commit` | string | Git commit SHA of the champion (if promoted) |
+| `verdict` | string | `"improved"` or `"no_improvement"` |
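
The field table above can be turned into a lightweight check before a record is committed. A minimal sketch, assuming a hypothetical helper `evolution_errors`; it treats `champion_commit` as optional because the table marks it "if promoted", which is an interpretation rather than a stated rule.

```python
# Required fields and their JSON types, taken from the table above.
EVOLUTION_FIELDS = {
    "date": str,
    "run_params": dict,
    "generation_stats": list,
    "best_fitness": (int, float),
    "champion_file": str,
    "verdict": str,
}
EVOLUTION_VERDICTS = {"improved", "no_improvement"}


def evolution_errors(record: dict) -> list:
    """Return a list of problems found in an evolution record (empty if none)."""
    errors = []
    for field, expected in EVOLUTION_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    if record.get("verdict") not in EVOLUTION_VERDICTS:
        errors.append("verdict must be 'improved' or 'no_improvement'")
    return errors
```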
+
+---
+
+## Schema: `red-team/YYYY-MM-DD.json`
+
+Records one adversarial red-team run against a candidate optimizer.
+
+```json
+{
+  "date": "YYYY-MM-DD",
+  "candidate": "OptimizerV3",
+  "optimizer_profile": "push3-default",
+  "lm_eth_before": 1000000000000000000000,
+  "lm_eth_after": 998500000000000000000,
+  "eth_extracted": 1500000000000000000,
+  "floor_held": false,
+  "verdict": "floor_broken",
+  "attacks": [
+    {
+      "strategy": "Flash buy + stake + recenter loop",
+      "pattern": "wrap → buy → stake → recenter_multi → sell",
+      "result": "DECREASED",
+      "delta_bps": -15,
+      "insight": "Rapid recenters pack ETH into floor while ratcheting it toward current price"
+    }
+  ]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `date` | string (ISO) | Date of the run |
+| `candidate` | string | Optimizer under test |
+| `optimizer_profile` | string | Named profile / push3 variant |
+| `lm_eth_before` | integer (wei) | LM total ETH at start |
+| `lm_eth_after` | integer (wei) | LM total ETH at end |
+| `eth_extracted` | integer (wei) | `lm_eth_before - lm_eth_after` (0 if floor held) |
+| `floor_held` | boolean | `true` if no ETH was extracted |
+| `verdict` | string | `"floor_held"` or `"floor_broken"` |
+| `attacks[].strategy` | string | Human-readable strategy name |
+| `attacks[].pattern` | string | Abstract op sequence (e.g. `wrap → buy → stake`) |
+| `attacks[].result` | string | `"DECREASED"`, `"HELD"`, or `"INCREASED"` |
+| `attacks[].delta_bps` | integer | LM ETH change in basis points |
+| `attacks[].insight` | string | Key finding from this strategy |
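
The derived red-team fields follow from the two balances. The sketch below is an illustration under stated assumptions: the function names are hypothetical, a balance that grew is clamped to `eth_extracted = 0`, and the basis-point value uses floor division because the README does not state a rounding rule. Python integers are arbitrary precision, so wei amounts above 2^53 stay exact.

```python
def red_team_summary(lm_eth_before: int, lm_eth_after: int) -> dict:
    """Derive eth_extracted, floor_held and verdict from the two wei balances."""
    # eth_extracted is lm_eth_before - lm_eth_after, and 0 if the floor held.
    extracted = max(lm_eth_before - lm_eth_after, 0)
    floor_held = extracted == 0
    return {
        "eth_extracted": extracted,
        "floor_held": floor_held,
        "verdict": "floor_held" if floor_held else "floor_broken",
    }


def delta_bps(lm_eth_before: int, lm_eth_after: int) -> int:
    """LM ETH change in basis points (1 bp = 0.01%), using integer math only."""
    if lm_eth_before <= 0:
        raise ValueError("lm_eth_before must be positive")
    # Floor division rounds toward negative infinity (an assumed convention).
    return (lm_eth_after - lm_eth_before) * 10_000 // lm_eth_before
```

With the balances from the example record (1000 ETH before, 998.5 ETH after), this gives 1.5 ETH extracted and a change of -15 basis points.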
+
+---
+
+## Schema: `holdout/YYYY-MM-DD-prNNN.json`
+
+Records a holdout quality gate evaluation for a specific PR.
+
+```json
+{
+  "date": "YYYY-MM-DD",
+  "pr": 123,
+  "candidate_commit": "abc1234",
+  "scenarios": [
+    {
+      "name": "bear_market_crash",
+      "passed": true,
+      "lm_eth_delta_bps": 12,
+      "notes": ""
+    },
+    {
+      "name": "flash_buy_exploit",
+      "passed": false,
+      "lm_eth_delta_bps": -340,
+      "notes": "Floor broken on 2000-trade run"
+    }
+  ],
+  "scenarios_passed": 4,
+  "scenarios_total": 5,
+  "gate_passed": false,
+  "verdict": "fail",
+  "blocking_scenarios": ["flash_buy_exploit"]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `date` | string (ISO) | Date of evaluation |
+| `pr` | integer | PR number being evaluated |
+| `candidate_commit` | string | Commit SHA under test |
+| `scenarios` | array | One entry per holdout scenario |
+| `scenarios[].name` | string | Scenario identifier |
+| `scenarios[].passed` | boolean | Whether LM ETH held or improved |
+| `scenarios[].lm_eth_delta_bps` | integer | LM ETH change in basis points |
+| `scenarios[].notes` | string | Free-text notes on failure mode |
+| `scenarios_passed` | integer | Count of passing scenarios |
+| `scenarios_total` | integer | Total scenarios run |
+| `gate_passed` | boolean | `true` if all required scenarios passed |
+| `verdict` | string | `"pass"` or `"fail"` |
+| `blocking_scenarios` | array of strings | Scenario names that caused failure |
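
The aggregate holdout fields can be derived from the `scenarios` array. A minimal sketch, assuming a hypothetical helper `holdout_gate` and assuming every scenario is required (the schema has no per-scenario "required" flag, so a real gate may be more selective):

```python
def holdout_gate(scenarios: list) -> dict:
    """Compute the summary fields of a holdout record from its scenarios."""
    # Any failing scenario blocks the gate under the all-required assumption.
    blocking = [s["name"] for s in scenarios if not s["passed"]]
    gate_passed = not blocking
    return {
        "scenarios_passed": len(scenarios) - len(blocking),
        "scenarios_total": len(scenarios),
        "gate_passed": gate_passed,
        "verdict": "pass" if gate_passed else "fail",
        "blocking_scenarios": blocking,
    }
```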
+
+---
+
+## Schema: `user-test/YYYY-MM-DD.json`
+
+Records a UX evaluation run across simulated personas.
+
+```json
+{
+  "date": "YYYY-MM-DD",
+  "personas": [
+    {
+      "name": "crypto_native",
+      "task": "stake_and_set_tax_rate",
+      "completed": true,
+      "friction_points": [],
+      "screenshot_refs": ["tmp/screenshots/crypto_native_stake.png"],
+      "notes": ""
+    },
+    {
+      "name": "defi_newcomer",
+      "task": "first_buy_and_stake",
+      "completed": false,
+      "friction_points": ["Tax rate slider label unclear", "No confirmation of stake tx"],
+      "screenshot_refs": ["tmp/screenshots/defi_newcomer_confused.png"],
+      "notes": "User abandoned at tax rate step"
+    }
+  ],
+  "personas_completed": 1,
+  "personas_total": 2,
+  "critical_friction_points": ["Tax rate slider label unclear"],
+  "verdict": "fail"
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `date` | string (ISO) | Date of evaluation |
+| `personas` | array | One entry per simulated persona |
+| `personas[].name` | string | Persona identifier |
+| `personas[].task` | string | Task the persona attempted |
+| `personas[].completed` | boolean | Whether the task was completed |
+| `personas[].friction_points` | array of strings | UX issues encountered |
+| `personas[].screenshot_refs` | array of strings | Repo-relative paths to screenshots |
+| `personas[].notes` | string | Free-text observations |
+| `personas_completed` | integer | Count of personas who completed their task |
+| `personas_total` | integer | Total personas evaluated |
+| `critical_friction_points` | array of strings | Friction points that blocked task completion |
+| `verdict` | string | `"pass"` if all personas completed, `"fail"` otherwise |
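
The user-test counts and verdict follow the rule in the last table row. A minimal sketch with a hypothetical helper `user_test_summary`; `critical_friction_points` is left out because the schema does not say how a friction point is classified as critical.

```python
def user_test_summary(personas: list) -> dict:
    """Compute personas_completed, personas_total and verdict for a user-test run."""
    completed = sum(1 for p in personas if p["completed"])
    total = len(personas)
    return {
        "personas_completed": completed,
        "personas_total": total,
        # "pass" only if all personas completed their task; note that an
        # empty persona list would also yield "pass" here.
        "verdict": "pass" if completed == total else "fail",
    }
```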
--- a/evidence/evolution/.gitkeep
+++ b/evidence/evolution/.gitkeep
--- a/evidence/holdout/.gitkeep
+++ b/evidence/holdout/.gitkeep
--- a/evidence/red-team/.gitkeep
+++ b/evidence/red-team/.gitkeep
--- a/evidence/user-test/.gitkeep
+++ b/evidence/user-test/.gitkeep