diff --git a/evidence/README.md b/evidence/README.md
new file mode 100644
index 0000000..fec4f76
--- /dev/null
+++ b/evidence/README.md
@@ -0,0 +1,215 @@
+# Evidence Directory
+
+Machine-readable process results for the KRAIKEN optimizer pipeline. All formulas
+(evolution, red-team, holdout, user-test) write structured JSON here.
+
+## Purpose
+
+- **Planner input** — the planner reads these files to decide next actions
+  (e.g. "last red-team showed IL vulnerability → trigger evolution").
+- **Diffable history** — `git log evidence/` shows how metrics change over time.
+- **Permanent record** — separate from `tmp/` which is ephemeral.
+
+## Directory Layout
+
+```
+evidence/
+  evolution/
+    YYYY-MM-DD.json          # run params, generation stats, best fitness, champion file
+  red-team/
+    YYYY-MM-DD.json          # per-attack results, floor held/broken, ETH extracted
+  holdout/
+    YYYY-MM-DD-prNNN.json    # per-scenario pass/fail, gate decision
+  user-test/
+    YYYY-MM-DD.json          # per-persona reports, screenshot refs, friction points
+```
+
+## Delivery Pattern
+
+Every formula follows the same three-step pattern:
+
+1. **Evidence file** → committed to `evidence/` on main
+2. **Git artifacts** (new code, attack vectors, evolved programs) → PR
+3. **Human summary** → issue comment with key metrics + link to evidence file
+
+---
+
+## Schema: `evolution/YYYY-MM-DD.json`
+
+Records one optimizer evolution run.
+
+```json
+{
+  "date": "YYYY-MM-DD",
+  "run_params": {
+    "generations": 50,
+    "population_size": 20,
+    "seed": 42,
+    "base_optimizer": "OptimizerV3"
+  },
+  "generation_stats": [
+    {
+      "generation": 1,
+      "best_fitness": -12.4,
+      "mean_fitness": -34.1,
+      "worst_fitness": -91.2
+    }
+  ],
+  "best_fitness": -8.7,
+  "champion_file": "onchain/src/OptimizerV4.sol",
+  "champion_commit": "abc1234",
+  "verdict": "improved"
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `date` | string (ISO) | Date of the run |
+| `run_params` | object | Input parameters used |
+| `generation_stats` | array | Per-generation fitness summary |
+| `best_fitness` | number | Best fitness score achieved (higher = better, i.e. smaller LM loss) |
+| `champion_file` | string | Repo-relative path to winning optimizer |
+| `champion_commit` | string | Git commit SHA of the champion (if promoted) |
+| `verdict` | string | `"improved"` or `"no_improvement"` |
+
+---
+
+## Schema: `red-team/YYYY-MM-DD.json`
+
+Records one adversarial red-team run against a candidate optimizer.
+
+```json
+{
+  "date": "YYYY-MM-DD",
+  "candidate": "OptimizerV3",
+  "optimizer_profile": "push3-default",
+  "lm_eth_before": 1000000000000000000000,
+  "lm_eth_after": 998500000000000000000,
+  "eth_extracted": 1500000000000000000,
+  "floor_held": false,
+  "verdict": "floor_broken",
+  "attacks": [
+    {
+      "strategy": "Flash buy + stake + recenter loop",
+      "pattern": "wrap → buy → stake → recenter_multi → sell",
+      "result": "DECREASED",
+      "delta_bps": -150,
+      "insight": "Rapid recenters pack ETH into floor while ratcheting it toward current price"
+    }
+  ]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `date` | string (ISO) | Date of the run |
+| `candidate` | string | Optimizer under test |
+| `optimizer_profile` | string | Named profile / push3 variant |
+| `lm_eth_before` | integer (wei) | LM total ETH at start |
+| `lm_eth_after` | integer (wei) | LM total ETH at end |
+| `eth_extracted` | integer (wei) | `lm_eth_before - lm_eth_after` (0 if floor held) |
+| `floor_held` | boolean | `true` if no ETH was extracted |
+| `verdict` | string | `"floor_held"` or `"floor_broken"` |
+| `attacks[].strategy` | string | Human-readable strategy name |
+| `attacks[].pattern` | string | Abstract op sequence (e.g. `wrap → buy → stake`) |
+| `attacks[].result` | string | `"DECREASED"`, `"HELD"`, or `"INCREASED"` |
+| `attacks[].delta_bps` | integer | LM ETH change in basis points |
+| `attacks[].insight` | string | Key finding from this strategy |
+
+---
+
+## Schema: `holdout/YYYY-MM-DD-prNNN.json`
+
+Records a holdout quality gate evaluation for a specific PR.
+
+```json
+{
+  "date": "YYYY-MM-DD",
+  "pr": 123,
+  "candidate_commit": "abc1234",
+  "scenarios": [
+    {
+      "name": "bear_market_crash",
+      "passed": true,
+      "lm_eth_delta_bps": 12,
+      "notes": ""
+    },
+    {
+      "name": "flash_buy_exploit",
+      "passed": false,
+      "lm_eth_delta_bps": -340,
+      "notes": "Floor broken on 2000-trade run"
+    }
+  ],
+  "scenarios_passed": 4,
+  "scenarios_total": 5,
+  "gate_passed": false,
+  "verdict": "fail",
+  "blocking_scenarios": ["flash_buy_exploit"]
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `date` | string (ISO) | Date of evaluation |
+| `pr` | integer | PR number being evaluated |
+| `candidate_commit` | string | Commit SHA under test |
+| `scenarios` | array | One entry per holdout scenario |
+| `scenarios[].name` | string | Scenario identifier |
+| `scenarios[].passed` | boolean | Whether LM ETH held or improved |
+| `scenarios[].lm_eth_delta_bps` | integer | LM ETH change in basis points |
+| `scenarios[].notes` | string | Free-text notes on failure mode |
+| `scenarios_passed` | integer | Count of passing scenarios |
+| `scenarios_total` | integer | Total scenarios run |
+| `gate_passed` | boolean | `true` if all required scenarios passed |
+| `verdict` | string | `"pass"` or `"fail"` |
+| `blocking_scenarios` | array of strings | Scenario names that caused failure |
+
+---
+
+## Schema: `user-test/YYYY-MM-DD.json`
+
+Records a UX evaluation run across simulated personas.
+
+```json
+{
+  "date": "YYYY-MM-DD",
+  "personas": [
+    {
+      "name": "crypto_native",
+      "task": "stake_and_set_tax_rate",
+      "completed": true,
+      "friction_points": [],
+      "screenshot_refs": ["tmp/screenshots/crypto_native_stake.png"],
+      "notes": ""
+    },
+    {
+      "name": "defi_newcomer",
+      "task": "first_buy_and_stake",
+      "completed": false,
+      "friction_points": ["Tax rate slider label unclear", "No confirmation of stake tx"],
+      "screenshot_refs": ["tmp/screenshots/defi_newcomer_confused.png"],
+      "notes": "User abandoned at tax rate step"
+    }
+  ],
+  "personas_completed": 1,
+  "personas_total": 2,
+  "critical_friction_points": ["Tax rate slider label unclear"],
+  "verdict": "fail"
+}
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `date` | string (ISO) | Date of evaluation |
+| `personas` | array | One entry per simulated persona |
+| `personas[].name` | string | Persona identifier |
+| `personas[].task` | string | Task the persona attempted |
+| `personas[].completed` | boolean | Whether the task was completed |
+| `personas[].friction_points` | array of strings | UX issues encountered |
+| `personas[].screenshot_refs` | array of strings | Repo-relative paths to screenshots |
+| `personas[].notes` | string | Free-text observations |
+| `personas_completed` | integer | Count of personas who completed their task |
+| `personas_total` | integer | Total personas evaluated |
+| `critical_friction_points` | array of strings | Friction points that blocked task completion |
+| `verdict` | string | `"pass"` if all personas completed, `"fail"` otherwise |
diff --git a/evidence/evolution/.gitkeep b/evidence/evolution/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/evidence/holdout/.gitkeep b/evidence/holdout/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/evidence/red-team/.gitkeep b/evidence/red-team/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/evidence/user-test/.gitkeep b/evidence/user-test/.gitkeep
new file mode 100644
index 0000000..e69de29
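
The red-team schema in this patch ties several fields together: `eth_extracted` is derived from the two LM balances, `floor_held` follows from `eth_extracted`, and `verdict` follows from `floor_held`. A minimal sketch of how a consumer such as the planner might check those relationships before trusting a file is shown here. The names `check_red_team`, `ALLOWED_RESULTS`, and `WEI_PER_ETH` are hypothetical and not part of the patch, and clamping a net gain to zero extraction is an assumption drawn from the "0 if floor held" wording.

```python
import json

WEI_PER_ETH = 10**18
ALLOWED_RESULTS = {"DECREASED", "HELD", "INCREASED"}


def check_red_team(record):
    """Return a list of consistency problems found in one red-team record."""
    problems = []
    before = record["lm_eth_before"]
    after = record["lm_eth_after"]

    # Assumption: a net gain still counts as "floor held", so extraction is clamped at 0.
    expected_extracted = max(before - after, 0)
    if record["eth_extracted"] != expected_extracted:
        problems.append(f"eth_extracted should be {expected_extracted}")

    # floor_held is documented as true exactly when no ETH was extracted.
    if record["floor_held"] != (record["eth_extracted"] == 0):
        problems.append("floor_held disagrees with eth_extracted")

    expected_verdict = "floor_held" if record["floor_held"] else "floor_broken"
    if record["verdict"] != expected_verdict:
        problems.append(f"verdict should be {expected_verdict!r}")

    for index, attack in enumerate(record.get("attacks", [])):
        if attack["result"] not in ALLOWED_RESULTS:
            problems.append(f"attacks[{index}].result is not a known value")
    return problems


# Values mirror the README example: 1000 ETH before, 998.5 ETH after.
sample = {
    "date": "YYYY-MM-DD",
    "candidate": "OptimizerV3",
    "optimizer_profile": "push3-default",
    "lm_eth_before": 1000 * WEI_PER_ETH,
    "lm_eth_after": 9985 * WEI_PER_ETH // 10,
    "eth_extracted": 15 * WEI_PER_ETH // 10,
    "floor_held": False,
    "verdict": "floor_broken",
    "attacks": [
        {
            "strategy": "Flash buy + stake + recenter loop",
            "pattern": "wrap → buy → stake → recenter_multi → sell",
            "result": "DECREASED",
            "delta_bps": -150,
            "insight": "Rapid recenters pack ETH into floor",
        }
    ],
}

# Round-trip through JSON text; Python's json module keeps wei-sized integers exact.
reloaded = json.loads(json.dumps(sample))
print(check_red_team(reloaded))
```

Wei amounts in these files exceed 2^53, so a reader that parses JSON numbers as 64-bit floats may silently round them. Python's `json` module reads them as arbitrary-precision integers, which is why the equality checks above are exact.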