johba b883cde275 evidence: fix red-team baseline — accurate per-attack measurements
Addresses REQUEST_CHANGES review on PR #1065:

1. candidate: "Optimizer" (matches DeployLocal.sol deployment)
2. optimizer_profile: "default" (not push3-default — base Optimizer)
3. candidate_commit: master HEAD SHA for reproducibility
4. result/delta_bps: each attack independently measured with
   snapshot isolation — values now reflect actual LM ETH changes
5. Floor Ratchet attack tested: INCREASED +1179 bps. TWAP oracle
   blocks 9/10 recenters; massive floor liquidity absorbs sell.
6. lm_eth values as strings to avoid JS safe-integer truncation
7. lm_eth_before = lm_eth_after (attacks reverted between tests)

Re: #1058

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 06:31:33 +00:00

Evidence Directory

Machine-readable process results for the KRAIKEN optimizer pipeline. All formulas (evolution, red-team, holdout, user-test) write structured JSON here.

Purpose

  • Planner input: the planner reads these files to decide next actions (e.g. "last red-team showed IL vulnerability → trigger evolution").
  • Diffable history: git log evidence/ shows how metrics change over time.
  • Permanent record: kept separate from tmp/, which is ephemeral.
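
As an illustration of the planner-input use case, here is a minimal Python sketch. The function names, the returned action strings, and the single rule (a broken floor triggers evolution) are assumptions made for this example; the actual planner logic is not described here.

```python
import json
from pathlib import Path
from typing import Optional


def latest_evidence(evidence_dir: Path, formula: str) -> Optional[dict]:
    """Return the newest JSON record for a formula, or None if there is none."""
    files = sorted((evidence_dir / formula).glob("*.json"))
    if not files:
        return None
    # File names start with YYYY-MM-DD, so lexical order is chronological.
    return json.loads(files[-1].read_text(encoding="utf-8"))


def next_action(red_team: Optional[dict]) -> str:
    """Hypothetical planner rule: a broken floor triggers an evolution run."""
    if red_team is None:
        return "run_red_team"
    if red_team.get("verdict") == "floor_broken":
        return "trigger_evolution"
    return "no_action"
```

A caller might combine the two as next_action(latest_evidence(Path("evidence"), "red-team")).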

Directory Layout

evidence/
  evolution/
    YYYY-MM-DD.json       # run params, generation stats, best fitness, champion file
  red-team/
    YYYY-MM-DD.json       # per-attack results, floor held/broken, ETH extracted
  holdout/
    YYYY-MM-DD-prNNN.json # per-scenario pass/fail, gate decision
  user-test/
    YYYY-MM-DD.json       # per-persona reports, screenshot refs, friction points
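
The naming scheme above can be sketched as a small path helper. This is only an illustration: the function name is hypothetical, and it assumes the NNN in prNNN is the plain PR number without zero padding.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Optional

FORMULAS = ("evolution", "red-team", "holdout", "user-test")


def evidence_path(formula: str, run_date: date, pr: Optional[int] = None) -> PurePosixPath:
    """Build the repo-relative evidence path for one run."""
    if formula not in FORMULAS:
        raise ValueError(f"unknown formula: {formula}")
    stem = run_date.isoformat()  # YYYY-MM-DD
    if formula == "holdout":
        # Holdout files are keyed by PR as well as by date.
        if pr is None:
            raise ValueError("holdout evidence needs a PR number")
        stem = f"{stem}-pr{pr}"  # assumption: no zero padding
    return PurePosixPath("evidence") / formula / f"{stem}.json"
```

For example, evidence_path("holdout", date(2026, 3, 19), pr=973) yields evidence/holdout/2026-03-19-pr973.json.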

Delivery Pattern

Every formula follows the same three-step pattern:

  1. Evidence file → committed to evidence/ on main
  2. Git artifacts (new code, attack vectors, evolved programs) → PR
  3. Human summary → issue comment with key metrics + link to evidence file
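
Step 3 could be sketched as a plain-text formatter like the one below. The function name and the comment layout are assumptions for illustration; the real summaries may look different.

```python
def summary_comment(formula: str, evidence_file: str, metrics: dict) -> str:
    """Format a short issue comment: key metrics plus a pointer to the evidence file."""
    lines = [f"{formula} run complete", ""]
    lines += [f"- {key}: {value}" for key, value in metrics.items()]
    lines += ["", f"Evidence: {evidence_file}"]
    return "\n".join(lines)
```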

Schema: evolution/YYYY-MM-DD.json

Records one optimizer evolution run.

{
  "date": "YYYY-MM-DD",
  "run_params": {
    "generations": 50,
    "population_size": 20,
    "seed": 42,
    "base_optimizer": "OptimizerV3"
  },
  "generation_stats": [
    {
      "generation": 1,
      "best_fitness": -12.4,
      "mean_fitness": -34.1,
      "worst_fitness": -91.2
    }
  ],
  "best_fitness": -8.7,
  "champion_file": "onchain/src/OptimizerV4.sol",
  "champion_commit": "abc1234",
  "verdict": "improved"
}

| Field | Type | Description |
|---|---|---|
| date | string (ISO) | Date of the run |
| run_params | object | Input parameters used |
| generation_stats | array | Per-generation fitness summary |
| best_fitness | number | Best fitness score achieved (higher is better: a less negative score means a lower loss for the LM) |
| champion_file | string | Repo-relative path to winning optimizer |
| champion_commit | string | Git commit SHA of the champion (if promoted) |
| verdict | string | "improved" or "no_improvement" |
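
A consumer of this file might sanity-check the per-generation numbers before trusting them. The sketch below assumes that higher fitness is better, which is inferred from the sample record (best -12.4, mean -34.1, worst -91.2) rather than stated outright; the function names are hypothetical.

```python
def check_generation_stats(generation_stats: list) -> list:
    """Return a list of problems; an empty list means the stats look consistent."""
    problems = []
    for g in generation_stats:
        # Assumption: higher fitness is better, so best >= mean >= worst.
        if not (g["best_fitness"] >= g["mean_fitness"] >= g["worst_fitness"]):
            problems.append(f"generation {g['generation']}: best/mean/worst out of order")
    return problems


def best_generation(generation_stats: list) -> dict:
    """Pick the generation entry with the highest best_fitness."""
    return max(generation_stats, key=lambda g: g["best_fitness"])
```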

Schema: red-team/YYYY-MM-DD.json

Records one adversarial red-team run against a candidate optimizer.

{
  "date": "YYYY-MM-DD",
  "candidate": "OptimizerV3",
  "optimizer_profile": "push3-default",
  "lm_eth_before": "1000000000000000000000",
  "lm_eth_after": "998500000000000000000",
  "eth_extracted": "1500000000000000000",
  "floor_held": false,
  "verdict": "floor_broken",
  "attacks": [
    {
      "strategy": "Flash buy + stake + recenter loop",
      "pattern": "wrap → buy → stake → recenter_multi → sell",
      "result": "DECREASED",
      "delta_bps": -150,
      "insight": "Rapid recenters pack ETH into floor while ratcheting it toward current price"
    }
  ]
}

| Field | Type | Description |
|---|---|---|
| date | string (ISO) | Date of the run |
| candidate | string | Optimizer under test |
| optimizer_profile | string | Named profile / push3 variant |
| lm_eth_before | integer (wei) | LM total ETH at start; may be written as a decimal string to avoid JS safe-integer truncation |
| lm_eth_after | integer (wei) | LM total ETH at end; same encoding as lm_eth_before |
| eth_extracted | integer (wei) | lm_eth_before - lm_eth_after (0 if floor held); same encoding as lm_eth_before |
| floor_held | boolean | true if no ETH was extracted |
| verdict | string | "floor_held" or "floor_broken" |
| attacks[].strategy | string | Human-readable strategy name |
| attacks[].pattern | string | Abstract op sequence (e.g. wrap → buy → stake) |
| attacks[].result | string | "DECREASED", "HELD", or "INCREASED" |
| attacks[].delta_bps | integer | LM ETH change in basis points |
| attacks[].insight | string | Key finding from this strategy |
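
The derived fields can be computed with exact integer arithmetic, since wei amounts are far larger than what a JS number can hold safely. The sketch below accepts either integers or decimal strings. The helper names, the floor rounding of basis points, and the mapping from the sign of delta_bps to result are assumptions made for illustration.

```python
def delta_bps(before_wei, after_wei) -> int:
    """LM ETH change in basis points (1 bps = 0.01%), rounded down."""
    before, after = int(before_wei), int(after_wei)
    if before <= 0:
        raise ValueError("lm_eth_before must be positive")
    # Python ints are arbitrary precision, so no wei digits are lost.
    return (after - before) * 10_000 // before


def classify(bps: int) -> str:
    """Assumed mapping from the sign of delta_bps to the result label."""
    if bps < 0:
        return "DECREASED"
    if bps > 0:
        return "INCREASED"
    return "HELD"


def run_summary(before_wei, after_wei) -> dict:
    """Derive eth_extracted, floor_held and verdict from the two LM balances."""
    before, after = int(before_wei), int(after_wei)
    extracted = max(before - after, 0)  # 0 if the floor held
    floor_held = extracted == 0
    return {
        "lm_eth_before": str(before),
        "lm_eth_after": str(after),
        "eth_extracted": str(extracted),
        "floor_held": floor_held,
        "verdict": "floor_held" if floor_held else "floor_broken",
    }
```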

Schema: holdout/YYYY-MM-DD-prNNN.json

Records a holdout quality gate evaluation for a specific PR.

{
  "date": "YYYY-MM-DD",
  "pr": 123,
  "candidate_commit": "abc1234",
  "scenarios": [
    {
      "name": "bear_market_crash",
      "passed": true,
      "lm_eth_delta_bps": 12,
      "notes": ""
    },
    {
      "name": "flash_buy_exploit",
      "passed": false,
      "lm_eth_delta_bps": -340,
      "notes": "Floor broken on 2000-trade run"
    }
  ],
  "scenarios_passed": 4,
  "scenarios_total": 5,
  "gate_passed": false,
  "verdict": "fail",
  "blocking_scenarios": ["flash_buy_exploit"]
}

| Field | Type | Description |
|---|---|---|
| date | string (ISO) | Date of evaluation |
| pr | integer | PR number being evaluated |
| candidate_commit | string | Commit SHA under test |
| scenarios | array | One entry per holdout scenario |
| scenarios[].name | string | Scenario identifier |
| scenarios[].passed | boolean | Whether LM ETH held or improved |
| scenarios[].lm_eth_delta_bps | integer | LM ETH change in basis points |
| scenarios[].notes | string | Free-text notes on failure mode |
| scenarios_passed | integer | Count of passing scenarios |
| scenarios_total | integer | Total scenarios run |
| gate_passed | boolean | true if all required scenarios passed |
| verdict | string | "pass" or "fail" |
| blocking_scenarios | array of strings | Scenario names that caused failure |
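
The aggregate fields follow from the per-scenario entries. In the sketch below every scenario is treated as required, which is an assumption: the schema says "all required scenarios" without defining how optional ones would be marked. The function name is hypothetical.

```python
def gate_summary(scenarios: list) -> dict:
    """Derive the aggregate gate fields from per-scenario results."""
    # Assumption: every scenario is required, so any failure blocks the gate.
    blocking = [s["name"] for s in scenarios if not s["passed"]]
    gate_passed = not blocking
    return {
        "scenarios_passed": len(scenarios) - len(blocking),
        "scenarios_total": len(scenarios),
        "gate_passed": gate_passed,
        "verdict": "pass" if gate_passed else "fail",
        "blocking_scenarios": blocking,
    }
```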

Schema: user-test/YYYY-MM-DD.json

Records a UX evaluation run across simulated personas.

{
  "date": "YYYY-MM-DD",
  "personas": [
    {
      "name": "crypto_native",
      "task": "stake_and_set_tax_rate",
      "completed": true,
      "friction_points": [],
      "screenshot_refs": ["tmp/screenshots/crypto_native_stake.png"],
      "notes": ""
    },
    {
      "name": "defi_newcomer",
      "task": "first_buy_and_stake",
      "completed": false,
      "friction_points": ["Tax rate slider label unclear", "No confirmation of stake tx"],
      "screenshot_refs": ["tmp/screenshots/defi_newcomer_confused.png"],
      "notes": "User abandoned at tax rate step"
    }
  ],
  "personas_completed": 1,
  "personas_total": 2,
  "critical_friction_points": ["Tax rate slider label unclear"],
  "verdict": "fail"
}

| Field | Type | Description |
|---|---|---|
| date | string (ISO) | Date of evaluation |
| personas | array | One entry per simulated persona |
| personas[].name | string | Persona identifier |
| personas[].task | string | Task the persona attempted |
| personas[].completed | boolean | Whether the task was completed |
| personas[].friction_points | array of strings | UX issues encountered |
| personas[].screenshot_refs | array of strings | Repo-relative paths to screenshots |
| personas[].notes | string | Free-text observations |
| personas_completed | integer | Count of personas who completed their task |
| personas_total | integer | Total personas evaluated |
| critical_friction_points | array of strings | Friction points that blocked task completion |
| verdict | string | "pass" if all personas completed, "fail" otherwise |
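
The counts and the verdict can be derived from the persona entries as sketched below. The function name is hypothetical, and treating an empty persona list as "fail" is an assumption. critical_friction_points is left out because the schema does not say how critical points are selected from each persona's friction_points.

```python
def user_test_summary(personas: list) -> dict:
    """Derive personas_completed, personas_total and verdict."""
    completed = sum(1 for p in personas if p["completed"])
    total = len(personas)
    # "pass" only if every persona completed its task; an empty run
    # is treated as "fail" here (assumption).
    verdict = "pass" if total > 0 and completed == total else "fail"
    return {
        "personas_completed": completed,
        "personas_total": total,
        "verdict": verdict,
    }
```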