Lab Notes · v1

How ValueArena measures character.

A full walk-through of the pipeline: how judgments are collected, how skills are fit with Bradley–Terry–Davidson, how uncertainty is quantified via non-parametric bootstrap, how judge trust is aggregated with EigenTrust, and how the final Elo numbers are pegged to a fixed anchor across constitutions.

Last updated 2026-04-17 · Code: invi-bhagyesh/EigenBenchData, invi-bhagyesh/ValueArena

Pipeline overview

Every ValueArena run starts with a spec: a constitution, a set of models, and a slice of scenarios. The spec drives five stages — collection, BTD fitting, bootstrap, EigenTrust, and upload — producing a single published row on the leaderboard.

Each stage is deterministic given its inputs, so a run can be re-played from the raw judgments without re-querying any model. The artifacts on HuggingFace (meta.json, summary.json, evaluations.jsonl) are sufficient to reproduce every number on the site.

Constitutions & scenarios

A constitution is a short document — typically 3–7 numbered criteria written in the second person — that defines the trait under evaluation (goodness, sarcasm, misalignment, and so on). Criteria are operational: each one names an observable behavior a judge can check against a transcript.

A scenario is a prompt that elicits behavior relevant to the constitution. The scenario set is fixed across all runs of the same constitution, so Elo comparisons across models are always over matched prompt distributions.

Collection: pairwise judgments

For each scenario $s$ and each ordered pair of contestants $(i, j)$, a judge $k$ is sampled from the judge pool. The judge reads the constitution, the scenario, and the two anonymized responses, and returns one of {i wins, j wins, tie}. Results are appended to evaluations.jsonl, one JSON line per judgment.

Two sampler modes are supported:

btd_d2: Round-robin at scenario level, diameter-2 contestant graph — every model plays every other on every scenario. Used for small pools (≤8 contestants).
uniform: Uniform random triads subject to a target games-per-model budget. Used for larger pools where full round-robin would be prohibitive.

The raw judgment tensor $W \in \mathbb{N}^{M \times M \times K}$ (for $M$ contestants and $K$ judges) counts, for each contestant pair and each judge, the number of wins of row over column. Ties contribute $\frac{1}{2}$ to both $W_{ij}$ and $W_{ji}$ when passed to the simple BTD fit; the full Davidson variant (§04) treats them as their own outcome.
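As a rough illustration of the counting step, the sketch below accumulates win counts from JSON lines into a sparse dictionary keyed by (row, column, judge). The field names `model_a`, `model_b`, `judge`, and `outcome` are hypothetical; the actual schema of evaluations.jsonl may differ.

```python
import json
from collections import defaultdict

def build_win_counts(lines, tie_weight=0.5):
    """Accumulate W[(i, j, k)]: wins of contestant i over j according to judge k.

    Field names are assumptions made for this sketch, not the real schema.
    """
    W = defaultdict(float)
    for line in lines:
        rec = json.loads(line)
        i, j, k = rec["model_a"], rec["model_b"], rec["judge"]
        if rec["outcome"] == "a":
            W[(i, j, k)] += 1.0
        elif rec["outcome"] == "b":
            W[(j, i, k)] += 1.0
        else:
            # A tie adds half a win in each direction (simple BTD fit).
            W[(i, j, k)] += tie_weight
            W[(j, i, k)] += tie_weight
    return W

# Three made-up judgments, for illustration only.
sample = [
    '{"model_a": "m1", "model_b": "m2", "judge": "j1", "outcome": "a"}',
    '{"model_a": "m1", "model_b": "m2", "judge": "j1", "outcome": "tie"}',
    '{"model_a": "m2", "model_b": "m1", "judge": "j2", "outcome": "a"}',
]
W = build_win_counts(sample)
```

A dictionary is used here instead of a dense array only to keep the sketch dependency-free; the same counts could be written into an $M \times M \times K$ array.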

Bradley–Terry–Davidson

Given a strength parameter $β_{i}$ per contestant, the Bradley–Terry model says the probability that $i$ beats $j$ on a single trial is

P (i ≻ j) = σ (β_{i} - β_{j}) = \frac{1}{1 + e ^{- (β_{i} - β_{j})}}

Davidson's extension adds a tie parameter $ν \geq 0$ (a nuisance parameter shared across pairs). Under Davidson, the three-way likelihood on a single pair is

P (i ≻ j) = \frac{e ^{β_{i}}}{e ^{β_{i}} + e ^{β_{j}} + ν e ^{(β_{i} + β_{j}) / 2}}

P (j ≻ i) = \frac{e ^{β_{j}}}{e ^{β_{i}} + e ^{β_{j}} + ν e ^{(β_{i} + β_{j}) / 2}}

P (tie) = \frac{ν e ^{(β_{i} + β_{j}) / 2}}{e ^{β_{i}} + e ^{β_{j}} + ν e ^{(β_{i} + β_{j}) / 2}}
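The three outcome probabilities can be sketched in a few lines. This assumes Davidson's original formulation, in which the tie weight is $ν$ times the geometric mean of the two strengths, i.e. $ν e^{(β_{i} + β_{j})/2}$; the function names are illustrative and not taken from the ValueArena code.

```python
import math

def bradley_terry_prob(beta_i, beta_j):
    """P(i beats j) under plain Bradley-Terry (no ties)."""
    return 1.0 / (1.0 + math.exp(-(beta_i - beta_j)))

def davidson_probs(beta_i, beta_j, nu):
    """Return (P(i wins), P(j wins), P(tie)) under Davidson's model.

    Assumes the geometric-mean tie term nu * exp((beta_i + beta_j) / 2).
    """
    w_i = math.exp(beta_i)
    w_j = math.exp(beta_j)
    w_tie = nu * math.exp(0.5 * (beta_i + beta_j))
    z = w_i + w_j + w_tie
    return w_i / z, w_j / z, w_tie / z

p_win, p_loss, p_tie = davidson_probs(0.4, -0.1, nu=0.5)
```

With $ν = 0$ the tie probability vanishes and the win probability reduces to the Bradley–Terry sigmoid. Adding the same constant to both strengths leaves all three probabilities unchanged, which is the translation invariance the fit has to pin down.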

We fit $β, ν$ by maximizing the total log-likelihood over all judgments, with an $ℓ_{2}$ regularizer on $β$ to pin down the global shift (the model is translation-invariant) and stabilize the fit when a contestant has very lopsided results. Optimization uses L-BFGS; convergence is reached in a few dozen iterations.
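A minimal sketch of the penalized fit follows. To stay dependency-free it swaps L-BFGS for plain gradient ascent, so it illustrates the objective rather than the optimizer used in the pipeline. It also assumes the geometric-mean tie term $ν e^{(β_{i} + β_{j})/2}$, parameterizes $ν = e^{θ}$ to keep it positive, and uses a made-up penalty strength `lam`; the game list is toy data.

```python
import math

def fit_btd(games, models, lam=0.01, lr=0.5, steps=2000):
    """Ridge-penalized Davidson fit by gradient ascent (illustrative only).

    games: list of (i, j, outcome), outcome in {"i", "j", "tie"}.
    Objective: mean log-likelihood minus (lam / 2) * sum(beta^2).
    Returns (beta dict, nu).
    """
    beta = {m: 0.0 for m in models}
    theta = 0.0  # log(nu)
    n = len(games)
    for _ in range(steps):
        g_beta = {m: -lam * beta[m] for m in models}  # ridge gradient
        g_theta = 0.0
        for i, j, outcome in games:
            w_i = math.exp(beta[i])
            w_j = math.exp(beta[j])
            w_t = math.exp(theta + 0.5 * (beta[i] + beta[j]))
            z = w_i + w_j + w_t
            p_i, p_j, p_t = w_i / z, w_j / z, w_t / z
            y_i = 1.0 if outcome == "i" else 0.0
            y_j = 1.0 if outcome == "j" else 0.0
            y_t = 1.0 if outcome == "tie" else 0.0
            # Observed minus expected; a tie counts half toward each side.
            g_beta[i] += (y_i - p_i + 0.5 * (y_t - p_t)) / n
            g_beta[j] += (y_j - p_j + 0.5 * (y_t - p_t)) / n
            g_theta += (y_t - p_t) / n
        for m in models:
            beta[m] += lr * g_beta[m]
        theta += lr * g_theta
    return beta, math.exp(theta)

# Toy data: A mostly beats B and C, B mostly beats C.
games = (
    [("A", "B", "i")] * 8 + [("A", "B", "j")] * 2
    + [("B", "C", "i")] * 7 + [("B", "C", "tie")] * 2 + [("B", "C", "j")] * 1
    + [("A", "C", "i")] * 9 + [("A", "C", "tie")] * 1
)
beta, nu = fit_btd(games, ["A", "B", "C"])
```

The likelihood gradients sum to zero across contestants, so with the ridge term and a zero start the strengths stay centered at zero, which is how the penalty removes the free global shift.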

Bootstrap intervals

Point estimates of $β$ are noisy: a single lucky win on a small scenario set can shift an Elo by tens of points. To report 95% CIs we use a non-parametric bootstrap at the judgment level: resample the rows of evaluations.jsonl with replacement, refit BTD, and collect the resulting $β^{(b)}$ for $b = 1, \dots, B$ with $B = 1000$.

CI_{95} (β_{i}) = [q_{0.025} (\{β_{i}^{(b)}\}_{b}), q_{0.975} (\{β_{i}^{(b)}\}_{b})]

The summary.json stored on HuggingFace records the bootstrap mean (not the point MLE) and the two empirical quantiles per model. Using the bootstrap mean keeps the reported value consistent with the CI calculation and absorbs a small amount of non-identifiability at the boundary (models with 0% or 100% win rates).
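The percentile bootstrap itself is short enough to sketch. The version below takes any `statistic` callable; a plain win rate stands in for the BTD refit to keep the example self-contained. The helper names, the linear-interpolation quantile rule, and the toy data are assumptions of this sketch.

```python
import random

def quantile(sorted_vals, q):
    """Linearly interpolated empirical quantile of a sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac

def bootstrap_ci(rows, statistic, B=1000, seed=0):
    """Resample rows with replacement B times; return (mean, q_0.025, q_0.975)."""
    rng = random.Random(seed)
    n = len(rows)
    stats = sorted(
        statistic([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    return sum(stats) / B, quantile(stats, 0.025), quantile(stats, 0.975)

# Toy judgments: 1 = win, 0 = loss; the statistic is the win rate.
rows = [1] * 70 + [0] * 30
mean, ci_lo, ci_hi = bootstrap_ci(rows, lambda r: sum(r) / len(r))
```

In the pipeline described above, `statistic` would instead refit BTD on the resampled judgments and return one contestant's $β_{i}$.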

EigenTrust

Not every judge is equally reliable. A weak or sycophantic model can pollute the win counts, biasing $β$. Rather than hand-select judges, we let the judges vote on each other and solve for the stationary trust distribution: the classic EigenTrust setup adapted to the arena.

Let $C \in \mathbb{R}^{K \times K}$ be the row-stochastic matrix where $C_{kl}$ is the fraction of times judge $k$ agrees with the BTD-implied ordering when judge $l$ would have disagreed with them. The trust vector $t$ is the stationary distribution of the damped chain

t^{(n + 1)} = (1 - a) C^{⊤} t^{(n)} + a p

where $p$ is a uniform prior over judges and $a = 0.1$ is the teleport probability. Iteration converges in under 50 steps. Final trust scores are stored per judge in meta.json and shown on the leaderboard hover cards.
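The damped iteration can be sketched as a plain power iteration. The matrix below is a made-up 3-judge example and the stopping tolerance is an assumption of the sketch; only the update rule and $a = 0.1$ come from the text.

```python
def eigentrust(C, a=0.1, tol=1e-12, max_iter=1000):
    """Iterate t <- (1 - a) * C^T t + a * p with a uniform prior p.

    C is a row-stochastic K x K matrix given as a list of lists.
    """
    K = len(C)
    p = [1.0 / K] * K
    t = p[:]
    for _ in range(max_iter):
        # (C^T t)[l] = sum over k of C[k][l] * t[k]
        new_t = [
            (1.0 - a) * sum(C[k][l] * t[k] for k in range(K)) + a * p[l]
            for l in range(K)
        ]
        delta = sum(abs(x - y) for x, y in zip(new_t, t))
        t = new_t
        if delta < tol:
            break
    return t

# Hypothetical matrix: judges 0 and 1 receive most of the weight.
C = [
    [0.0, 0.9, 0.1],
    [0.8, 0.0, 0.2],
    [0.6, 0.4, 0.0],
]
trust = eigentrust(C)
```

Because each row of $C$ sums to 1 and $p$ is a probability vector, every iterate stays normalized, so no renormalization step is needed.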

Elo pegging across constitutions

BTD strengths live on an arbitrary log-odds scale — they're only identified up to a shift. For display we transform to Elo:

E_{i} = 1500 + \frac{400}{\ln 10} \cdot β_{i} + c

The constant $c$ is what pegging chooses. Three reference models — gpt-4o, claude-4-sonnet, gemini-2.5-pro — are scored in every run, and $c$ is set so that their mean Elo equals 1500 within that run:

c = 1500 - \frac{1}{|R|} \sum_{r \in R} (1500 + \frac{400}{\ln 10} β_{r}) = - \frac{400}{\ln 10} \cdot \bar{β}_{R}

where $R$ is the set of reference models present in the run.
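A small sketch of the pegging step, assuming strengths arrive as a name-to-$β$ mapping. The $β$ values and the `candidate` model are invented for illustration; only the three reference names come from the text.

```python
import math

SCALE = 400.0 / math.log(10.0)  # Elo points per unit of log-odds

def peg_elo(beta, references):
    """Convert BTD strengths to Elo so the present references average 1500."""
    present = [r for r in references if r in beta]
    if not present:
        raise ValueError("no reference model present in this run")
    beta_bar = sum(beta[r] for r in present) / len(present)
    c = -SCALE * beta_bar
    return {m: 1500.0 + SCALE * b + c for m, b in beta.items()}

# Illustrative strengths only; not results from any real run.
beta = {
    "gpt-4o": 0.2,
    "claude-4-sonnet": 0.5,
    "gemini-2.5-pro": -0.1,
    "candidate": 0.8,
}
refs = ["gpt-4o", "claude-4-sonnet", "gemini-2.5-pro"]
elo = peg_elo(beta, refs)
```

Since $c$ cancels the mean reference strength, each model's Elo is $1500 + \frac{400}{\ln 10}(β_{i} - \bar{β}_{R})$, so any global shift in $β$ drops out.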

Compute workflow

Collection, BTD fitting, and bootstrap are all CPU-bound; only model inference needs a GPU. Two paths exist, and we use the second for anything beyond single-spec experiments.

# train all 11 openchar runs locally in 3 parallel workers,
# then upload one constitution at a time
.venv/bin/python scripts/run_local_train_upload.py \
    --group openchar \
    --parallel 3

Limits & caveats

Judges are not neutral. Using frontier LLMs as judges imports their preferences. EigenTrust mitigates this somewhat, since unreliable judges get down-weighted, but systematic agreement across the pool still shows up as “truth”.

Anchors may drift between constitutions. gpt-4o is not equally “average” on goodness and on misalignment. Pegging to the mean of three refs controls some of this, but cross-trait comparisons should be read as directional, not absolute.

Bootstrap is judgment-level, not scenario-level. If a single scenario happens to favor one model, resampling judgments won't erase it; only resampling scenarios would. For small scenario counts, CIs are therefore narrower than the true epistemic uncertainty.
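For contrast, a scenario-level (cluster) resample could look like the sketch below. This is not what the pipeline does; it only illustrates the alternative named in this caveat. The `scenario` field name and the toy rows are hypothetical.

```python
import random
from collections import defaultdict

def resample_by_scenario(rows, rng, key="scenario"):
    """Draw scenarios with replacement, keeping each drawn scenario's judgments together."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    ids = sorted(groups)
    drawn = [ids[rng.randrange(len(ids))] for _ in ids]
    return [row for sid in drawn for row in groups[sid]]

# Toy data: 3 scenarios with 4 judgments each.
rows = [{"scenario": f"s{s}", "judgment": n} for s in range(3) for n in range(4)]
replicate = resample_by_scenario(rows, random.Random(0))
```

Refitting on such replicates lets scenario-to-scenario variation show up in the interval, at the cost of noisier estimates when the scenario count is small.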

Finite-sample BTD bias. When a contestant wins or loses every game, the MLE diverges; the ridge penalty pulls such strengths toward zero but not to any principled value. Ties (via the Davidson parameter) help, but they are rare.

Trait orthogonality is not enforced. Constitutions were written independently, and some traits correlate (e.g. loving and goodness tend to move together in our runs). The cross-constitution Pareto on the leaderboard visualizes this.