AgentFairBench measures demographic bias in LLM agent actions
Researchers introduce AgentFairBench, a reproducible benchmark that measures demographic disparity in the actions of LLM agents across hiring, lending, and medical triage domains, rather than in their text outputs alone.
Score breakdown
AgentFairBench shifts fairness measurement from LLM text outputs to agent decisions in consequential domains, and its arity-matched null methodology corrects a ~2.4× overstatement of disparity that prior comparison approaches produce.
Triveni Morla, Rohith Reddy Bellibaltu, and Manpreet Singh argue that existing LLM fairness evaluations grade text answers rather than agent actions, leaving a critical gap as LLM agents increasingly screen job applicants, recommend credit decisions, and triage patients. AgentFairBench addresses this gap with a cheap, reproducible, multi-domain benchmark grounded in the Bias Conduction Framework (BCF). It evaluates synthetic, demographically neutral profiles in counterfactual matched sets that vary only a name-coded race × gender signal, following the Bertrand and Mullainathan tradition, across three regulator-anchored domains: hiring, lending, and medical triage. Agents are tested under four scaffolds of increasing agency: direct, chain-of-thought, multi-agent deliberation, and tool-augmented.
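The matched-set idea can be sketched as follows. This is a minimal illustration, not the benchmark's code: the cell labels, the `placeholder_name` helper, and the profile fields are all hypothetical, and a 3 × 2 grid is assumed only because the article mentions six groups.

```python
from itertools import product

# Hypothetical cell labels; the benchmark's real name lists are not given here.
RACE_CODES = ["race_1", "race_2", "race_3"]
GENDER_CODES = ["female", "male"]


def placeholder_name(race, gender):
    """Stand-in for a name-coded signal (assumption, not a real name list)."""
    return f"<{race}/{gender} name>"


def build_matched_set(base_profile):
    """Return one profile per race x gender cell, identical except for the name."""
    matched = []
    for race, gender in product(RACE_CODES, GENDER_CODES):
        variant = dict(base_profile)          # copy so the base stays untouched
        variant["name"] = placeholder_name(race, gender)
        variant["_cell"] = (race, gender)     # kept for analysis, hidden from the agent
        matched.append(variant)
    return matched


# Hypothetical hiring profile with no demographic content of its own.
base = {"domain": "hiring", "years_experience": 5, "degree": "BSc"}
matched_set = build_matched_set(base)
print(len(matched_set), "variants, differing only in 'name'")
```

Because every variant shares the same underlying profile, any difference in the agent's action within a set can be attributed to the name signal or to sampling noise.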
The benchmark's NumPy-only harness computes four metrics (counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity) with bootstrap confidence intervals, paired tests, and false-discovery-rate control, at a cost of single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary allows external model submissions.
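Two of these metrics can be sketched as below. The article does not give exact definitions, so the ones here are assumptions: flip rate is taken as the share of matched sets whose variants do not all receive the same action, and MASD as the mean absolute pairwise score difference within a set, averaged over sets. The sketch uses only the Python standard library to stay self-contained, whereas the real harness is described as NumPy-based; the toy data is invented.

```python
import random


def flip_rate(actions_by_set):
    """Fraction of matched sets whose variants did not all get the same action."""
    flipped = sum(1 for actions in actions_by_set if len(set(actions)) > 1)
    return flipped / len(actions_by_set)


def masd(scores_by_set):
    """Mean absolute pairwise score difference within a set, averaged over sets."""
    per_set = []
    for scores in scores_by_set:
        diffs = [abs(a - b) for i, a in enumerate(scores) for b in scores[i + 1:]]
        per_set.append(sum(diffs) / len(diffs))
    return sum(per_set) / len(per_set)


def bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling whole matched sets with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        statistic([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Invented toy data: four matched sets with three variants each.
actions = [
    ["advance", "advance", "advance"],
    ["advance", "reject", "advance"],
    ["reject", "reject", "reject"],
    ["advance", "advance", "advance"],
]
scores = [
    [7.0, 7.0, 7.0],
    [6.0, 4.0, 6.0],
    [2.0, 2.0, 2.0],
    [8.0, 8.0, 7.0],
]

print("flip rate:", flip_rate(actions))
print("MASD:", masd(scores))
print("flip-rate 95% CI:", bootstrap_ci(actions, flip_rate))
```

Resampling whole matched sets, rather than individual decisions, keeps the counterfactual pairing intact inside each bootstrap replicate.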
A central methodological contribution is the arity-matched null methodology. The spread across six groups is expected to be larger than the difference between two runs even when only noise is present, so the authors show that naively comparing a six-group score spread against a two-run noise difference overstates disparity by approximately 2.4× through statistic arity alone. Applying the corrected approach to their pilot of 864 decisions (plus a test-retest replication), `claude haiku 4 5` showed no demographic effect above sampling noise: 0 of 120 pairwise and 0 of 9 omnibus contrasts survived correction. A planted-bias test confirmed that the instrument detects disparity when it is genuinely present. Code, data, and the harness are released under open licenses.
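The arity effect can be illustrated with a small simulation. This is a sketch under a simplifying assumption of independent standard-normal noise, not the authors' procedure; the function names are hypothetical. Under that assumption the expected max-minus-min of six draws is about 2.2 times that of two draws, so the paper's ~2.4× figure, which comes from its own data and statistic, is not reproduced exactly here.

```python
import random
import statistics


def spread(values):
    """Max-minus-min; for two values this is just their absolute difference."""
    return max(values) - min(values)


def mean_null_spread(k, n_trials=20000, seed=0):
    """Average spread of k pure-noise draws (assumed i.i.d. standard normal)."""
    rng = random.Random(seed)
    spreads = [
        spread([rng.gauss(0.0, 1.0) for _ in range(k)]) for _ in range(n_trials)
    ]
    return statistics.fmean(spreads)


two_run_noise = mean_null_spread(k=2)    # mismatched baseline: two-run difference
six_group_null = mean_null_spread(k=6)   # arity-matched baseline: six-way spread
inflation = six_group_null / two_run_noise

print(f"two-run noise difference: {two_run_noise:.3f}")
print(f"six-group null spread:    {six_group_null:.3f}")
print(f"inflation from arity:     {inflation:.2f}x")
```

An observed six-group spread should therefore be compared against the six-draw null, not the two-draw one; otherwise pure noise already looks like a disparity of more than double the baseline.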
Key facts
- AgentFairBench benchmarks demographic disparity in LLM agent *actions* (not text outputs) across hiring, lending, and medical triage domains.
- It is grounded in the Bias Conduction Framework (BCF) and uses counterfactual matched profiles varying only a name-coded race × gender signal, in the Bertrand and Mullainathan tradition.
- Four agent scaffolds of increasing agency are tested: direct, chain-of-thought, multi-agent deliberation, and tool-augmented.
- The harness computes counterfactual flip rate, MASD, action-rate disparity, and tool-invocation disparity, with bootstrap CIs and false-discovery-rate control, for single-digit dollars per model.
- A key methodological finding: comparing a six-group score spread against a two-run noise difference overstates disparity by ~2.4× through statistic arity alone.
- In a pilot of 864 decisions, `claude haiku 4 5` showed no demographic effect above sampling noise: 0 of 120 pairwise and 0 of 9 omnibus contrasts survived correction.
- A planted-bias test confirmed the instrument detects disparity when it is genuinely present; code, data, and harness are released under open licenses.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.