Jun 15, 2026 · 1 min read · Research Papers

AgentFairBench measures demographic bias in LLM agent actions

Researchers introduce AgentFairBench, a reproducible benchmark that measures demographic disparity in the actions of LLM agents across hiring, lending, and medical triage domains — not just their text outputs.

ArXiv·Triveni Morla, Rohith Reddy Bellibaltu, Manpreet Singh

Composite: 7.1 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

AgentFairBench shifts fairness measurement from LLM text outputs to agent decisions in consequential domains. Its arity-matched null methodology corrects a ~2.4× overstatement of disparity that arises when a six-group score spread is compared naively against a two-run noise difference.

Summary: our read of the original

Triveni Morla, Rohith Reddy Bellibaltu, and Manpreet Singh argue that existing LLM fairness evaluations grade text answers rather than agent actions, leaving a critical gap as LLM agents increasingly screen job applicants, recommend credit decisions, and triage patients. AgentFairBench addresses this with a cheap, reproducible, multi-domain benchmark grounded in the Bias Conduction Framework (BCF). It evaluates synthetic, demographically neutral profiles in counterfactual matched sets that vary only a name-coded race × gender signal, in the Bertrand and Mullainathan tradition, across three regulator-anchored domains: hiring, lending, and medical triage. Agents are tested under four scaffolds of increasing agency: direct, chain-of-thought, multi-agent deliberation, and tool-augmented.
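The counterfactual matched-set idea can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual data generator: the field names, the `matched_set` helper, the placeholder group labels, and the assumption that the six groups come from three race categories × two genders are all hypothetical.

```python
from itertools import product

# Hypothetical placeholder labels; the benchmark's real name lists are not reproduced here.
# Assumption: six groups = 3 race categories x 2 genders.
RACES = ("race_a", "race_b", "race_c")
GENDERS = ("female", "male")


def matched_set(base_profile: dict, name_for: dict) -> list[dict]:
    """Return one copy of base_profile per race x gender cell, differing only in the name."""
    variants = []
    for race, gender in product(RACES, GENDERS):
        variant = dict(base_profile)             # every qualification stays identical
        variant["name"] = name_for[(race, gender)]
        variant["_group"] = f"{race}/{gender}"   # bookkeeping only, never shown to the agent
        variants.append(variant)
    return variants


# Illustrative base profile (hypothetical fields).
base = {"domain": "hiring", "years_experience": 6, "degree": "BSc"}
names = {cell: f"<name for {cell[0]}/{cell[1]}>" for cell in product(RACES, GENDERS)}
variants = matched_set(base, names)
```

Because every variant shares the same base fields, any difference in the agent's action across a matched set can only be attributed to the name signal or to sampling noise.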

The benchmark's NumPy-only harness computes four metrics: counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity. Each is reported with bootstrap confidence intervals, paired tests, and false-discovery-rate control, and a full run costs single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary accepts external model submissions.
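To make two of these metrics concrete, here is a NumPy-only sketch. The definitions are assumptions for illustration (the source names the metrics but does not spell out formulas): flip rate is taken as the share of matched sets in which not every variant received the same binary action, and MASD as the mean absolute pairwise score difference within a set. The bootstrap resamples whole matched sets, and false-discovery-rate control is shown with the standard Benjamini-Hochberg step-up procedure, which may or may not be the variant the authors use.

```python
import numpy as np


def flip_rate(actions: np.ndarray) -> float:
    """actions: (n_sets, n_groups) binary array. Share of sets whose variants disagree."""
    return float(np.mean(actions.min(axis=1) != actions.max(axis=1)))


def masd(scores: np.ndarray) -> float:
    """scores: (n_sets, n_groups). Mean absolute pairwise score difference within sets."""
    i, j = np.triu_indices(scores.shape[1], k=1)   # all unordered group pairs
    return float(np.mean(np.abs(scores[:, i] - scores[:, j])))


def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling matched sets (rows) with replacement."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    draws = np.array([stat(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)])
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def benjamini_hochberg(pvals, q=0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected at false-discovery rate q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])           # largest rank that passes
        reject[order[: k + 1]] = True
    return reject


# Toy data (made up for illustration): 4 matched sets x 6 groups.
actions = np.array([[1, 1, 1, 1, 1, 1],
                    [1, 0, 1, 1, 1, 1],
                    [0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1]])
toy_scores = np.array([[1.0, 1.0, 1.0],
                       [0.0, 2.0, 4.0]])

toy_flip = flip_rate(actions)                       # one of four sets disagrees
toy_masd = masd(toy_scores)
toy_ci = bootstrap_ci(actions, flip_rate)
toy_reject = benjamini_hochberg([0.001, 0.02, 0.04, 0.3])
```

Resampling rows rather than individual decisions keeps each matched set intact, which respects the paired structure of the design.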

A central methodological contribution is the arity-matched null methodology. The authors show that naively comparing a six-group score spread against a two-run noise difference overstates disparity by approximately 2.4× due to statistic arity alone: a spread taken over six values is expected to be larger than a gap between two, even when every value is pure noise. Applying the corrected approach to their pilot of 864 decisions (plus a test-retest replication), `claude haiku 4 5` showed no demographic effect above sampling noise, with 0 of 120 pairwise and 0 of 9 omnibus contrasts surviving correction. A planted-bias test confirmed that the instrument detects disparity when it is genuinely present. Code, data, and the harness are released under open licenses.
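The arity effect is easy to reproduce in a toy simulation. The sketch below is an assumption-laden illustration, not the authors' procedure: it draws independent standard-normal noise, compares the mean max-minus-min range over six values with the mean absolute gap between two, and then shows one possible arity-matched null, a within-set permutation test whose null statistic is also a six-group spread. Under this Gaussian toy the ratio comes out near 2.25, the same order as the ~2.4× reported; the paper's exact statistic presumably differs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim = 200_000

# Pure noise: no group carries any real effect.
six = rng.standard_normal((n_sim, 6))
two = rng.standard_normal((n_sim, 2))

spread_six = six.max(axis=1) - six.min(axis=1)   # range across six "groups"
diff_two = np.abs(two[:, 0] - two[:, 1])         # gap between two "runs"
ratio = spread_six.mean() / diff_two.mean()      # inflation from arity alone


def group_spread(scores: np.ndarray) -> float:
    """scores: (n_sets, n_groups). Max minus min of the per-group mean scores."""
    means = scores.mean(axis=0)
    return float(means.max() - means.min())


def permutation_pvalue(scores: np.ndarray, n_perm: int = 2000, seed: int = 1) -> float:
    """Arity-matched null: shuffle group labels within each matched set, then
    recompute the same six-group spread, so both sides use the same statistic."""
    perm_rng = np.random.default_rng(seed)
    observed = group_spread(scores)
    null = np.array([group_spread(perm_rng.permuted(scores, axis=1))
                     for _ in range(n_perm)])
    return float((1 + np.sum(null >= observed)) / (1 + n_perm))


# Planted-bias check on made-up data: 150 sets (arbitrary), group 0 shifted by +1.
planted = rng.standard_normal((150, 6))
planted[:, 0] += 1.0
p_planted = permutation_pvalue(planted)
```

Comparing the observed spread against a null built from the same six-way statistic removes the inflation, while the planted shift still yields a small p-value, mirroring the role of the planted-bias test described above.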

Key facts

01 AgentFairBench benchmarks demographic disparity in LLM agent *actions* (not text outputs) across hiring, lending, and medical triage domains.
02 It is grounded in the Bias Conduction Framework (BCF) and uses counterfactual matched profiles varying only a name-coded race × gender signal, in the Bertrand and Mullainathan tradition.
03 Four agent scaffolds of increasing agency are tested: direct, chain-of-thought, multi-agent deliberation, and tool-augmented.
04 The harness computes counterfactual flip rate, MASD, action-rate disparity, and tool-invocation disparity, with bootstrap CIs and false-discovery-rate control, for single-digit dollars per model.
05 A key methodological finding: comparing a six-group score spread against a two-run noise difference overstates disparity by ~2.4× through statistic arity alone.
06 In a pilot of 864 decisions, `claude haiku 4 5` showed 0 of 120 pairwise and 0 of 9 omnibus contrasts surviving correction, meaning no demographic effect above sampling noise.
07 A planted-bias test confirmed the instrument detects disparity when it is genuinely present; code, data, and harness are released under open licenses.

Topics

#benchmarks #safety #agent-framework #fairness #bias-detection

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.
