Jun 9, 2026 · 1 min read · Research Papers

MPC-Patch-Bench exposes LLMs' weak cryptographic patch skills

Yukuan Zhang, Mengxin Zheng, and Qian Lou introduce MPC-Patch-Bench, the first repository-level benchmark for evaluating LLM code repair on Secure Multi-Party Computation software, revealing that the strongest tested LLM verifiably resolves only 17.1% of tasks.

ArXiv · Yukuan Zhang, Mengxin Zheng, Qian Lou

Composite score: 6.7 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The benchmark shows that functional pass rates overstate LLM patch quality on security-critical MPC code: up to 40% of functionally passing patches are rejected for cryptographic or numerical-fidelity violations. Cryptographic and numerical-fidelity verification is therefore a necessary, and currently missing, evaluation layer for agentic code repair in this domain.

Summary: our read of the original

Yukuan Zhang, Mengxin Zheng, and Qian Lou present MPC-Patch-Bench, a repository-level benchmark targeting a gap in LLM evaluation: no existing benchmark assesses code repair on Secure Multi-Party Computation (MPC) software. The paper identifies three reasons why transplanting general-purpose benchmarks such as SWE-bench fails in this domain — MPC repositories are dominated by generic Python infrastructure rather than cryptographic logic, high-value MPC fixes lack the standardized tests that rigid extraction pipelines require, and standard fail-to-pass evaluation is insufficient for code that must also satisfy cryptographic safety guarantees. MPC is described as increasingly deployed for privacy-preserving machine learning, biomedical collaboration, and secure analytics, while existing MPC-specific code-synthesis efforts cover only operator-level or single-framework tasks.

MPC-Patch-Bench is organized around two components. The Data Curation Framework combines a domain-specific curation agent that filters raw pull requests through three cryptographic layers with a human-AI completion engine that synthesizes missing problem statements and Fail-to-Pass/Pass-to-Pass tests, producing 205 fully verified instances. The MPC Verifier adds dedicated security and numerical-fidelity checks through dynamic differential testing against plaintext oracles and MPC-specific static analysis rules that flag unsafe reveals, insecure arithmetic, and illegal public/private casts.
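As a rough illustration of what differential testing against a plaintext oracle can look like, here is a minimal sketch. It is not the paper's MPC Verifier. The toy three-party additive secret sharing, the 16-bit fixed-point encoding, and every name in it (`mpc_sum`, `differential_test`, `plaintext_sum`) are assumptions made for this example.

```python
import random

RING = 2 ** 64      # shares live in the ring Z_{2^64} (assumed for illustration)
FRAC_BITS = 16      # fixed-point fractional bits (assumed for illustration)

def encode(x, frac_bits=FRAC_BITS):
    """Map a float to a fixed-point ring element."""
    return round(x * (1 << frac_bits)) % RING

def decode(v, frac_bits=FRAC_BITS):
    """Map a ring element back to a float, treating the upper half as negative."""
    if v >= RING // 2:
        v -= RING
    return v / (1 << frac_bits)

def share(v, n_parties=3):
    """Split v into additive shares that sum to v modulo RING."""
    shares = [random.randrange(RING) for _ in range(n_parties - 1)]
    shares.append((v - sum(shares)) % RING)
    return shares

def mpc_sum(values, frac_bits=FRAC_BITS):
    """Toy secure sum: each party adds its own shares, then the total is opened."""
    shared = [share(encode(x, frac_bits)) for x in values]
    party_totals = [sum(column) % RING for column in zip(*shared)]
    return decode(sum(party_totals) % RING, frac_bits)

def plaintext_sum(values):
    """Plaintext oracle: the same computation with no cryptography."""
    return sum(values)

def differential_test(candidate, oracle, trials=200, tol=1e-3, seed=0):
    """Return True if candidate matches oracle within tol on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-100, 100) for _ in range(rng.randint(1, 8))]
        if abs(candidate(xs) - oracle(xs)) > tol:
            return False
    return True

# A hypothetical "patch" that drops all fractional bits still runs and returns
# plausible numbers, but it loses numerical fidelity.
def lossy_sum(values):
    return mpc_sum(values, frac_bits=0)

print(differential_test(mpc_sum, plaintext_sum))    # True
print(differential_test(lossy_sum, plaintext_sum))  # False
```

The lossy variant is the kind of patch a coarse functional test could accept, for example one that only checks the output type or an integer-valued case, while a tolerance check against the oracle rejects it.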

Evaluation results reveal a stark gap between functional and verified performance: the strongest LLM tested functionally resolves only 22.9% of tasks, and the MPC Verifier reduces that figure further to 17.1%. Critically, up to 40% of functionally passing patches are rejected for cryptographic or numerical-fidelity violations, demonstrating that functional correctness alone is a misleading signal for security-critical MPC code repair.
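To make the two headline percentages concrete, the short calculation below converts them into approximate task counts. It assumes both rates are measured over all 205 instances; the counts are inferred by rounding and are not stated in the summary. The summary also does not say which model or setting the "up to 40%" figure comes from, so the share computed here applies only to the strongest model under that assumption.

```python
TOTAL_INSTANCES = 205          # benchmark size reported in the summary

# Inferred counts (assumption: both rates use all 205 instances as denominator).
functional = round(0.229 * TOTAL_INSTANCES)   # functionally resolved tasks
verified = round(0.171 * TOTAL_INSTANCES)     # tasks still resolved after the MPC Verifier

rejected = functional - verified              # functionally passing but rejected
rejected_share = rejected / functional        # fraction of passing patches rejected

print(functional, verified, rejected)         # 47 35 12
print(f"{rejected_share:.1%}")                # 25.5%
```

Under these assumptions, roughly one in four of the strongest model's functionally passing patches does not survive verification.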

Key facts

01 No repository-level benchmark for evaluating LLM code repair on MPC software previously existed.
02 SWE-bench and similar general-purpose benchmarks fail on MPC codebases for three structural reasons identified by the authors.
03 MPC-Patch-Bench contains 205 fully verified instances curated via a domain-specific agent filtering pull requests through three cryptographic layers.
04 The MPC Verifier uses dynamic differential testing against plaintext oracles and static analysis rules flagging unsafe reveals, insecure arithmetic, and illegal public/private casts.
05 The strongest evaluated LLM functionally resolves only 22.9% of MPC-Patch-Bench tasks.
06 After MPC Verifier checks, verified resolution drops to 17.1%.
07 Up to 40% of functionally passing patches are rejected for cryptographic or numerical-fidelity violations.
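The static-analysis side of key fact 04 can be pictured with a small sketch built on Python's standard `ast` module. It flags one pattern only: a call to a method named `reveal` inside an `if` or `while` condition, which would make control flow depend on an opened secret. The method name `reveal`, the rule itself, and the sample `PATCH` are hypothetical and chosen for illustration; the paper's actual rules are not described in this summary.

```python
import ast

class UnsafeRevealFinder(ast.NodeVisitor):
    """Collect line numbers of `.reveal()` calls used in branch conditions."""

    def __init__(self):
        self.findings = []

    def _check_condition(self, condition):
        # Walk the condition expression and look for `<expr>.reveal(...)` calls.
        for node in ast.walk(condition):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr == "reveal"):
                self.findings.append(node.lineno)

    def visit_If(self, node):
        self._check_condition(node.test)
        self.generic_visit(node)   # keep descending into nested statements

    def visit_While(self, node):
        self._check_condition(node.test)
        self.generic_visit(node)

def find_unsafe_reveals(source):
    """Return sorted line numbers where a reveal drives control flow."""
    finder = UnsafeRevealFinder()
    finder.visit(ast.parse(source))
    return sorted(set(finder.findings))

# Hypothetical patch under review (line 1 of the string is blank).
PATCH = '''
def clip(x, bound):
    if (x > bound).reveal():
        return bound
    return x

def total(x):
    return x.sum()
'''

print(find_unsafe_reveals(PATCH))  # [3]
```

A rule this narrow would miss many leaks and could flag intentional reveals, so it shows the mechanism of a syntactic check and nothing more.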

Topics

#benchmarks #code-generation #safety #llm-evaluation #mpc

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.
