Apr 22, 2026 · 1 min read · Research Papers

RespondeoQA launches first Latin-English QA benchmark

Marisa Hudspeth, Patrick J. Burns, and Brendan O'Connor introduce RespondeoQA, the first question-answering benchmark centered on Latin, containing ~7,800 bilingual Latin-English question-answer pairs drawn from pedagogical sources spanning the 1800s to the present.

ArXiv · Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor

Composite score: 5.7 out of 10

Score weights: Novelty 25%, Impact 43%, Credibility 12%, Depth 20%

Why it matters

Practitioners building or evaluating LLMs for low-resource or classical languages can use RespondeoQA as a concrete benchmark to probe model weaknesses in skill-based linguistic tasks, and adapt its creation pipeline for other underrepresented languages.

Summary: our read of the original

Marisa Hudspeth, Patrick J. Burns, and Brendan O'Connor introduce RespondeoQA, which they describe as the first QA benchmark centered on Latin. The dataset contains approximately 7,800 question-answer pairs drawn from Latin pedagogical sources spanning the 1800s to the present, including exams, quizbowl-style trivia, and textbooks. After automated extraction and cleaning followed by manual review, the benchmark covers a diverse range of question types: knowledge- and skill-based questions, multihop reasoning, constrained translation, and mixed language pairs across Latin and English.

As a case study, the authors evaluate three large language models — LLaMA 3, Qwen QwQ, and OpenAI's o3-mini — against the benchmark. All three models perform worse on skill-oriented questions. Reasoning models show some advantage on scansion and literary-device tasks, but offer limited overall improvement. QwQ performs slightly better on questions posed in Latin, while LLaMA 3 and o3-mini show more task-dependent variation in performance.
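For readers who want to reproduce this kind of breakdown on their own model outputs, here is a minimal sketch of per-group scoring. It is not the authors' evaluation code: the record fields (`category`, `question_lang`, `answer`, `prediction`), the exact-match metric, and the toy records are all assumptions made for illustration, and the real dataset schema may differ.

```python
from collections import defaultdict
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip combining marks (e.g. macrons), and collapse whitespace."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.lower().split())


def accuracy_by_group(records, key):
    """Return {group: fraction of exact-match answers}, grouped by the given field."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        if normalize(rec["prediction"]) == normalize(rec["answer"]):
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records invented for illustration only; they are NOT taken from RespondeoQA.
# Field names are hypothetical.
records = [
    {"category": "knowledge", "question_lang": "en",
     "answer": "Romulus", "prediction": "romulus"},
    {"category": "knowledge", "question_lang": "la",
     "answer": "Vergilius", "prediction": "Vergilius"},
    {"category": "skill", "question_lang": "en",
     "answer": "amābat", "prediction": "amabat"},
    {"category": "skill", "question_lang": "la",
     "answer": "dactylic hexameter", "prediction": "elegiac couplet"},
]

by_category = accuracy_by_group(records, "category")
by_language = accuracy_by_group(records, "question_lang")
print(by_category)
print(by_language)
```

Stripping macrons before comparison is a design choice that suits short factual answers. For tasks such as scansion, where vowel length is the point of the question, a stricter comparison would likely be needed.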

The authors position RespondeoQA as a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and note that the dataset creation pipeline can be readily adapted for other languages. The dataset is publicly available at the SlangLab GitHub repository.

Key facts

01 RespondeoQA contains approximately 7,800 bilingual Latin-English question-answer pairs.
02 Questions are drawn from Latin pedagogical sources ranging from the 1800s to the present, including exams, quizbowl-style trivia, and textbooks.
03 The benchmark covers knowledge- and skill-based questions, multihop reasoning, constrained translation, and mixed language pairs.
04 The authors claim this is the first QA benchmark centered on Latin.
05 Three models were evaluated: LLaMA 3, Qwen QwQ, and OpenAI's o3-mini. All three performed worse on skill-oriented questions.
06 Reasoning models showed some improvement on scansion and literary-device tasks but limited overall gains.
07 QwQ performed slightly better on questions asked in Latin; LLaMA 3 and o3-mini showed more task-dependent performance.

Topics

#benchmarks #qa-systems #multilingual #llm-evaluation #dataset

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 23, 2026 · 11:04 UTC.
