RespondeoQA debuts as first QA benchmark centered on Latin
Marisa Hudspeth, Patrick J. Burns, and Brendan O'Connor introduce RespondeoQA, the first question-answering benchmark centered on Latin, containing ~7,800 bilingual Latin-English question-answer pairs drawn from pedagogical sources spanning the 1800s to the present.
Practitioners building or evaluating LLMs for low-resource or classical languages can use RespondeoQA as a concrete benchmark to probe model weaknesses in skill-based linguistic tasks, and adapt its creation pipeline for other underrepresented languages.
Marisa Hudspeth, Patrick J. Burns, and Brendan O'Connor introduce RespondeoQA, which they describe as the first QA benchmark centered on Latin. The dataset contains approximately 7,800 question-answer pairs drawn from Latin pedagogical sources spanning the 1800s to the present, including exams, quizbowl-style trivia, and textbooks. After automated extraction and cleaning followed by manual review, the benchmark covers a diverse range of question types: knowledge- and skill-based questions, multihop reasoning, constrained translation, and mixed language pairs across Latin and English.
As a case study, the authors evaluate three large language models — LLaMA 3, Qwen QwQ, and OpenAI's o3-mini — against the benchmark. All three models perform worse on skill-oriented questions. Reasoning models show some advantage on scansion and literary-device tasks, but offer limited overall improvement. QwQ performs slightly better on questions posed in Latin, while LLaMA 3 and o3-mini show more task-dependent variation in performance.
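The comparison above depends on breaking accuracy down by question type and by the language a question is asked in. A minimal sketch of that kind of breakdown follows, assuming a hypothetical record layout with `category`, `question_lang`, and `correct` fields; the actual RespondeoQA schema and the authors' scoring code may differ, and the toy records are made up for illustration, not taken from the paper.

```python
from collections import defaultdict

# Hypothetical graded records: one per question, already scored.
# Field names and values are illustrative assumptions, not the real schema.
toy_records = [
    {"category": "knowledge", "question_lang": "en", "correct": True},
    {"category": "knowledge", "question_lang": "la", "correct": True},
    {"category": "skill", "question_lang": "en", "correct": False},
    {"category": "skill", "question_lang": "la", "correct": True},
]


def accuracy_by(records, key):
    """Group records by `key` and return {group: fraction answered correctly}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


by_category = accuracy_by(toy_records, "category")
by_language = accuracy_by(toy_records, "question_lang")
```

Running the same function once per model and comparing the resulting dictionaries would surface gaps such as lower accuracy on skill-oriented questions or a difference between Latin and English prompts.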
The authors position RespondeoQA as a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and note that the dataset creation pipeline can be readily adapted for other languages. The dataset is publicly available at the SlangLab GitHub repository.
Key facts
- RespondeoQA contains approximately 7,800 bilingual Latin-English question-answer pairs.
- Questions are drawn from Latin pedagogical sources ranging from the 1800s to the present, including exams, quizbowl-style trivia, and textbooks.
- The benchmark covers knowledge- and skill-based questions, multihop reasoning, constrained translation, and mixed language pairs.
- The authors claim this is the first QA benchmark centered on Latin.
- Three models were evaluated (LLaMA 3, Qwen QwQ, and OpenAI's o3-mini); all performed worse on skill-oriented questions.
- Reasoning models showed some improvement on scansion and literary-device tasks but limited overall gains.
- QwQ performed slightly better on questions asked in Latin; LLaMA 3 and o3-mini showed more task-dependent performance.