DiZiNER framework closes zero-shot NER gap using LLM disagreement
DiZiNER, a new framework by Siun Kim and Hyung-Jin Yoon, simulates human pilot annotation by having multiple LLMs annotate text and a supervisor model refine instructions based on inter-model disagreements, achieving zero-shot state-of-the-art NER results on 14 of 18 benchmarks.
Teams building zero-shot information extraction pipelines can adopt DiZiNER's disagreement-guided instruction refinement to narrow the gap with supervised NER systems without labeled training data.
Siun Kim and Hyung-Jin Yoon present DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework designed to close the persistent performance gap between zero-shot and supervised NER systems. The authors observe that recurring errors in LLM-based NER mirror the inconsistencies seen in early-stage human annotation, where disagreements among annotators are resolved through a structured pilot annotation phase. DiZiNER operationalizes this analogy by deploying multiple heterogeneous LLMs as annotators on shared texts, then using a supervisor model to analyze inter-model disagreements and iteratively refine the task instructions.
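The refinement loop described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`collect_disagreements`, `refine_instructions`), the data shapes, the stopping rule, and the stub annotators and supervisor are all assumptions made for the example. In a real pipeline each stub would be replaced by a call to an actual LLM.

```python
from typing import Callable, Dict, List, Set, Tuple

Entity = Tuple[str, str]                       # (mention text, entity type)
Annotator = Callable[[str, str], Set[Entity]]  # (instructions, text) -> entities
Supervisor = Callable[[str, List[dict]], str]  # (instructions, disagreements) -> new instructions


def collect_disagreements(instructions: str, texts: List[str],
                          annotators: Dict[str, Annotator]) -> List[dict]:
    """Return one record per text on which the annotators did not all agree."""
    records = []
    for text in texts:
        outputs = {name: fn(instructions, text) for name, fn in annotators.items()}
        all_entities = set().union(*outputs.values())
        shared = set.intersection(*outputs.values())
        contested = all_entities - shared  # entities proposed by some but not all
        if contested:
            records.append({"text": text, "contested": sorted(contested),
                            "outputs": outputs})
    return records


def refine_instructions(instructions: str, texts: List[str],
                        annotators: Dict[str, Annotator], supervisor: Supervisor,
                        max_rounds: int = 3) -> Tuple[str, List[int]]:
    """Let the supervisor rewrite the instructions until disagreements vanish."""
    history = []  # number of contested texts per round
    for _ in range(max_rounds):
        disagreements = collect_disagreements(instructions, texts, annotators)
        history.append(len(disagreements))
        if not disagreements:
            break
        instructions = supervisor(instructions, disagreements)
    return instructions, history


# --- Hypothetical stubs standing in for real LLM calls ---
RULE = "Titles such as 'Dr.' are not part of PERSON mentions."


def annotator_a(instructions: str, text: str) -> Set[Entity]:
    # Stub: always excludes the title from the mention.
    return {("Kim", "PERSON")} if "Kim" in text else set()


def annotator_b(instructions: str, text: str) -> Set[Entity]:
    # Stub: includes the title unless the instructions rule it out.
    if "Kim" not in text:
        return set()
    mention = "Kim" if RULE in instructions else "Dr. Kim"
    return {(mention, "PERSON")}


def toy_supervisor(instructions: str, disagreements: List[dict]) -> str:
    # Stub: a real supervisor would read the contested spans and write a rule.
    return instructions + " " + RULE


final_instructions, history = refine_instructions(
    "Label PERSON entities.",
    ["Dr. Kim presented the results."],
    {"a": annotator_a, "b": annotator_b},
    toy_supervisor,
)
print(history)
print(final_instructions)
```

In this toy run the two stub annotators disagree on whether the title belongs to the mention, the stub supervisor appends a clarifying rule, and the second round produces no contested entities. How the actual system represents disagreements and decides when to stop is not specified in this summary.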
Evaluated across 18 NER benchmarks, DiZiNER achieves zero-shot state-of-the-art results on 14 datasets, improving over prior best results by 8.0 F1 and narrowing the zero-shot-to-supervised performance gap by more than 11 points. A key finding is that DiZiNER consistently outperforms its supervisor model, GPT-5 mini, which the authors interpret as evidence that the gains are attributable to the disagreement-guided instruction refinement mechanism rather than the underlying capacity of any single model. This interpretation is further supported by a strong observed correlation between pairwise inter-model agreement and NER performance across benchmarks.
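To make the agreement-versus-performance relationship concrete, the sketch below computes a mean pairwise agreement score across annotators and a Pearson correlation between two lists of scores. The summary does not say which agreement metric the authors use, so set-level F1 between two annotators' entity sets is an assumption here, as are the helper names. The numbers passed to `pearson` are placeholders chosen for illustration and are not results from the paper.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence, Set, Tuple

Entity = Tuple[str, str]  # (mention text, entity type)


def pairwise_f1(a: Set[Entity], b: Set[Entity]) -> float:
    """F1 overlap between two annotators' entity sets (assumed agreement metric)."""
    if not a and not b:
        return 1.0  # both found nothing: treat as full agreement
    return 2 * len(a & b) / (len(a) + len(b))


def mean_pairwise_agreement(outputs: List[Set[Entity]]) -> float:
    """Average F1 over every unordered pair of annotators (needs at least two)."""
    pairs = list(combinations(outputs, 2))
    return sum(pairwise_f1(a, b) for a, b in pairs) / len(pairs)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Three hypothetical annotators on one text: two agree fully, one misses an entity.
outputs = [
    {("Kim", "PERSON"), ("Seoul", "LOC")},
    {("Kim", "PERSON"), ("Seoul", "LOC")},
    {("Kim", "PERSON")},
]
agreement = mean_pairwise_agreement(outputs)

# Placeholder per-benchmark values, NOT figures from the paper.
toy_agreement = [0.2, 0.4, 0.6]
toy_f1 = [0.3, 0.5, 0.7]
r = pearson(toy_agreement, toy_f1)
print(round(agreement, 4), round(r, 4))
```

Set-level F1 is symmetric in its two arguments, which makes it a convenient agreement measure when no annotator is treated as the reference. Other choices, such as token-level agreement or a chance-corrected statistic, would plug into `mean_pairwise_agreement` the same way.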
Key facts
- DiZiNER simulates the human pilot annotation process using multiple heterogeneous LLMs as annotators and a supervisor model to resolve disagreements.
- The supervisor model analyzes inter-model disagreements to iteratively refine task instructions for zero-shot NER.
- DiZiNER achieves zero-shot state-of-the-art results on 14 out of 18 NER benchmarks.
- It improves prior best zero-shot results by 8.0 F1.
- It narrows the zero-shot-to-supervised performance gap by more than 11 points.
- DiZiNER consistently outperforms its supervisor model, GPT-5 mini, which the authors interpret as evidence that the gains come from the refinement process rather than model capacity.
- Pairwise inter-model agreement correlates strongly with NER performance across benchmarks, supporting the disagreement-guided approach.