AgentHarness open-sources eval framework and small verifier sub-agent weights
The Apodex team released Apodex 1.0 alongside AgentHarness, an open-source eval/orchestration framework for long-horizon agent loops, and a family of small open-weight verifier models including the Smol Series (0.8B, 2B, 4B) and Apodex-1.0-mini (35B-A3B).
Score breakdown
AgentHarness introduces a concrete open-source pattern for separating verification from the main reasoning model in long-horizon agent loops, backed by small purpose-built verifier models. The 4B model reportedly outperforms the much larger OpenSeeker-v2-30B-SFT on BrowseComp and BrowseComp-ZH.
The Apodex team announced Apodex 1.0 in a post to r/LangChain, simultaneously open-sourcing AgentHarness and a family of small open-weight models intended to serve as verifier sub-agents within larger agent stacks. AgentHarness is the internal framework Apodex uses to run and evaluate long-horizon agent workflows, with a focus on preventing episodes from drifting into uncontrolled multi-step spirals. It tracks each step, supports plugging in small local models as verifier sub-agents, and lets those verifiers validate structured tool calls before they are dispatched to a real API. For LangGraph users, the post describes it as something that sits alongside an existing graph to provide reproducible eval and per-step verification, rather than as a replacement for the graph.
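The step tracking and pre-dispatch check described above can be sketched in plain Python. This is a minimal illustration and not the AgentHarness API: `TOOL_SCHEMAS`, `StepRecord`, `verify_tool_call`, and `run_episode` are hypothetical names, and a hand-written structural check stands in for a small verifier model.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "search": {"query": str, "max_results": int},
}


@dataclass
class StepRecord:
    """One tracked step: the proposed call and the verifier's decision."""
    index: int
    tool_call: dict[str, Any]
    accepted: bool
    reason: str


def verify_tool_call(call: dict[str, Any]) -> tuple[bool, str]:
    """Stand-in for a verifier sub-agent: a purely structural check."""
    name = call.get("name")
    if name not in TOOL_SCHEMAS:
        return False, f"unknown tool: {name!r}"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be an object"
    schema = TOOL_SCHEMAS[name]
    for key, expected in schema.items():
        if key not in args:
            return False, f"missing argument: {key}"
        if not isinstance(args[key], expected):
            return False, f"argument {key} should be {expected.__name__}"
    extra = set(args) - set(schema)
    if extra:
        return False, f"unexpected arguments: {sorted(extra)}"
    return True, "ok"


def run_episode(
    proposed_calls: list[dict[str, Any]],
    dispatch: Callable[[dict[str, Any]], Any],
    max_steps: int = 5,
) -> list[StepRecord]:
    """Record every step; dispatch only calls the verifier accepts."""
    trace: list[StepRecord] = []
    for index, call in enumerate(proposed_calls):
        if index >= max_steps:  # hard step budget so the loop cannot spiral
            break
        accepted, reason = verify_tool_call(call)
        trace.append(StepRecord(index, call, accepted, reason))
        if accepted:
            dispatch(call)  # only verified calls reach the real API
    return trace
```

In this sketch a rejected call is logged in the trace but never dispatched, and the step budget caps the episode length regardless of what the upstream model proposes. How AgentHarness itself handles rejections or budgets is not stated in the post.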
The core architectural pattern separates verification from the main reasoning model. Rather than relying on a single ReAct loop to self-reflect within one context window, a dedicated verification team composed of smaller models audits claims produced by the upstream LLM, confirms tool calls are structurally sound before dispatch, and reconciles conflicting evidence. To support this pattern, Apodex released open weights for the Apodex-1.0 Smol Series (0.8B, 2B, and 4B), which are SFT-tuned for verification and tool-call checking rather than chat, as well as Apodex-1.0-mini (35B-A3B) as a heavier self-hostable option. All models run on the same runtime, AgentOS. Notably, Apodex-1.0-4B-SFT, trained on a deep-research mixture, is reported to score 48.8 on BrowseComp and 63.5 on BrowseComp-ZH, surpassing OpenSeeker-v2-30B-SFT (46.0 and 58.1 respectively) despite being a fraction of the size.
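As a rough illustration of the "verification team" idea, the sketch below has several independent verifiers audit one claim and reconciles their verdicts by strict majority. Everything here is an assumption made for illustration: the post does not say how Apodex reconciles conflicting evidence, and `Audit`, `reconcile`, `audit_claim`, and `substring_verifier` are hypothetical names. In practice each verifier would wrap a call to a small verifier model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

# A verdict is one of "support", "refute", or "unsure".
Verdict = str
# A verifier sees the claim and the evidence, never the main model's context.
Verifier = Callable[[str, list[str]], Verdict]


@dataclass
class Audit:
    claim: str
    verdicts: list[Verdict]
    decision: Verdict


def substring_verifier(claim: str, evidence: list[str]) -> Verdict:
    """Toy stand-in for a verifier model: supports a claim found in evidence."""
    found = any(claim.lower() in item.lower() for item in evidence)
    return "support" if found else "unsure"


def reconcile(verdicts: list[Verdict]) -> Verdict:
    """Assumed policy: accept a verdict only with a strict majority."""
    if not verdicts:
        return "unsure"
    top, count = Counter(verdicts).most_common(1)[0]
    return top if count * 2 > len(verdicts) else "unsure"


def audit_claim(claim: str, evidence: list[str], verifiers: list[Verifier]) -> Audit:
    """Ask every verifier independently, then reconcile their verdicts."""
    verdicts = [verify(claim, evidence) for verify in verifiers]
    return Audit(claim, verdicts, reconcile(verdicts))
```

The design point is that each verifier runs outside the main model's reasoning loop, so a disagreement surfaces as an explicit "unsure" decision instead of being smoothed over by self-reflection in a single context window.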
Key facts
- Apodex 1.0 was released alongside the open-sourcing of AgentHarness and a family of small open-weight verifier models.
- AgentHarness is an eval/orchestration framework designed to prevent long-horizon agent loops from drifting into uncontrolled multi-step spirals.
- The framework tracks each step and allows small local models to validate structured tool calls before they hit a real API.
- The Apodex-1.0 Smol Series includes 0.8B, 2B, and 4B models tuned for verification and tool-call checking rather than general chat.
- Apodex-1.0-mini (35B-A3B) is offered as a heavier self-hostable verifier option; all models run on the AgentOS runtime.
- Apodex-1.0-4B-SFT is reported to score 48.8 on BrowseComp and 63.5 on BrowseComp-ZH, beating OpenSeeker-v2-30B-SFT (46.0 and 58.1) despite being far smaller.
- The post solicits feedback from LangGraph builders on how per-step verifier nodes would fit into their pipelines.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.