DeepRoot cuts hallucinations from 87% to 7–10% in historical drug discovery
DeepRoot, a multi-agent LLM system by Zijian Carl Ma, Sean J. Wang, and Sijbren Kramer, jointly builds and queries a verified knowledge graph to extract drug-discovery leads from historical medical texts, dramatically outperforming baseline LLMs on both accuracy and reasoning quality.
DeepRoot is the first system to simultaneously achieve low hallucination rates (7–10%) and high reasoning coherence on historical medical text, demonstrating a viable path for converting pre-ontological archives into verifiable drug-discovery leads at scale.
Zijian Carl Ma, Sean J. Wang, and Sijbren Kramer present DeepRoot, a multi-agent LLM system designed to bridge the gap between pre-ontological historical medical archives and modern biomedical drug-discovery pipelines. The core challenge is that historical texts like the *Shen Nong Ben Cao Jing* use idiosyncratic taxonomies and prose that resist standardization, and no existing LLM agent approach (tool-calling, retrieval-augmented, or agentic deep-research) can reliably convert such text into verifiable drug-discovery leads at scale. DeepRoot addresses this by jointly building and querying a verified knowledge graph, demonstrating that grounding and reasoning are separable, composable axes rather than a single conflated capability.
In benchmark evaluation on the *Shen Nong Ben Cao Jing*, DeepRoot recovers 10 of 21 held-out compound-disease treatment pairs at R@20, achieving 47.6% versus 4.8% for a raw corpus LLM and approximately 2.4% for random retrieval. On a hallucination audit conducted via an LLM-as-judge framework, tool-using LLMs hallucinate evidence on 87% of claims, while DeepRoot's hallucination rate falls to just 7–10%. A graph-only inference condition achieves 0% hallucination but ranks lowest on reasoning coherence. DeepRoot's combined KG+LLM configuration is the only tested condition to achieve top performance on both factual grounding and reasoning quality, pointing toward a systematic route for mining and repurposing historical medical knowledge for modern drug discovery.
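The R@20 figures above are simple ratios over the 21 held-out pairs, and a short sketch makes the arithmetic concrete. The code below is a minimal illustration, not the authors' evaluation code: it assumes each held-out pair is scored as a hit when its target appears among the top 20 ranked candidates, and the function names and toy data (`hit_at_k`, `recall_at_k`, `make_toy_queries`) are hypothetical. The "1 of 21" reading of the 4.8% baseline is an inference from the reported percentage, not a number stated in the summary.

```python
from typing import List, Sequence, Tuple

Query = Tuple[Sequence[str], str]  # (ranked candidates, held-out target)


def hit_at_k(ranked: Sequence[str], target: str, k: int) -> bool:
    """True if the held-out target appears in the top-k ranked candidates."""
    return target in ranked[:k]


def recall_at_k(queries: List[Query], k: int) -> float:
    """Fraction of held-out pairs whose target is recovered within the top k."""
    if not queries:
        raise ValueError("need at least one held-out pair")
    hits = sum(hit_at_k(ranked, target, k) for ranked, target in queries)
    return hits / len(queries)


def make_toy_queries(n_total: int, n_hits: int, k: int) -> List[Query]:
    """Synthetic data (assumption): n_hits targets are ranked inside the top k,
    and the remaining targets are never retrieved at all."""
    queries: List[Query] = []
    for i in range(n_total):
        target = f"disease_{i}"
        distractors = [f"other_{i}_{j}" for j in range(k + 5)]
        if i < n_hits:
            # Place the target at rank 6, well inside the top k.
            ranked = distractors[:5] + [target] + distractors[5:]
        else:
            ranked = distractors
        queries.append((ranked, target))
    return queries


# 10 of 21 held-out pairs recovered, as reported for DeepRoot.
deeproot_like = recall_at_k(make_toy_queries(21, 10, 20), k=20)
# 1 of 21 recovered: an inferred reading of the 4.8% raw-corpus baseline.
baseline_like = recall_at_k(make_toy_queries(21, 1, 20), k=20)

print(f"{deeproot_like:.1%}")  # 47.6%
print(f"{baseline_like:.1%}")  # 4.8%
```

With only 21 held-out pairs, each additional recovered pair moves R@20 by roughly 4.8 percentage points, which is worth keeping in mind when comparing conditions.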
Key facts
- DeepRoot is a multi-agent LLM system that jointly builds and queries a verified knowledge graph for therapeutic reasoning over historical medical texts.
- It was applied to the *Shen Nong Ben Cao Jing*, a classical Chinese medical text.
- DeepRoot recovers 10 of 21 held-out compound-disease treatment pairs at R@20, a 47.6% rate.
- A raw corpus LLM achieves only 4.8% at R@20; random retrieval achieves ~2.4%.
- Tool-using LLMs hallucinate evidence on 87% of claims; DeepRoot reduces this to 7–10%.
- Graph-only inference achieves 0% hallucination but ranks lowest on reasoning coherence.
- DeepRoot's KG+LLM combination is the only tested condition to win on both factual grounding and reasoning coherence.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.