A 4B model beats Qwen 3 235B on financial tool use for under $500
Snorkel's research team used RL fine-tuning to make a 4-billion-parameter model outperform Qwen 3 235B on financial tool-use tasks, with the training run costing under $500.
The results show that targeted RL fine-tuning on high-quality, task-specific data can close, and even reverse, a 231-billion-parameter gap in model size on a real financial reasoning benchmark, at a training cost under $500.
Kobie Crawford, developer advocate at Snorkel (the "Frontier AI Data Lab"), presented research showing that targeted fine-tuning on high-quality data can let a small model outperform a much larger one on a specific class of tasks. The central example pitted Qwen 3 235B against a Snorkel-fine-tuned 4B model on a financial analysis tool-use task: a question about YouTube's year-over-year ad revenue growth from 2023 to 2024. Qwen 3 235B queried a table that didn't exist, retried, got nothing back either time, and then hallucinated an answer. The fine-tuned 4B model instead called `get_table_name` first, inspected the schema, ran a query, hit a column error, corrected itself, and returned the correct answer. Its training run cost under $500. The research was conducted in partnership with UC Berkeley's RLLM Agentic Project.
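The explore-then-correct behavior in that example can be sketched in a few lines. This is a minimal illustration, not Snorkel's environment or the model's actual policy: the "agent" is a hand-scripted function rather than an LLM, the tools run against an in-memory SQLite database, and the table name `ad_revenue`, the column `revenue_musd`, the helper `get_columns`, and the revenue values are all hypothetical placeholders. Only the tool name `get_table_name` comes from the talk, and its signature here is an assumption.

```python
import sqlite3


def make_db():
    """Build a toy database. Table, column, and values are placeholders, not real figures."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ad_revenue (year INTEGER, revenue_musd REAL)")
    conn.executemany(
        "INSERT INTO ad_revenue VALUES (?, ?)", [(2023, 100.0), (2024, 115.0)]
    )
    return conn


def get_table_name(conn):
    """Tool: list the tables that actually exist (signature is an assumption)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def get_columns(conn, table):
    """Tool (hypothetical): column names; PRAGMA table_info puts the name at index 1."""
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]


def run_query(conn, sql, params=()):
    """Tool: run SQL and report errors as data instead of raising."""
    try:
        return {"ok": True, "rows": conn.execute(sql, params).fetchall()}
    except sqlite3.OperationalError as exc:
        return {"ok": False, "error": str(exc)}


def yoy_growth(conn, start_year, end_year):
    """Scripted stand-in for the agent: explore first, retry on a column error."""
    trace = []
    tables = get_table_name(conn)
    trace.append(("get_table_name", tables))
    if not tables:
        raise LookupError("no tables available; refusing to guess an answer")
    table = tables[0]

    # First attempt guesses a plausible but wrong column name.
    sql = f'SELECT year, revenue FROM "{table}" WHERE year IN (?, ?)'
    result = run_query(conn, sql, (start_year, end_year))
    trace.append(("run_query", result["ok"]))

    if not result["ok"]:
        # Self-correction: read the schema, then retry with a real column.
        columns = get_columns(conn, table)
        trace.append(("get_columns", columns))
        value_col = next(col for col in columns if col != "year")
        sql = f'SELECT year, "{value_col}" FROM "{table}" WHERE year IN (?, ?)'
        result = run_query(conn, sql, (start_year, end_year))
        trace.append(("run_query", result["ok"]))

    if not result["ok"] or len(result["rows"]) != 2:
        raise LookupError("data not found; refusing to guess an answer")

    by_year = dict(result["rows"])
    growth = (by_year[end_year] - by_year[start_year]) / by_year[start_year]
    return growth, trace
```

Calling `yoy_growth(make_db(), 2023, 2024)` returns the growth rate along with a trace of the tool calls. The design point is the final guard: when the data cannot be found, the function raises instead of producing a number, which is the opposite of the hallucination failure described above.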
Crawford argued that tool discipline, meaning knowing how to explore and correct within a tool environment, matters more than raw reasoning depth for this class of tasks. A key finding was that training on single-table problems transferred cleanly to harder multi-table problems, lifting scores on the FinQA reasoning benchmark from 13.9% to 26.6%, a gain of 12.7 percentage points. The talk also emphasized breaking evaluations into rubrics to identify which specific behavior needs fixing before writing any training data, reflecting Snorkel's broader focus on expert-in-the-loop data quality as the lever for model improvement.
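One way to picture a rubric-based evaluation is as a set of independent pass/fail checks over each agent trace, tallied to show which behavior fails most often. The sketch below is only an assumption about what such a rubric might look like: the `Step` record, the three criteria, and the tool name `run_query` are hypothetical and are not taken from Snorkel's rubrics.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str        # tool name, or "final_answer" for the model's reply
    ok: bool = True  # False if the tool call returned an error or nothing


def explored_first(trace):
    """Pass if the first action inspects the environment."""
    return bool(trace) and trace[0].tool == "get_table_name"


def recovered_after_error(trace):
    """Pass if every failed step is followed by a successful query."""
    for i, step in enumerate(trace):
        later = trace[i + 1:]
        if not step.ok and not any(s.tool == "run_query" and s.ok for s in later):
            return False
    return True


def answer_grounded(trace):
    """Pass if a successful query precedes the final answer."""
    for i, step in enumerate(trace):
        if step.tool == "final_answer":
            return any(s.tool == "run_query" and s.ok for s in trace[:i])
    return False


# Hypothetical rubric: criterion name -> check function.
RUBRIC = {
    "explored_first": explored_first,
    "recovered_after_error": recovered_after_error,
    "answer_grounded": answer_grounded,
}


def score_trace(trace):
    """Score one trace against every rubric criterion."""
    return {name: check(trace) for name, check in RUBRIC.items()}


def failure_counts(traces):
    """Count failures per criterion across traces to find the weakest behavior."""
    counts = Counter()
    for trace in traces:
        for name, passed in score_trace(trace).items():
            if not passed:
                counts[name] += 1
    return counts
```

Running `failure_counts` over a batch of traces and reading off the most frequent failure gives a concrete target for new training data, rather than a single aggregate accuracy number that hides which behavior broke.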
Key facts
- 01 Snorkel fine-tuned a 4B model with RL to outperform Qwen 3 235B on a financial tool-use task
- 02 The training run cost under $500
- 03 Qwen 3 235B hallucinated an answer after querying a non-existent table twice; the 4B model self-corrected and got the right answer
- 04 The 4B model called `get_table_name` first to inspect the schema before querying
- 05 Single-table training transferred to multi-table problems, improving FinQA benchmark scores from 13.9% to 26.6%
- 06 The research was conducted in partnership with UC Berkeley's RLLM Agentic Project
- 07 Crawford advocates breaking evals into rubrics to identify which specific behavior to fix before writing training data
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.