Jun 10, 2026 · 1 min read · Applications & Use Cases

A 4B model beats Qwen 3 235B on financial tool use for under $500

Snorkel's research team used RL fine-tuning to make a 4-billion-parameter model outperform Qwen 3 235B on financial tool-use tasks, with the training run costing under $500.

YouTube: AI Engineer

Composite score: 5.9 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The results show that targeted RL fine-tuning on high-quality, task-specific data can close, and even reverse, a 231-billion-parameter gap in model size (235B versus 4B), at a training cost under $500, on a real financial reasoning benchmark.

Summary— our read of the original

Kobie Crawford, developer advocate at Snorkel (the "Frontier AI Data Lab"), presented research demonstrating that targeted fine-tuning with high-quality data can enable a small model to outperform a much larger one on a specific class of tasks. The central example pitted Qwen 3 235B against a Snorkel-fine-tuned 4B model on a financial analysis tool-use task. Asked about YouTube's year-over-year ad revenue growth from 2023 to 2024, Qwen 3 235B queried a table that didn't exist, tried again, got nothing back both times, and hallucinated an answer. The fine-tuned 4B model, by contrast, called `get_table_name` first, inspected the schema, ran a query, hit a column error, self-corrected, and returned the correct answer. The training run behind it cost under $500. The research was conducted in partnership with UC Berkeley's RLLM Agentic Project.
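The explore-then-correct behavior described above can be illustrated with a small scripted loop. This is only a sketch under stated assumptions, not Snorkel's environment or the model's actual policy: the in-memory SQLite database, the `ad_revenue` table, the placeholder revenue figures, and the helpers `get_table_names`, `get_columns`, and `run_query` are all hypothetical. The talk names only a `get_table_name` tool, and its real signature is not given in the source. The order of steps is also simplified: here the columns are inspected only after the first query fails.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a toy database. The figures are placeholders, not real revenue data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ad_revenue (year INTEGER, revenue_musd REAL)")
    conn.executemany(
        "INSERT INTO ad_revenue VALUES (?, ?)",
        [(2023, 100.0), (2024, 110.0)],
    )
    return conn


def get_table_names(conn: sqlite3.Connection) -> list:
    """Hypothetical stand-in for the `get_table_name` tool mentioned in the talk."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [row[0] for row in rows]


def get_columns(conn: sqlite3.Connection, table: str) -> list:
    """Return column names for a table (hypothetical schema-inspection tool)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def run_query(conn: sqlite3.Connection, sql: str) -> dict:
    """Run SQL and report errors as data so the caller can react to them."""
    try:
        return {"ok": True, "rows": conn.execute(sql).fetchall()}
    except sqlite3.OperationalError as exc:
        return {"ok": False, "error": str(exc)}


def answer_yoy_growth(conn: sqlite3.Connection):
    """Scripted agent: discover the table, query, and recover from a column error."""
    trace = []
    table = get_table_names(conn)[0]  # explore first instead of guessing a table
    trace.append("list_tables")

    # First attempt guesses a column name that does not exist.
    result = run_query(conn, f"SELECT year, revenue FROM {table}")
    trace.append("query")

    if not result["ok"]:
        trace.append("error")
        columns = get_columns(conn, table)  # read the schema rather than guess again
        trace.append("inspect_columns")
        value_col = next(c for c in columns if c != "year")
        result = run_query(conn, f"SELECT year, {value_col} FROM {table}")
        trace.append("query")

    by_year = dict(result["rows"])
    growth_pct = (by_year[2024] - by_year[2023]) / by_year[2023] * 100
    return round(growth_pct, 2), trace


if __name__ == "__main__":
    growth, steps = answer_yoy_growth(make_demo_db())
    print(growth, steps)
```

The point of the sketch is the control flow: an error is treated as information that triggers another look at the schema, which is the opposite of querying a guessed table twice and then answering anyway.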

Crawford argued that tool discipline, meaning knowing how to explore and correct within a tool environment, matters more than raw reasoning depth for this class of tasks. A key finding was that training on single-table problems transferred cleanly to harder multi-table problems, lifting scores on the FinQA reasoning benchmark from 13.9% to 26.6%. The talk also emphasized breaking evaluations into rubrics to identify which specific behavior needs fixing before writing any training data, reflecting Snorkel's broader focus on expert-in-the-loop data quality as the lever for model improvement.
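One way to picture a rubric-based evaluation is as a set of named pass/fail checks applied to each agent trace, with pass rates reported per check. The sketch below is an assumption for illustration: the source does not describe Snorkel's rubric format, so the criterion names, the step labels (`list_tables`, `query`, `error`, `answer`), and the sample traces are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Trace = List[str]  # ordered step labels from one agent run (hypothetical format)


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[Trace], bool]


def explores_before_query(trace: Trace) -> bool:
    """True if the agent listed tables before issuing its first query."""
    if "list_tables" not in trace or "query" not in trace:
        return False
    return trace.index("list_tables") < trace.index("query")


def retries_after_error(trace: Trace) -> bool:
    """True if every error is followed by at least one later query."""
    return all(
        "query" in trace[i + 1:]
        for i, step in enumerate(trace)
        if step == "error"
    )


def gives_final_answer(trace: Trace) -> bool:
    """True if the run ends with an answer step."""
    return bool(trace) and trace[-1] == "answer"


RUBRIC = [
    Criterion("explores_before_query", explores_before_query),
    Criterion("retries_after_error", retries_after_error),
    Criterion("gives_final_answer", gives_final_answer),
]


def pass_rates(traces: List[Trace], rubric: List[Criterion] = RUBRIC) -> Dict[str, float]:
    """Fraction of traces that pass each criterion, keyed by criterion name."""
    return {
        c.name: sum(c.check(t) for t in traces) / len(traces)
        for c in rubric
    }


if __name__ == "__main__":
    sample_traces = [
        ["list_tables", "query", "error", "query", "answer"],
        ["query", "error", "query", "error", "answer"],  # guesses, then gives up
    ]
    for name, rate in pass_rates(sample_traces).items():
        print(f"{name}: {rate:.0%}")
```

A single accuracy number would hide the difference between these two traces if both happened to produce an answer. Per-criterion pass rates show which behavior is failing, which is the information needed before deciding what training data to write.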

Key facts

01 Snorkel fine-tuned a 4B model with RL to outperform Qwen 3 235B on a financial tool-use task
02 The training run cost under $500
03 Qwen 3 235B hallucinated an answer after querying a non-existent table twice; the 4B model self-corrected and got the right answer
04 The 4B model called `get_table_name` first to inspect the schema before querying
05 Single-table training transferred to multi-table problems, improving FinQA benchmark scores from 13.9% to 26.6%
06 The research was conducted in partnership with UC Berkeley's RLLM Agentic Project
07 Crawford advocates breaking evals into rubrics to identify which specific behavior to fix before writing training data
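For scale, the headline numbers above work out as follows. This is plain arithmetic on the figures quoted in this summary (4B and 235B parameters, FinQA scores of 13.9% and 26.6%); the variable names are illustrative.

```python
# Parameter counts in billions, as quoted in the summary.
small_params_b = 4
large_params_b = 235

param_gap_b = large_params_b - small_params_b   # absolute gap in billions
size_ratio = large_params_b / small_params_b    # how many times larger

# FinQA scores in percent, before and after single-table training.
finqa_before = 13.9
finqa_after = 26.6

gain_points = round(finqa_after - finqa_before, 1)    # percentage-point gain
relative_gain = round(finqa_after / finqa_before, 2)  # multiple of the baseline

print(param_gap_b, size_ratio, gain_points, relative_gain)
```

The gap is 231 billion parameters, a size ratio of 58.75 to 1. The FinQA score rises by 12.7 percentage points, which is roughly 1.91 times the starting score.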

Topics

#fine-tuning #reasoning #tool-use #benchmarks #open-source

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.
