
Commit 250ed20

Merge pull request #22 from ScalingIntelligence/tanvir_dev
3 New papers
2 parents 5e2548d + df2f54f commit 250ed20

File tree

12 files changed, +118 -0 lines changed

_pubs/monkeyspower.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
---
title: 'How Do Large Language Monkeys Get Their Power (Laws)?'
authors:
  - key: rylanschaeffer
    affiliation: University of California, Berkeley
  - key: joshuakazdan
    affiliation: Stanford University
  - key: johnhughes
    affiliation: University of Cambridge
  - name: Jordan Juravsky
    affiliation: University of California, Berkeley
  - name: Sara Price
    affiliation: University of Cambridge
  - name: Aengus Lynch
    affiliation: University of Cambridge
  - name: Erik Jones
    affiliation: University of Toronto
  - name: Robert Kirk
    affiliation: University of Cambridge
  - key: azaliamirhoseini
    affiliation: Google DeepMind
  - name: Sanmi Koyejo
    affiliation: Stanford University
venue: arXiv preprint
year: 2025
month: February
day: 24
has_pdf: true
doi: 10.48550/arXiv.2502.17578
tags:
  - machine learning
  - scaling laws
  - generative AI
  - inference compute
teaser: >
  Recent research documents a surprising finding: increasing inference compute through repeated sampling reveals power-law scaling in average success rates. This occurs despite per-problem exponential scaling, explained by heavy-tailed distributions of single-attempt success rates. Our analysis unifies these observations under a mathematical framework, offering new insights into inference-time scaling laws and more efficient performance forecasting for language models.
materials:
  - name: Paper
    url: https://arxiv.org/abs/2502.17578
    type: file-pdf
---
Recent research across mathematical problem solving, proof assistant programming, and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that, on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed, such that a small fraction of tasks with extremely low success probabilities collectively warps the aggregate success trend into a power law, even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power-law scaling and provides a simple method for forecasting the power-law exponent with an order of magnitude lower relative error, or equivalently, ~2-4 orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves as inference compute scales, and to the development of scaling-predictable evaluations of (multimodal) language models.
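To make the mechanism concrete, here is a minimal simulation sketch (illustrative only, not code from the paper; the Beta(0.5, 2) prior over single-attempt success probabilities is an assumption standing in for a benchmark whose probability density behaves polynomially near zero):

```python
# A minimal sketch (illustrative, not code from the paper): per-problem failure
# falls exponentially in the number of attempts k, i.e. (1 - p_i)^k, yet after
# averaging over a heavy-tailed distribution of single-attempt success
# probabilities p_i, -log(average pass@k) follows an approximate power law in k.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.5, 2.0                         # assumed prior; density ~ p^(alpha-1) near 0
p = rng.beta(alpha, beta, size=200_000)        # single-attempt success prob per problem

ks = np.unique(np.logspace(2, 5, 15).astype(int))   # attempts per problem (large-k regime)
avg_pass_at_k = np.array([(1.0 - (1.0 - p) ** k).mean() for k in ks])

# Aggregate scaling: -log(average success) ~ C * k^(-alpha), a line in log-log space.
neg_log = -np.log(avg_pass_at_k)
slope = np.polyfit(np.log(ks), np.log(neg_log), 1)[0]
print(f"fitted log-log slope: {slope:.3f} (expected ~ -{alpha} for large k)")
```

In this toy setup the fitted exponent tracks how much probability mass sits near zero single-attempt success, which is consistent with the abstract's point that estimating this distribution offers a far cheaper way to forecast the power-law exponent than brute-force repeated sampling.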

_pubs/sdgrl.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
title: 'Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use'
authors:
  - name: Anna Goldie
    affiliation: Stanford University
    email: agoldie@cs.stanford.edu
    equal: true
  - name: Azalia Mirhoseini
    affiliation: Stanford University
    email: azalia@cs.stanford.edu
    equal: true
  - name: Hao Zhou
    affiliation: Google DeepMind
  - name: Irene Cai
    affiliation: Google DeepMind
  - name: Christopher D. Manning
    affiliation: Stanford University
venue: arXiv preprint
year: 2025
month: April
day: 28
has_pdf: true
doi: 10.48550/arXiv.2504.04736
tags:
  - reinforcement learning
  - language models
  - reasoning
  - tool use
  - synthetic data
  - generative AI
teaser: >
  This paper introduces Step-Wise Reinforcement Learning (SWiRL), a framework for improving multi-step reasoning and tool use in language models through synthetic data generation and offline RL. SWiRL decomposes reasoning trajectories into sub-trajectories, enabling fine-grained feedback and significant accuracy improvements across challenging tasks like HotPotQA, GSM8K, MuSiQue, CofCA, and BeerQA. Notably, SWiRL-trained models outperform larger proprietary models in multi-step reasoning while demonstrating strong task generalization and improved cost efficiency.
materials:
  - name: Paper
    url: https://arxiv.org/abs/2504.04736
    type: file-pdf
  - name: Code Repository
    url: https://github.com/ScalingIntelligence/swirl_rl
    type: code
  - name: Synthetic Multi-Step Reasoning Dataset
    url: https://huggingface.co/datasets/ScalingIntelligence/swirl_synthetic_data
    type: database
---
Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches such as RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning, and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool-use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluated SWiRL on a number of multi-step tool-use, question-answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question answering) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
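As a rough illustration of the step-wise decomposition described above (a sketch under assumptions, not the released SWiRL code; the Step type, field names, and the simple outcome-based filter are hypothetical):

```python
# A minimal sketch (assumption, not the SWiRL implementation): decompose each
# multi-step trajectory into per-action sub-trajectories, filter them, and emit
# (context, target) pairs that an offline RL / fine-tuning step could consume.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str      # model reasoning text at this step
    action: str       # e.g. a tool call or the final answer

def decompose(question: str, steps: list[Step]) -> list[dict]:
    """One sub-trajectory per action: the context is the question plus all
    earlier steps; the target is the action taken at this step."""
    examples = []
    for i, step in enumerate(steps):
        context = question + "".join(
            f"\n[step {j}] {s.thought}\n[action {j}] {s.action}"
            for j, s in enumerate(steps[:i])
        )
        examples.append({"context": context, "target": step.action})
    return examples

def filter_by_outcome(examples: list[dict], final_answer_correct: bool) -> list[dict]:
    # Simplest possible filter for illustration: keep sub-trajectories only when
    # the full trajectory reached a correct final answer.
    return examples if final_answer_correct else []
```

The sketch only shows how one trajectory fans out into per-action training examples; in the paper, synthetic data filtering and RL optimization are then applied to these sub-trajectories.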

_pubs/tpt.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
---
title: 'Think, Prune, Train, Improve: Scaling Reasoning Without Scaling Models'
authors:
  - name: Caia Costello
    affiliation: Stanford University / Ceramic AI
    email: caia@stanford.edu
  - name: Simon Guo
    affiliation: Stanford University
  - name: Anna Goldie
    affiliation: Stanford University
  - key: azaliamirhoseini
    affiliation: Google DeepMind
venue: arXiv preprint
year: 2025
month: April
day: 25
has_pdf: true
doi: 10.48550/arXiv.2504.18116
tags:
  - machine learning
  - language models
  - reasoning
  - self-improvement
  - generative AI
teaser: >
  This paper introduces the Think, Prune, Train (TPT) framework, a scalable method for improving language model reasoning without increasing model size. By iteratively fine-tuning models on their own reasoning traces and applying correctness-based pruning, TPT enables smaller models to achieve performance rivaling or exceeding larger ones. Experimental results on GSM8K and CodeContests show that models like Gemma2-2B and LLaMA-3.1-70B improve substantially through recursive self-improvement, with the latter surpassing even GPT-4o on Pass@1 accuracy.
materials:
  - name: Paper
    url: https://arxiv.org/abs/2504.18116
    type: file-pdf
---
32+
Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by limited high-quality training data. Synthetic data can be leveraged to enhance fine-tuning outcomes, but several factors influence this process, including model size, synthetic data volume, pruning strategy, and number of fine-tuning rounds. We explore these axes and investigate which conditions enable model self-improvement. We introduce the Think, Prune, Train process, a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data. This approach yields improved performance: on GSM8K, Gemma2-2B achieves a Pass@1 of 57.6% (up from 41.9%), Gemma2-9B reaches 82%, matching LLaMA-3.1-70B, and LLaMA-3.1-70B attains 91%, even surpassing GPT-4o. These results demonstrate the effectiveness of self-generated reasoning and systematic data selection for improving LLM capabilities.
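The loop described above can be sketched in a few lines (an illustration under assumptions, not the paper's implementation; `generate`, `is_correct`, and `finetune` are hypothetical stand-ins for sampling from the current model, checking an answer against ground truth, and supervised fine-tuning):

```python
# A minimal sketch of a Think-Prune-Train style loop (illustrative assumptions,
# not the paper's code).
from typing import Callable

def tpt_round(model,
              problems: list[dict],                     # each: {"question", "answer"}
              generate: Callable[[object, str, int], list[str]],
              is_correct: Callable[[str, str], bool],
              finetune: Callable[[object, list[dict]], object],
              samples_per_problem: int = 4):
    kept = []
    for prob in problems:
        # Think: sample several reasoning traces from the current model.
        for trace in generate(model, prob["question"], samples_per_problem):
            # Prune: keep only traces whose final answer matches ground truth.
            if is_correct(trace, prob["answer"]):
                kept.append({"prompt": prob["question"], "completion": trace})
    # Train: fine-tune on the surviving self-generated traces.
    return finetune(model, kept)

def tpt(model, problems, generate, is_correct, finetune, rounds: int = 3):
    for _ in range(rounds):   # each round trains on freshly generated traces
        model = tpt_round(model, problems, generate, is_correct, finetune)
    return model
```

Each round trains only on traces whose final answers match ground truth; that correctness-based pruning is what the abstract credits with keeping the self-generated training data high quality across iterations.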

imgs/teasers/monkeyspower.png

150 KB

imgs/teasers/sdgrl.png

190 KB

imgs/teasers/tpt.png

34.1 KB

imgs/thumbs/monkeyspower.png

150 KB

imgs/thumbs/sdgrl.png

190 KB

imgs/thumbs/tpt.png

34.1 KB

pubs/monkeyspower.pdf

1.86 MB
Binary file not shown.
