Commit cd93877

Merge pull request #25 from ScalingIntelligence/tanvir_dev
Better Teasers and Post Ruby Update Changes
2 parents: cd9c4e1 + 3707868

3 files changed: +13 additions, -15 deletions

_pubs/monkeyspower.md

Lines changed: 7 additions & 5 deletions
@@ -1,11 +1,11 @@
 ---
 title: 'How Do Large Language Monkeys Get Their Power (Laws)?'
 authors:
-  - name: rylanschaeffer
+  - name: Rylan Schaeffer
     affiliation: University of California, Berkeley
-  - name: joshuakazdan
+  - name: Joshua Kazdan
     affiliation: Stanford University
-  - name: johnhughes
+  - name: John Hughes
     affiliation: University of Cambridge
   - name: Jordan Juravsky
     affiliation: University of California, Berkeley
@@ -20,16 +20,18 @@ authors:
   - key: azaliamirhoseini
   - name: Sanmi Koyejo
     affiliation: Stanford University
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-02-24
 month: February
 day: 24
 has_pdf: true
 doi: 10.48550/arXiv.2502.17578
 tags:
   - machine learning
   - generative ai
-teaser: Recent research documents a surprising finding: increasing inference compute through repeated sampling reveals power-law scaling in average success rates. This occurs despite per-problem exponential scaling, explained by heavy-tailed distributions of single-attempt success rates. Our analysis unifies these observations under a mathematical framework, offering new insights into inference-time scaling laws and more efficient performance forecasting for language models.
+slug: monkeyspower
+teaser: We explain how language models exhibit power-law scaling in success rates despite per-problem exponential scaling, revealing that heavy-tailed distributions of success probabilities drive this phenomenon and enabling more efficient performance forecasting.
 materials:
   - name: Paper
     url: https://arxiv.org/abs/2502.17578
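
The new teaser condenses the paper's central mechanism: each individual problem's success rate under repeated sampling approaches 1 exponentially, yet the average over problems follows a power law because single-attempt success rates are heavy-tailed. As a quick illustration (not code from the paper; the Beta(0.3, 3) distribution below is an assumed stand-in for that heavy tail), a few lines of NumPy reproduce the effect:

import numpy as np

rng = np.random.default_rng(0)

# Assumed heavy-tailed distribution of single-attempt success rates:
# Beta(0.3, 3) puts most of its mass near zero, i.e. most problems are hard.
p = rng.beta(0.3, 3.0, size=100_000)

for k in [1, 10, 100, 1_000, 10_000]:
    # Per-problem coverage after k independent attempts rises exponentially
    # toward 1: coverage_i = 1 - (1 - p_i)^k.
    coverage = 1.0 - (1.0 - p) ** k
    # Averaged over the heavy tail, the failure rate decays roughly like
    # k^(-0.3), a power law rather than an exponential.
    print(f"k={k:>6}  average failure rate = {1.0 - coverage.mean():.4f}")

Plotting the printed failure rates against k on log-log axes gives an approximately straight line with slope near -0.3, which is the power-law scaling the teaser refers to.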

_pubs/sdgrl.md

Lines changed: 3 additions & 8 deletions
@@ -12,25 +12,20 @@ authors:
     affiliation: Google DeepMind
   - name: Christopher D. Manning
     affiliation: Stanford University
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-04-28
 month: April
 day: 28
 has_pdf: true
 doi: 10.48550/arXiv.2504.04736
 tags:
   - model
   - generative ai
-teaser: This paper introduces Step-Wise Reinforcement Learning (SWiRL), a framework for improving multi-step reasoning and tool use in language models through synthetic data generation and offline RL. SWiRL decomposes reasoning trajectories into sub-trajectories, enabling fine-grained feedback and significant accuracy improvements across challenging tasks like HotPotQA, GSM8K, MuSiQue, CofCA, and BeerQA. Notably, SWiRL-trained models outperform larger proprietary models in multi-step reasoning while demonstrating strong task generalization and improved cost efficiency.
+teaser: SWiRL is a framework that improves language model reasoning through synthetic data generation and step-wise reinforcement learning, enabling models to outperform larger proprietary models across diverse reasoning tasks while demonstrating strong generalization capabilities.
 materials:
   - name: Paper
     url: https://arxiv.org/abs/2504.04736
     type: file-pdf
-  - name: Code Repository
-    url: https://github.com/ScalingIntelligence/swirl_rl
-    type: code
-  - name: Synthetic Multi-Step Reasoning Dataset
-    url: https://huggingface.co/datasets/ScalingIntelligence/swirl_synthetic_data
-    type: database
 ---
 Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluated SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question-answering) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
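
The step-wise decomposition described in that abstract is the part of SWiRL that is easiest to make concrete. A minimal sketch of the idea (illustrative only, with hypothetical field names; this is not the released SWiRL implementation) splits one multi-step trajectory into per-action sub-trajectories that can then be filtered and optimized individually:

from dataclasses import dataclass

@dataclass
class Step:
    observation: str  # what the model sees at this step (prompt, tool result, ...)
    action: str       # what the model generates next (reasoning, tool call, answer)

def decompose(trajectory: list[Step]) -> list[dict]:
    """Break one multi-step trajectory into per-action sub-trajectories.

    Sub-trajectory t pairs the full history up to step t with the action taken
    at step t, so each intermediate action can be filtered and optimized on its
    own rather than only the final answer. (Hypothetical field names; not the
    released SWiRL code.)
    """
    subs, history = [], ""
    for step in trajectory:
        history += step.observation
        subs.append({"context": history, "action": step.action})
        history += step.action
    return subs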

_pubs/tpt.md

Lines changed: 3 additions & 2 deletions
@@ -6,8 +6,9 @@ authors:
   - name: Anna Goldie
     affiliation: Stanford University
   - key: azaliamirhoseini
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-04-25
 month: April
 day: 25
 has_pdf: true
@@ -16,7 +17,7 @@ tags:
   - machine learning
   - model
   - generative ai
-teaser: This paper introduces the Think, Prune, Train (TPT) framework, a scalable method for improving language model reasoning without increasing model size. By iteratively fine-tuning models on their own reasoning traces and applying correctness-based pruning, TPT enables smaller models to achieve performance rivaling or exceeding larger ones. Experimental results on GSM8K and CodeContests show that models like Gemma-2B and LLaMA-70B-Instruct can surpass even GPT-4o on Pass@1 accuracy through recursive self-improvement.
+teaser: Think, Prune, Train (TPT) is a scalable framework that enables smaller language models to achieve performance rivaling larger ones through iterative self-improvement on their own reasoning traces, with experimental results showing models like Gemma-2B and LLaMA-70B-Instruct surpassing GPT-4o on reasoning tasks.
 materials:
   - name: Paper
     url: https://arxiv.org/abs/2504.18116
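
Both versions of the teaser describe the same loop: sample reasoning traces from the current model, prune them with a correctness check, and fine-tune on what survives. A rough sketch of that loop, based only on the teaser (the sample, is_correct, and finetune callables are assumptions, not the paper's code), might look like:

def think_prune_train(model, problems, sample, is_correct, finetune,
                      rounds=3, k=8):
    """Sketch of the Think, Prune, Train loop as described in the teaser.

    Each round: sample k reasoning traces per problem from the current model
    ("think"), keep only traces whose answer passes a correctness check
    ("prune"), then fine-tune the same model on the surviving traces ("train").
    The helper callables are hypothetical, not the paper's released code.
    """
    for _ in range(rounds):
        kept = []
        for problem in problems:
            for trace in sample(model, problem, k):   # think
                if is_correct(problem, trace):        # prune
                    kept.append((problem, trace))
        model = finetune(model, kept)                 # train
    return model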
