
Commit cd9c4e1

Merge pull request #24 from ScalingIntelligence/tanvir_dev
Potential local to deploy fix
2 parents: 6af406a + 5ba2664

File tree

3 files changed: +9, -20 lines


_pubs/monkeyspower.md

Lines changed: 3 additions & 6 deletions
@@ -28,14 +28,11 @@ has_pdf: true
 doi: 10.48550/arXiv.2502.17578
 tags:
 - machine learning
-- scaling laws
-- generative AI
-- inference compute
-teaser:
-  Recent research documents a surprising finding: increasing inference compute through repeated sampling reveals power-law scaling in average success rates. This occurs despite per-problem exponential scaling, explained by heavy-tailed distributions of single-attempt success rates. Our analysis unifies these observations under a mathematical framework, offering new insights into inference-time scaling laws and more efficient performance forecasting for language models.
+- generative ai
+teaser: Recent research documents a surprising finding: increasing inference compute through repeated sampling reveals power-law scaling in average success rates. This occurs despite per-problem exponential scaling, explained by heavy-tailed distributions of single-attempt success rates. Our analysis unifies these observations under a mathematical framework, offering new insights into inference-time scaling laws and more efficient performance forecasting for language models.
 materials:
 - name: Paper
   url: https://arxiv.org/abs/2502.17578
   type: file-pdf
 ---
-Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language model tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law - even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ~2-4 orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and the development of scaling-predictable evaluations of (multimodal) language models.
+Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language model tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law - even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ~2-4 orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and the development of scaling-predictable evaluations of (multimodal) language models.
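
The abstract's core claim admits a compact numerical check. The sketch below is an illustration of the teaser's mechanism, not part of this commit: single-attempt success probabilities are drawn from a heavy-tailed Beta(0.3, 3.0) prior (an assumed, illustrative distribution), and while any fixed problem's failure rate falls exponentially in the number of attempts k, the failure rate averaged over problems falls only polynomially, roughly as k^-0.3, with the exponent set by the Beta shape parameter.

```python
# Illustrative sketch (not from this commit): heavy-tailed per-problem
# success probabilities turn exponential per-problem scaling into
# aggregate power-law scaling. Beta(0.3, 3.0) is an assumed prior.
import numpy as np

rng = np.random.default_rng(0)
p = rng.beta(0.3, 3.0, size=100_000)  # heavy mass near p = 0
p_one = 0.05                          # one fixed problem's success probability

for k in [1, 10, 100, 1000]:
    per_problem = (1 - p_one) ** k      # exponential decay: 0.95**k
    aggregate = np.mean((1 - p) ** k)   # ~ C * k**-0.3: a power law whose
                                        # exponent is the Beta shape 0.3
    print(f"k={k:5d}  per-problem fail={per_problem:.3e}  mean fail={aggregate:.3e}")
```

By k = 1000 the fixed problem's failure probability is around 5e-23 while the average failure rate is still above 0.1, which is the gap between per-problem exponential scaling and aggregate polynomial scaling that the paper formalizes.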

_pubs/sdgrl.md

Lines changed: 3 additions & 8 deletions
@@ -19,14 +19,9 @@ day: 28
 has_pdf: true
 doi: 10.48550/arXiv.2504.04736
 tags:
-- reinforcement learning
-- language models
-- reasoning
-- tool use
-- synthetic data
-- generative AI
-teaser:
-  This paper introduces Step-Wise Reinforcement Learning (SWiRL), a framework for improving multi-step reasoning and tool use in language models through synthetic data generation and offline RL. SWiRL decomposes reasoning trajectories into sub-trajectories, enabling fine-grained feedback and significant accuracy improvements across challenging tasks like HotPotQA, GSM8K, MuSiQue, CofCA, and BeerQA. Notably, SWiRL-trained models outperform larger proprietary models in multi-step reasoning while demonstrating strong task generalization and improved cost efficiency.
+- model
+- generative ai
+teaser: This paper introduces Step-Wise Reinforcement Learning (SWiRL), a framework for improving multi-step reasoning and tool use in language models through synthetic data generation and offline RL. SWiRL decomposes reasoning trajectories into sub-trajectories, enabling fine-grained feedback and significant accuracy improvements across challenging tasks like HotPotQA, GSM8K, MuSiQue, CofCA, and BeerQA. Notably, SWiRL-trained models outperform larger proprietary models in multi-step reasoning while demonstrating strong task generalization and improved cost efficiency.
 materials:
 - name: Paper
   url: https://arxiv.org/abs/2504.04736
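
The mechanism named in the teaser, decomposing trajectories into sub-trajectories for fine-grained feedback, can be sketched in a few lines. The code below is a hedged illustration with invented names (`Step`, `decompose`), not SWiRL's actual implementation:

```python
# Hedged sketch of step-wise trajectory decomposition as described in the
# sdgrl.md teaser. All names here are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class Step:
    action: str       # e.g. a tool call or a chain-of-thought move
    observation: str  # e.g. the tool's returned result

def decompose(question: str, steps: list[Step]) -> list[dict]:
    """Split one N-step trajectory into N step-wise training examples."""
    examples = []
    for i, step in enumerate(steps):
        context = [(s.action, s.observation) for s in steps[:i]]
        examples.append({
            "input": (question, context),  # everything before step i
            "target": step.action,         # the single step to be scored
        })
    return examples
```

Each example pairs the question and the steps taken so far with the next action alone, which is what lets an offline RL objective assign credit per step instead of only at the end of the trajectory.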

_pubs/tpt.md

Lines changed: 3 additions & 6 deletions
@@ -14,12 +14,9 @@ has_pdf: true
 doi: 10.48550/arXiv.2504.18116
 tags:
 - machine learning
-- language models
-- reasoning
-- self-improvement
-- generative AI
-teaser:
-  This paper introduces the Think, Prune, Train (TPT) framework, a scalable method for improving language model reasoning without increasing model size. By iteratively fine-tuning models on their own reasoning traces and applying correctness-based pruning, TPT enables smaller models to achieve performance rivaling or exceeding larger ones. Experimental results on GSM8K and CodeContests show that models like Gemma-2B and LLaMA-70B-Instruct can surpass even GPT-4o on Pass@1 accuracy through recursive self-improvement.
+- model
+- generative ai
+teaser: This paper introduces the Think, Prune, Train (TPT) framework, a scalable method for improving language model reasoning without increasing model size. By iteratively fine-tuning models on their own reasoning traces and applying correctness-based pruning, TPT enables smaller models to achieve performance rivaling or exceeding larger ones. Experimental results on GSM8K and CodeContests show that models like Gemma-2B and LLaMA-70B-Instruct can surpass even GPT-4o on Pass@1 accuracy through recursive self-improvement.
 materials:
 - name: Paper
   url: https://arxiv.org/abs/2504.18116
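
The pruning step behind the teaser's "recursive self-improvement" is simple enough to show concretely. The toy below (invented data, not the paper's code) keeps only the sampled traces whose final answers match the gold answer; each TPT round fine-tunes on the survivors and re-samples from the resulting model:

```python
# Toy sketch of correctness-based pruning from the tpt.md teaser.
# The traces below are made-up stand-ins for model samples.
traces = [
    {"problem": "2 + 2 * 3", "steps": "2*3=6; 2+6=8", "answer": "8"},
    {"problem": "2 + 2 * 3", "steps": "2+2=4; 4*3=12", "answer": "12"},
]
gold = {"2 + 2 * 3": "8"}

# Prune: keep a trace only if its final answer is correct.
kept = [t for t in traces if t["answer"] == gold[t["problem"]]]
assert [t["steps"] for t in kept] == ["2*3=6; 2+6=8"]
# A full TPT round would now fine-tune on `kept` and sample again.
```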
