Commit 927fde0

Add pass@k definition and table title
1 parent 85089c2 commit 927fde0

File tree

1 file changed: +9 −2 lines changed


_blogs/kernelbench.md

Lines changed: 9 additions & 2 deletions
@@ -221,6 +221,12 @@ Comparing across various models, we note while some models do well on Level 1 ta
 
 Beyond greedy decoding, we are also interested in pass@k: the probability of obtaining at least one correct (and successfully compiled) solution given k attempts, as introduced in the [HumanEval paper](https://arxiv.org/abs/2107.03374). We sample models at high decoding temperature (deepseek-coder with temp=1.6, and Llama 3.1 70b-Instruct with temp=0.8) for more diverse samples, and compute pass@1,3,5,10 with N=100 samples.
 
+Pass@k is defined as
+$$
+\text{pass@$k$} := \mathop{\mathbb{E}}_{\text{problems}} \left[ 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}} \right]
+$$
+where $n$ is the total number of samples and $c$ is the number of correct samples.
+
 <img src="/imgs/blog/kernelbench/init_eval_passk.png" width="100%" />
 *Caption: Pass@k performance for Deepseek-coder and Llama 3.1 70B Instruct*
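
For readers implementing this, here is a minimal Python sketch of the unbiased per-problem estimator above, using the numerically stable product form from the HumanEval paper (the function name and the example values for n, c, and k are illustrative, not from the post):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimator of pass@k:
    1 - C(n-c, k) / C(n, k), computed as a stable product."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 100 samples for one problem, 15 of them correct.
print([round(pass_at_k(100, 15, k), 3) for k in (1, 3, 5, 10)])
# Benchmark-level pass@k is the mean of this estimate across problems.
```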

@@ -234,11 +240,12 @@ We only analyze correctness in the section above. However, in the case of kernel
 
 When evaluating performance, we prioritize correctness, as incorrect but fast code is not useful. Therefore, speedups are calculated using only the correct samples. To present a comprehensive view of performance, we report speedups in percentiles. The count of correct samples for each model is indicated in parentheses after the model name in the table below.
 
-In addition to the baseline PyTorch implementation, we also compare speedups against torch.compile() using its default mode.
+In addition to the baseline PyTorch implementation, we also compare speedups against torch.compile() using its default mode. The speedup is defined as
+$$\frac{t_{\text{baseline}}}{t_{\text{generated}}}$$
 
 <img src="/imgs/blog/kernelbench/init_eval_speedup.png" width="100%" />
 
-*Caption: Percentile of Speedups for both Torch and Torch Compile across 3 levels*
+*Caption: Percentile of Speedups vs. Baseline for both Torch and Torch Compile across 3 levels*
 
 Among the samples that are correct, we see that most generated kernels exhibit relatively low speedups over the torch and torch.compile baselines, but a few outliers are notably faster! This piqued our interest and led us to the following investigations.
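
As a rough sketch of how these percentile speedups could be computed (the variable names and timing values below are hypothetical illustrations, not the post's actual harness):

```python
import numpy as np

# Hypothetical wall-clock timings (ms) for the correct samples only,
# each pair measured on identical inputs for the same problem.
t_baseline  = np.array([1.20, 0.95, 2.10, 1.40, 0.80])  # PyTorch reference
t_generated = np.array([1.50, 1.00, 2.40, 0.35, 0.90])  # generated kernels

speedups = t_baseline / t_generated  # > 1 means the generated kernel is faster
for p in (25, 50, 75, 100):
    print(f"p{p}: {np.percentile(speedups, p):.2f}x")
```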
