2 changes: 2 additions & 0 deletions .gitignore
@@ -205,3 +205,5 @@ cython_debug/
marimo/_static/
marimo/_lsp/
__marimo__/

b-pchanda/
15 changes: 15 additions & 0 deletions README.md
@@ -102,6 +102,21 @@ We provide examples using three datasets: Maze, Game of 24, and SpatialMap.
python ./examples/TTSwithVerification/[your_dataset]_stepverifier.py -n 1  # dataset = maze, game24, or spatialmap
```

To run the Tree-of-Thought baseline:

```bash
python ./examples/TTSwithVerification/tot_baseline.py \
--task maze \
--num_examples 4 \
--ports 8000,8001,8002,8003 \
--concurrency 4 \
--model Qwen/QwQ-32B \
    --max_tokens 32768
```

This script loads the same datasets as the verification examples, spins up `TreeOfThoughtSearch` with configurable branching/depth, and round-robins requests across multiple vLLM instances for faster experimentation.
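As a rough sketch of the round-robin dispatch idea (the class and method names here are illustrative, not the script's actual internals), rotating requests across several vLLM endpoints can look like:

```python
from itertools import cycle


class RoundRobinEndpoints:
    """Illustrative sketch only: rotate requests across several vLLM base URLs.
    The real tot_baseline.py may structure this differently."""

    def __init__(self, ports, host="localhost"):
        self._ports = cycle(ports)  # endless iterator over the port list
        self._host = host

    def next_base_url(self):
        """Return the next OpenAI-compatible vLLM endpoint in rotation."""
        return f"http://{self._host}:{next(self._ports)}/v1"


endpoints = RoundRobinEndpoints([8000, 8001, 8002, 8003])
```

Each outgoing generation request then calls `next_base_url()`, so the four servers started on `--ports 8000,8001,8002,8003` share the load evenly.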
### Monitors for Early Stopping
```bash
python ./examples/EarlyStopping/[your_dataset]_example.py -n 1
41 changes: 41 additions & 0 deletions examples/TTSwithVerification/README.md
@@ -156,6 +156,39 @@ The Z3 solver handles diagonal directions (`Northwest`, `Northeast`, `Southwest`

---

# Best-of-K Baseline

A simple best-of-K baseline that generates K independent reasoning traces per example and selects the best one based on:
1. **Ground-truth matching** (default): greedy selection of the first correct answer among the K samples
2. **Critic model evaluation** (optional): a separate critic LLM judges correctness without access to the ground truth

This baseline demonstrates that with sufficient sampling, even simple CoT can achieve good performance.
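A minimal sketch of the default greedy ground-truth selection, assuming a hypothetical `extract_answer` callable standing in for whatever answer parsing the script actually uses:

```python
def best_of_k(traces, extract_answer, ground_truth):
    """Greedy best-of-K (illustrative sketch): return the first trace whose
    extracted answer matches the ground truth, falling back to the first
    trace when none of the K samples are correct."""
    for trace in traces:
        if extract_answer(trace) == ground_truth:
            return trace
    return traces[0]
```

The critic-model variant replaces the `== ground_truth` check with a judgment from a second LLM, so no reference answer is needed at selection time.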

## Usage

```bash
# Best-of-K with ground-truth evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 10 --k 4

# Best-of-K with critic model evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 10 --k 4 --use_critic --critic_model Qwen/Qwen3-30B-A3B-Thinking-2507 --critic_port 8001
```

### Parameters

| Argument | Description | Default |
|----------|-------------|---------|
| `--task` | Task: `game24`, `maze`, or `spatialmap` | required |
| `--k` | Number of samples per example | `4` |
| `--use_critic` | Use critic model for evaluation instead of ground truth | `False` |
| `--critic_model` | Model to use for critic evaluation | MAIN_MODEL |
| `--critic_port` | vLLM server port for critic model | `8001` |
| `--num_examples`, `-n` | Number of examples to run | varies |
| `--main_model` | Model for generation | `Qwen/Qwen3-30B-A3B-Thinking-2507` |
| `--port` | vLLM server port for main model | `8000` |

---

## Example Scripts

Each script runs a full evaluation: loading a dataset, building structured prompts, running inference with step verification, and computing accuracy/token statistics.
@@ -169,6 +202,14 @@ python ./examples/TTSwithVerification/maze_stepverifier.py -n 1

# SpatialMap with step verification
python ./examples/TTSwithVerification/spatialmap_stepverifier.py -n 1

# Best-of-K baseline (standard CoT, no monitors)
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 1 --k 4
python ./examples/TTSwithVerification/bestofk_baseline.py --task maze -n 1 --k 4
python ./examples/TTSwithVerification/bestofk_baseline.py --task spatialmap -n 1 --k 4

# Best-of-K with critic model evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 1 --k 4 --use_critic
```

### Common arguments
Expand Down