# Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
A benchmark and data synthesis framework for creative code generation across combinatorial and exploratory settings.
## News

- [2026.03.13] CreativeBench paper is available on arXiv.
- [2026.03.13] CreativeBench dataset is available on Hugging Face.
- [2026.03.13] The project homepage is now live.
CreativeBench is an open-source benchmark and data synthesis framework for creative code generation, featuring two complementary pipelines:
- Combo (reverse-engineering): Combines solutions from different domains to synthesize new problems and tests.
- Explore (self-play): Evolves problems through progressive constraints to elicit novel solutions.
This repository provides the pipelines, templates, and artifacts needed to reproduce the dataset generation process.
- News
- Introduction
- Project Structure
- Data Resources
- Combo Pipeline (Reverse-Engineering)
- Explore Pipeline (Self-Play)
- Evaluation
- Todo
- License
## Introduction

CreativeBench targets creative code generation: the ability to produce correct, novel solutions under new constraints or from cross-domain recombination. We provide:
- Combo: cross-domain code recombination + sandbox feedback, yielding novel tasks with verified tests.
- Explore: progressive constraint self-play, encouraging diverse solution strategies beyond the baseline.
The framework is designed for reproducibility and extensibility, and can be adapted to other languages or models.
## Project Structure

```
.
├── CreativeGen/
│   ├── combo/           # reverse-engineering pipeline
│   └── explore/         # self-play pipeline
├── datasets-subset/     # sampled datasets only
├── evaluation/          # evaluation utilities
└── inference/           # inference utilities
```
## Data Resources

We provide sampled datasets in datasets-subset/.
Field definitions (each JSONL line):
- question: problem statement
- canonical_solution: reference solution
- demo_test_func: public tests
- full_test_func: comprehensive tests
- language: programming language
- difficulty: difficulty label
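Records in this schema can be loaded and sanity-checked with a few lines of Python; the helper below is an illustrative sketch, not part of the repository:

```python
import json

# Fields expected on every CreativeBench JSONL record.
REQUIRED_FIELDS = (
    "question", "canonical_solution", "demo_test_func",
    "full_test_func", "language", "difficulty",
)

def load_records(path):
    """Load a JSONL file and verify each record carries the expected fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            records.append(record)
    return records
```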
## Combo Pipeline (Reverse-Engineering)

Pipeline steps:

1. Select domain pairs and build combo prompts
2. Generate combined solutions
3. Validate in sandbox
4. Fix failed solutions using feedback
5. Generate tests and questions
6. Format final dataset
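The generate, validate, and fix steps above form a repair loop driven by sandbox feedback. A minimal, model-agnostic sketch (the `generate`, `validate`, and `fix` callables are hypothetical stand-ins for the pipeline's LLM and sandbox calls, not the repository's actual API):

```python
def synthesize(prompt, generate, validate, fix, max_fix_attempts=3):
    """Generate a combined solution, then repair it using sandbox feedback.

    generate(prompt) -> solution string
    validate(solution) -> (ok: bool, feedback: str)
    fix(solution, feedback) -> revised solution string
    """
    solution = generate(prompt)
    ok, feedback = validate(solution)
    for _ in range(max_fix_attempts):
        if ok:
            break
        solution = fix(solution, feedback)   # repair with sandbox feedback
        ok, feedback = validate(solution)    # re-check in the sandbox
    return solution if ok else None          # give up after max_fix_attempts
```

This mirrors the `<max_fix_attempts>` argument of the pipeline script: each failed validation buys one more repair attempt before the candidate is discarded.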
Run the pipeline with:

```bash
bash CreativeGen/combo/run_combo_pipeline.sh \
    <num_combos> <max_fix_attempts> <input_jsonl>
```

Example:

```bash
bash CreativeGen/combo/run_combo_pipeline.sh 5 3 /path/to/input.jsonl
```

A run folder is created under:

```
CreativeGen/combo/runs/run_YYYYMMDD_HHMMSS/
```
Key artifacts:
- combo_final_success.jsonl
- test_func.jsonl
- combo_final_dataset.jsonl
- combo_final_formatted.jsonl
## Explore Pipeline (Self-Play)

Pipeline steps:

1. Filter source dataset to Python-only (or target language)
2. Identify key techniques in baseline solutions
3. Add progressive constraints
4. Generate constrained solutions
5. Verify compliance and run sandbox validation
6. Compute creativity scores
7. Convert results to inference-ready flat dataset
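The progressive-constraint steps above can be sketched as a loop that tightens the problem one constraint at a time, keeping only constraints whose solutions both comply and still pass the tests. The `solve`, `complies`, and `run_tests` callables below are hypothetical stand-ins for the pipeline's model and sandbox calls, not the repository's actual API:

```python
def evolve(problem, constraints, solve, complies, run_tests):
    """Tighten the problem one constraint at a time.

    solve(problem, constraint_list) -> solution string
    complies(solution, constraint_list) -> bool  (constraint compliance check)
    run_tests(solution) -> bool                  (sandbox validation)
    """
    baseline = solve(problem, [])        # unconstrained reference solution
    active, rounds = [], []
    for c in constraints:
        candidate = solve(problem, active + [c])
        # Keep the constraint only if the solution obeys it and still passes.
        if complies(candidate, active + [c]) and run_tests(candidate):
            active.append(c)
            rounds.append({"constraint": c, "solution": candidate})
    return {"baseline": baseline, "rounds": rounds}
```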
Run the pipeline with:

```bash
bash CreativeGen/explore/run_explore_pipeline.sh \
    /path/to/autocodebench.jsonl
```

A run folder is created under:

```
CreativeGen/explore/runs/run_YYYYMMDD_HHMMSS/
```

Key artifacts:
- creativity_evolution_results.json
- creativity_analysis.png
- CreativeGen/explore/data/converted/*_infer_*.jsonl
## Evaluation

If you have the sandbox server running, you can validate solutions with:
```bash
python3 CreativeGen/combo/src/call_sandbox.py \
    --input_file path/to/data.jsonl \
    --output path/to/output.jsonl \
    --solution_key canonical_solution
```

Sandbox usage details will be documented here.
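Without the sandbox service, a crude Python-only check can still catch obviously broken solutions. The helper below is an illustrative stand-in, not part of the repository, and it runs code without any isolation, so never use it on untrusted generated code:

```python
import subprocess
import sys

def quick_check(solution_code, test_code, timeout=10):
    """Run a Python solution plus its test code in a subprocess.

    A rough, unsandboxed stand-in for the sandbox service:
    Python-only, no isolation, no resource limits.
    Returns (passed, stderr).
    """
    program = solution_code + "\n\n" + test_code
    proc = subprocess.run(
        [sys.executable, "-c", program],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode == 0, proc.stderr
```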
The sandbox implementation is being cleaned up, and MultiLanguageSandbox/ is not included in the current public release yet.
## Todo

- Upload and document MultiLanguageSandbox/ for code execution and verification.
- Release setup instructions for the sandbox service used by the combo and explore pipelines.
- Add end-to-end verification examples for benchmark generation and inference evaluation.
- Expand sandbox support and documentation for additional programming languages.
## License

This project is released under the MIT License. See LICENSE for details.
## Citation

If you use CreativeBench in your work, please cite:
```bibtex
@misc{wang2026creativebenchbenchmarkingenhancingmachine,
      title={CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges},
      author={Zi-Han Wang and Lam Nguyen and Zhengyang Zhao and Mengyue Yang and Chengwei Qin and Yujiu Yang and Linyi Yang},
      year={2026},
      eprint={2603.11863},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.11863},
}
```