Web-Bench

中文 (Chinese) | Install | Paper | Datasets | LeaderBoard | Citation

📖 Overview

Web-Bench is a benchmark designed to evaluate the performance of LLMs in real-world Web development. It contains 50 projects, each consisting of 20 tasks with sequential dependencies: the tasks implement project features in sequence, simulating real-world human development workflows. Web-Bench was designed to cover the foundational elements of Web development: Web Standards and Web Frameworks. The projects were designed by engineers with 5-10 years of experience, and their scale and complexity make each one a significant challenge: on average, a single project takes a senior engineer 4–8 hours to complete. On our reference benchmark agent (Web-Agent), the SOTA model (Claude 3.7 Sonnet) achieves only 25.1% Pass@1.
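
For reference, Pass@1 can be read with the standard pass@k definition at k = 1; this is a sketch of the usual single-sample form (the paper defines the exact aggregation used for Web-Bench): each task gets one generated solution, and

$$\text{Pass@1} = \frac{\#\{\text{tasks whose single generated solution passes the tests}\}}{\#\{\text{tasks}\}}$$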

The distribution of the experimental data aligns well with the current code generation capabilities of mainstream LLMs.

[Figure: Pass@1 across benchmarks]

HumanEval and MBPP have approached saturation, and APPS and EvalPlus are approaching it. The SOTA Pass@1 on Web-Bench is 25.1%, lower (i.e., a more challenging benchmark) than on the SWE-bench Full and Verified sets.

[Figure: SOTA models' Pass@1 results]

🚀 Quick Start

Refer to the Docker setup guide for instructions on installing Docker on your machine.
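
To confirm that Docker and the Compose plugin are installed before continuing, a quick check (standard Docker CLI commands):

```bash
# Verify the Docker daemon and the Compose plugin are available
docker --version
docker compose version
```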

1. Create a new empty folder and add two files to it:
   - `./config.json5`
   - `./docker-compose.yml`
2. For `config.json5`, copy the JSON5 below and edit it according to the Config Parameters:

```json5
{
  models: [
    'openai/gpt-4o',
    // You can add more models here
    // "claude-sonnet-4-20250514"
  ],
  // Evaluate one project only
  // "projects": ["@web-bench/react"]
}
```
3. For `docker-compose.yml`, copy the YAML below and set the environment variables:

```yaml
services:
  web-bench:
    image: maoyiweiebay777/web-bench:latest
    volumes:
      - ./config.json5:/app/apps/eval/src/config.json5
      - ./report:/app/apps/eval/report
    environment:
      # Add environment variables according to apps/src/model.json
      - OPENROUTER_API_KEY=your_api_key
      # Add keys for more models
      # - ANTHROPIC_API_KEY=your_api_key
```
4. Run Docker Compose:

```bash
docker compose up
```

5. The evaluation report will be generated under `./report/`.
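
Putting the steps together, an end-to-end run might look like this (the folder name `web-bench-eval` is just an example; the two files are the ones created above):

```bash
# Scaffold a working folder (the name is arbitrary)
mkdir web-bench-eval && cd web-bench-eval

# Create config.json5 and docker-compose.yml with the contents shown above,
# then start the evaluation
docker compose up

# When the run finishes, inspect the generated report
ls ./report/
```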

If you wish to evaluate from source code, refer to Install from source.

🛠️ Contribution

📚 Citation

```bibtex
@article{xu2025webbench,
  title={Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks},
  author={Xu, Kai and Mao, YiWei and Guan, XinYi and Feng, ZiLong},
  journal={arXiv preprint arXiv:2505.07473},
  year={2025}
}
```

📄 License

Apache 2.0

🌟 Contact us

  • Lark: Register for Feishu, then scan the QR code below to join our Web-Bench user group.

[QR code: Web-Bench user group]
