
feat(190B): Coding Benchmark Panel — Wave 2 + Wave 3 (3-tab /benchmark UI with Chart.js)#430

Merged
mpieniak01 merged 10 commits into main from copilot/integrate-coding-benchmarks-panel
Mar 3, 2026
Conversation

Contributor

Copilot AI commented Mar 3, 2026

Adds full coding benchmark frontend on top of the Wave 1 backend. /benchmark gains two new tabs — Code Models (start/status) and Code Results (history + jobs table + Chart.js charts) — alongside the existing LLM Models tab. Also cleans up the coding-benchmark-* naming duplication left from earlier work.

Summary

Wave 2 + Wave 3 frontend for /benchmark. Introduces tab-switched UI for coding benchmark: configurator + live console (Wave 2), and history + jobs table + Chart.js visualizations (Wave 3). Consolidates component naming to benchmark-coding-* convention and removes duplicate files.

Scope

In scope:

  • benchmark-coding-configurator.tsx — Ollama model × task multi-select, loop task, timeout, max_rounds, stop-on-failure
  • benchmark-coding-console.tsx — live log stream + progress bar (completed/total jobs)
  • benchmark-coding-results.tsx — run history list + jobs table + delete/clear; fully internationalized
  • benchmark-coding-charts.tsx — Chart.js canvas charts: pass-rate bar (per model) + timing stacked bar (warmup/coding/request); exports computePassRates + computeTimings as pure testable helpers
  • use-coding-benchmark.ts — polling, start/status, delete/clearAll; exports resolvePollStatus + buildProgressLog for unit tests
  • app/benchmark/page.tsx — 3 tabs: llmModels / codeModels / codeResults
  • i18n en/pl/de: benchmark.tabs.{llmModels,codeModels,codeResults} + benchmark.coding.{config,console,results,charts}.* (zero hardcoded user-facing strings)
  • tests/use-coding-benchmark.test.ts — 18 unit tests covering hook helpers + chart data adapters
  • Backend tests: +3 tests for security-hardened _run_dir and request_wall_seconds field → 13/13 new lines = 100% coverage
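The chart helpers listed above are described as pure and testable. A minimal sketch of what a computePassRates-style adapter might look like (the real signature lives in benchmark-coding-charts.tsx; the CodingJob shape and status strings here are assumptions):

```typescript
// Hypothetical sketch of the pure pass-rate adapter; field names and
// status values are illustrative, not the component's actual contract.
type CodingJob = { model: string; status: "pass" | "fail" | string };

function computePassRates(jobs: CodingJob[]): Record<string, number> {
  const totals: Record<string, { pass: number; all: number }> = {};
  for (const job of jobs) {
    const bucket = (totals[job.model] ??= { pass: 0, all: 0 });
    bucket.all += 1;
    if (job.status === "pass") bucket.pass += 1;
  }
  const rates: Record<string, number> = {};
  for (const [model, { pass, all }] of Object.entries(totals)) {
    rates[model] = all > 0 ? pass / all : 0;
  }
  return rates;
}
```

Keeping the aggregation pure like this is what makes the 8 chart-adapter unit tests possible without rendering a canvas.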

Out of scope:

  • Backend endpoint/schema changes
  • OpenAPI codegen update (hand-written types in types.ts; regeneration deferred)

Deleted duplicates:

  • coding-benchmark-configurator.tsx
  • coding-benchmark-results.tsx

Quality Gates (Hard Gate)

  • make pr-fast passed
  • CI required checks passed (Forbidden Paths Guard, Architecture drift guard, Backend lite (pytest), Frontend lite (lint), OpenAPI Contract (export + TS codegen), SonarCloud Scan, Quick validator (syntax + CI-lite deps))
  • For new/renamed tests: make test-catalog-sync + make test-groups-sync executed

Validation Report

Commands run

node --import tsx --test tests/use-coding-benchmark.test.ts
node --import tsx --test tests/*.test.ts
npx tsc --noEmit
make test-groups-check
.venv/bin/python -m pytest tests/test_coding_run_service.py tests/test_coding_run_routes.py \
  --cov=venom_core --cov-report=xml:test-results/sonar/python-coverage.xml -q
make check-new-code-coverage-diagnostics
make pr-fast

Results (pass/fail)

Command | Result
tests/use-coding-benchmark.test.ts | PASS — 18/18
tests/*.test.ts (web-next) | PASS — 187/187
tsc --noEmit (changed files) | PASS — 0 new errors
make test-groups-check | PASS
backend tests (42 total) | PASS — 42/42
make check-new-code-coverage-diagnostics | PASS
make pr-fast | PASS ✅

Changed-lines coverage

13/13 = 100.0% (threshold 80%)

New backend tests cover: _run_dir invalid UUID guard (line 217), path-traversal relative_to check (lines 222–223), and request_wall_seconds artifact field (line 155).
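The guards those tests cover (invalid run_id rejection, path-traversal check) are implemented in Python in venom_core/services/benchmark_coding.py; the same idea can be sketched in TypeScript (names here are illustrative, not the service's API):

```typescript
import { resolve, sep } from "node:path";

// Illustrative _run_dir-style guard: only UUID run ids, and the resolved
// directory must stay inside the base directory (no path traversal).
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function runDir(baseDir: string, runId: string): string {
  if (!UUID_RE.test(runId)) {
    throw new Error(`invalid run_id: ${runId}`);
  }
  const base = resolve(baseDir);
  const candidate = resolve(base, runId);
  // Defense in depth: the UUID check already blocks "..", but verify the
  // resolved path anyway (mirrors the Python relative_to check).
  if (candidate !== base && !candidate.startsWith(base + sep)) {
    throw new Error("path traversal detected");
  }
  return candidate;
}
```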

Risks / Limitations / Skips

  • Chart.js via canvas, no react-chartjs-2: dynamic import("chart.js/auto") inside useEffect avoids SSR issues and keeps the existing chart.js package as the only charting dependency.
  • Code Results charts show current run only: Charts render codingRun.jobs from polling state. History entries don't carry full job arrays; full per-run chart drill-down is a future enhancement.
  • first_sieve_task hardcoded to "" in configurator — wiring exposed in type but not in UI; Wave 4 item.
  • OpenAPI types not regenerated: coding schemas are hand-written in types.ts and will drift until make openapi-export && make openapi-codegen-types runs (CI gate catches this).
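The polling state mentioned above is driven by the helpers use-coding-benchmark.ts exports for testing. A sketch of what resolvePollStatus and buildProgressLog might look like (the exported names come from this PR; the signatures and output format are assumptions):

```typescript
// Hypothetical shapes of the hook's exported pure helpers.
type RunState = "pending" | "running" | "completed" | "failed";

function resolvePollStatus(state: RunState): { keepPolling: boolean; done: boolean } {
  switch (state) {
    case "pending":
    case "running":
      return { keepPolling: true, done: false };
    case "completed":
    case "failed":
      return { keepPolling: false, done: true };
  }
}

function buildProgressLog(completed: number, total: number): string {
  const pct = total > 0 ? Math.round((100 * completed) / total) : 0;
  return `progress: ${completed}/${total} jobs (${pct}%)`;
}
```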

Evidence

# web-next unit tests
# tests 187 / suites 52 / pass 187 / fail 0

# backend new-code coverage
Changed lines coverage: 13/13 = 100.0% (required: 80.0%)
PASS: changed-lines coverage meets the threshold.

# hard gate
✅ PR fast check passed.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • huggingface.co
    • Triggering command: /home/REDACTED/work/Venom/Venom/.venv/bin/python /home/REDACTED/work/Venom/Venom/.venv/bin/python -u -c import sys;exec(eval(sys.stdin.readline())) (dns block)
  • unreachable.invalid
    • Triggering command: /home/REDACTED/work/Venom/Venom/.venv/bin/python /home/REDACTED/work/Venom/Venom/.venv/bin/python -u -c import sys;exec(eval(sys.stdin.readline())) (dns block)


Original prompt

This section details the original issue you should resolve

<issue_title>190B — Integration of coding benchmarks with the web-next panel (/benchmark)</issue_title>
<issue_description>## 190B — Integration of coding benchmarks with the web-next panel (/benchmark)

Goal of PR 190B

Let the user run coding benchmarks (sanity/simple/complex/feedback-loop) from the UI, without the CLI.

Runtime scope (explicit)

  1. This stage (190B) covers Ollama coding benchmarks.
  2. Tested models come from the local Ollama stack (/api/tags, ollama list).
  3. Parameters and metrics (warmup/coding/request) are computed for Ollama /api/generate calls.
Current state (as-is)

  1. The /benchmark screen supports LLM model benchmarking (latency/speed/VRAM).
  2. The backend has benchmark endpoints for the current flow (/api/v1/benchmark/*).
  3. Coding benchmark scripts exist (scripts/ollama_bench/*) but are not wired into the panel.

Target scope (to-be): 3 tabs

  1. Tab 1: LLM Models (existing, no regressions).
  2. Tab 2: Code Models (new):
    • model selection (multi-select),
    • task selection (python_sanity, python_simple, python_complex, python_complex_bugfix),
    • run parameters: timeout, model-timeout-overrides, max-rounds, options (temperature/top_p), first-sieve-task, stop-on-failure,
    • run mode: single task or scheduler.
  3. Tab 3: Code Results (new):
    • results table (PASS/FAIL, error, task, model),
    • timing metrics: warmup_seconds, coding_seconds, request_wall_seconds, total_seconds,
    • comparison charts (ranking, timings, pass rate),
    • run history with the ability to refresh/delete an entry.

Backend/API to add

  1. New API namespace, e.g.:
    • POST /api/v1/benchmark/coding/start
    • GET /api/v1/benchmark/coding/{run_id}/status
    • GET /api/v1/benchmark/coding/list?limit=...
    • DELETE /api/v1/benchmark/coding/{run_id}
    • DELETE /api/v1/benchmark/coding/all
  2. The start endpoint launches the scheduler/runner and returns a run_id.
  3. The status endpoint returns:
    • state (pending/running/completed/failed),
    • job progress,
    • partial and final results,
    • artifact paths and timing metrics.
  4. The payload contract is based on the existing parameters of scripts/ollama_bench/scheduler.py.
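A frontend client for these endpoints could be sketched as below. The paths come from the issue; the client shape, the FetchLike abstraction (injected so the functions are testable without a network), and the response fields are assumptions:

```typescript
// Hypothetical client sketch for the proposed coding-benchmark endpoints.
type FetchLike = (
  url: string,
  init?: { method?: string; body?: string; headers?: Record<string, string> },
) => Promise<{ json(): Promise<unknown> }>;

const BASE = "/api/v1/benchmark/coding";

// Start a run; assumes the response carries a run_id field.
async function startRun(fetchFn: FetchLike, payload: object): Promise<string> {
  const res = await fetchFn(`${BASE}/start`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  const data = (await res.json()) as { run_id: string };
  return data.run_id;
}

// Fetch the raw status payload for a run.
async function getStatus(fetchFn: FetchLike, runId: string): Promise<unknown> {
  const res = await fetchFn(`${BASE}/${runId}/status`);
  return res.json();
}
```

Injecting fetch keeps the helpers unit-testable with a stub, the same testability goal the PR's exported pure helpers serve.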

Frontend (web-next): implementation tasks

  1. Extend app/benchmark/page.tsx with tabs + separate sections per tab.
  2. Add a use-coding-benchmark hook analogous to use-benchmark.
  3. Add components:
    • benchmark-coding-configurator.tsx
    • benchmark-coding-console.tsx
    • benchmark-coding-results.tsx
    • benchmark-coding-charts.tsx
  4. Add types in web-next/lib/types.ts.
  5. Add pl/en/de translations for the new labels and messages.
  6. Add UI/hook tests for the new flow.

Charts (jq/JS): technology decision

  1. chart.js is already in the project (web-next/package.json) and is already used (components/cockpit/token-chart.tsx).
  2. Recommendation for PR 190B:
    • use the existing chart.js (no new library),
    • prepare a lightweight adapter layer for benchmark data.
  3. If faster component ergonomics are needed:
    • optionally add react-chartjs-2 (a wrapper around the current Chart.js).
  4. jq (CLI) can pre-aggregate JSON artifacts on the dev/ops side, but it is not a library for rendering charts in the UI.
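The "lightweight adapter layer" recommended above can be a pure function that turns aggregated rates into a Chart.js-style data object, with no rendering involved. A sketch (the object shape follows Chart.js bar-chart conventions; the 0.8 threshold and colors are assumptions):

```typescript
// Hypothetical adapter: pass rates -> Chart.js bar-chart `data` object.
function toPassRateChartData(rates: Record<string, number>) {
  const labels = Object.keys(rates).sort();
  return {
    labels,
    datasets: [
      {
        label: "pass rate",
        data: labels.map((m) => rates[m]),
        // Assumed threshold: green at/above 0.8, amber below.
        backgroundColor: labels.map((m) =>
          rates[m] >= 0.8 ? "#16a34a" : "#d97706",
        ),
      },
    ],
  };
}
```

Because the adapter only builds data, the Chart.js rendering call stays a thin shell and the interesting logic stays unit-testable.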

KPIs and presentation on the Code Results tab

  1. pass_rate per model.
  2. Median times:
    • warmup,
    • coding,
    • request.
  3. Failure counts by category:
    • timeout,
    • empty response,
    • invalid format,
    • test/lint failure.
  4. Charts:
    • bar chart: pass_rate per model,
    • stacked bar: warmup vs coding,
    • table + sort + filters (task/model/status/date).
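The median-time KPI above can be computed with a small pure helper. A sketch (the warmup_seconds field name mirrors the metrics listed in this issue; the helper names are illustrative):

```typescript
// Standard median: average of the two middle values for even-length input.
function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

type TimedJob = { model: string; warmup_seconds: number };

// Group jobs by model and take the median warmup time of each group.
function medianWarmupPerModel(jobs: TimedJob[]): Record<string, number> {
  const byModel: Record<string, number[]> = {};
  for (const j of jobs) (byModel[j.model] ??= []).push(j.warmup_seconds);
  return Object.fromEntries(
    Object.entries(byModel).map(([m, v]) => [m, median(v)]),
  );
}
```

The same grouping pattern extends to coding_seconds and request_wall_seconds.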

DoD for PR 190B

  1. /benchmark has 3 working tabs (LLM Models, Code Models, Code Results).
  2. The user can start a coding benchmark from the UI and track its status.
  3. Results and timing metrics are presented in tables and charts.
  4. Run history is available and refreshable.
  5. No regression of the existing LLM benchmark tab.
  6. i18n parity: all new strings in pl/en/de.
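The i18n-parity requirement is mechanically checkable: flatten each locale object to its dotted key set and diff against a reference locale. A sketch (the locale data in the test is toy data, not the real en/pl/de files):

```typescript
// Nested locale objects: leaves are strings, branches are nested sections.
type Locale = { [k: string]: string | Locale };

// Flatten to dotted keys, e.g. { tabs: { llmModels: "..." } } -> ["tabs.llmModels"].
function flatKeys(obj: Locale, prefix = ""): string[] {
  return Object.entries(obj).flatMap(([k, v]) =>
    typeof v === "string" ? [prefix + k] : flatKeys(v, `${prefix}${k}.`),
  );
}

// Keys present in the reference locale but missing from the candidate.
function missingKeys(reference: Locale, candidate: Locale): string[] {
  const have = new Set(flatKeys(candidate));
  return flatKeys(reference).filter((k) => !have.has(k));
}
```

Running missingKeys(en, pl) and missingKeys(en, de) in a unit test would turn the parity rule into a hard gate.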

Delivery plan in waves (smaller scope per agent session)

  1. Wave 1: backend API + data contracts
    • Scope:
      • /api/v1/benchmark/coding/* endpoints,
      • a service for launching benchmarks and reporting statuses,
      • request/response schemas,
      • wiring in main.py and router_wiring.py,
      • backend tests (service/routes).
    • Out of scope:
      • changes to the /benchmark tab UI,
      • frontend charts.
    • Wave DoD:
      • endpoints work locally,
      • backend tests green,
      • no regression of the existing benchmark API.
  2. Wave 2: frontend Code Models (running tests)
    • Scope:
      • Code Models tab,
      • use-coding-benchmark,
      • configurator + k...


- New schemas: venom_core/api/schemas/benchmark_coding.py
- New service: venom_core/services/benchmark_coding.py
- New routes: venom_core/api/routes/benchmark_coding.py (POST /start, GET /list, GET /{id}/status, DELETE /all, DELETE /{id})
- Wire routes in router_wiring.py and main.py
- Tests: tests/test_benchmark_coding_service.py (24 tests)
- Tests: tests/test_benchmark_coding_routes.py (15 tests)
- Register in test catalog + sync pytest groups

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
Copilot stopped work on behalf of mpieniak01 due to an error March 3, 2026 14:14
mpieniak01 and others added 2 commits March 3, 2026 15:23
- New hook: web-next/hooks/use-coding-benchmark.ts (polls /api/v1/benchmark/coding/*)
- New component: coding-benchmark-configurator.tsx (model/task/loop config)
- New component: coding-benchmark-results.tsx (jobs table + history)
- Update benchmark page with Classic/Coding tab switcher
- Add i18n: benchmark.tabs + benchmark.coding sections in en/pl/de
- Add CodingBenchmark* types to web-next/lib/types.ts

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Integrate coding benchmarks with web-next panel" to "feat(190B): Coding Benchmark Panel — backend service + frontend UI (Wave 1 + Wave 2)" Mar 3, 2026
- New components with required names: benchmark-coding-configurator.tsx,
  benchmark-coding-console.tsx (with progress bar and run context)
- page.tsx: tabs renamed LLM Models / Code Models, imports updated to
  new component names, CodingBenchmarkResults removed (Wave 3 scope)
- use-coding-benchmark.ts: export resolvePollStatus + buildProgressLog
  pure helpers for testing
- i18n: tabs.classic → tabs.llm (LLM Models); tabs.coding → Code Models;
  remove out-of-scope coding.results section in en/pl/de
- tests/use-coding-benchmark.test.ts: 10 tests for hook pure helpers
- tests/test_coding_run_service.py: +3 tests covering lines 155,217,222-223
  (request_wall_seconds, invalid run_id, path traversal guard) → 100% new-code coverage

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
Copilot AI changed the title from "feat(190B): Coding Benchmark Panel — backend service + frontend UI (Wave 1 + Wave 2)" to "feat(190B-wave2): Coding Benchmark Panel — Wave 2 frontend (start/status)" Mar 3, 2026
- page.tsx: 3 tabs (LLM Models / Code Models / Code Results) with BarChart3 icon
- benchmark-coding-results.tsx: new component (port from old coding-benchmark-results.tsx)
  - full i18n for all user-facing strings
  - history + jobs table + delete/clear
- benchmark-coding-charts.tsx: new Chart.js charts
  - computePassRates + computeTimings exported pure helpers
  - pass-rate bar chart per model (green/amber by threshold)
  - timing stacked bar chart (warmup/coding/request per model)
- Delete duplicate: coding-benchmark-configurator.tsx, coding-benchmark-results.tsx
- i18n en/pl/de: tabs.llmModels|codeModels|codeResults + coding.results.* + coding.charts.*
- tests: +8 chart adapter tests (computePassRates, computeTimings) → 18 total PASS

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
Copilot AI changed the title from "feat(190B-wave2): Coding Benchmark Panel — Wave 2 frontend (start/status)" to "feat(190B): Coding Benchmark Panel — Wave 2 + Wave 3 (3-tab /benchmark UI with Chart.js)" Mar 3, 2026
@mpieniak01 mpieniak01 marked this pull request as ready for review March 3, 2026 15:43
Copilot AI review requested due to automatic review settings March 3, 2026 15:43
Contributor

Copilot AI left a comment


Pull request overview

Adds the Wave 2/3 “coding benchmark” frontend to /benchmark (3-tab UI) and introduces a new backend coding-benchmark service + API namespace to start/poll/list/delete runs, including persisted run metadata and job timing enrichment from artifacts.

Changes:

  • Frontend: new /benchmark tabs for Code Models (run config + live console) and Code Results (history + jobs table + Chart.js charts) plus a polling hook and TS types.
  • Backend: new CodingBenchmarkService that launches the existing scheduler script, persists run state, enriches job timing from artifacts, and exposes /api/v1/benchmark/coding/* routes + schemas.
  • Testing/Docs: adds unit/contract tests, registers them in test catalog/groups, and updates agent docs/templates for new-test naming/checklist.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 18 comments.

Show a summary per file
File Description
web-next/app/benchmark/page.tsx Adds tabbed /benchmark UI and wires coding benchmark components/hook.
web-next/hooks/use-coding-benchmark.ts New polling + start/delete/clear hook (and exported helpers).
web-next/components/benchmark/benchmark-coding-configurator.tsx UI to select Ollama models/tasks and run parameters.
web-next/components/benchmark/benchmark-coding-console.tsx Live log viewer + progress bar for coding benchmark runs.
web-next/components/benchmark/benchmark-coding-results.tsx History list + current run jobs table + delete/clear actions.
web-next/components/benchmark/benchmark-coding-charts.tsx Chart.js charts + pure adapters for pass rates/timings.
web-next/lib/types.ts Adds coding benchmark request/response/job/run types for frontend.
web-next/lib/i18n/locales/{en,pl,de}.ts Adds translation keys for new benchmark tabs/coding sections.
web-next/tests/use-coding-benchmark.test.ts Adds unit tests for hook/chart pure helpers.
venom_core/services/benchmark_coding.py Implements coding benchmark run lifecycle, persistence, scheduler launch, artifact enrichment.
venom_core/api/schemas/benchmark_coding.py Pydantic schemas for coding benchmark endpoints.
venom_core/api/routes/benchmark_coding.py FastAPI routes for start/status/list/delete/clear-all coding benchmarks.
venom_core/main.py Initializes and injects CodingBenchmarkService; registers router.
venom_core/bootstrap/router_wiring.py Wires coding benchmark router dependencies.
tests/test_coding_run_service.py Unit tests for CodingBenchmarkService (incl. security guards + artifact timing field).
tests/test_coding_run_routes.py Contract tests for coding benchmark API router.
docs/{AGENTS.md,PL/AGENTS.md} Documents required new-test naming/checklist + i18n rule.
config/testing/test_catalog.json Registers new backend tests in the catalog.
config/pytest-groups/{sonar-new-code,fast,ci-lite}.txt Adds the new backend tests to lanes/groups.
.github/coding-agent-task-template.md Updates template with naming/checklist guidance for new tests.

@mpieniak01
Owner

Review comments from #430 (review)
addressed in
1f8e8b7

@mpieniak01 mpieniak01 merged commit fbe6155 into main Mar 3, 2026
9 checks passed
@mpieniak01 mpieniak01 deleted the copilot/integrate-coding-benchmarks-panel branch March 3, 2026 16:14


Successfully merging this pull request may close these issues.

190B — Integration of coding benchmarks with the web-next panel (/benchmark)
