
feat(190B): Coding Benchmark Panel — Wave 2 + Wave 3 (3-tab /benchmark UI with Chart.js)#430

Merged
mpieniak01 merged 10 commits into main from copilot/integrate-coding-benchmarks-panel
Mar 3, 2026
Conversation

Contributor

Copilot AI commented Mar 3, 2026

Adds full coding benchmark frontend on top of the Wave 1 backend. /benchmark gains two new tabs — Code Models (start/status) and Code Results (history + jobs table + Chart.js charts) — alongside the existing LLM Models tab. Also cleans up the coding-benchmark-* naming duplication left from earlier work.

Summary

Wave 2 + Wave 3 frontend for /benchmark. Introduces tab-switched UI for coding benchmark: configurator + live console (Wave 2), and history + jobs table + Chart.js visualizations (Wave 3). Consolidates component naming to benchmark-coding-* convention and removes duplicate files.

Scope

In scope:

  • benchmark-coding-configurator.tsx — Ollama model × task multi-select, loop task, timeout, max_rounds, stop-on-failure
  • benchmark-coding-console.tsx — live log stream + progress bar (completed/total jobs)
  • benchmark-coding-results.tsx — run history list + jobs table + delete/clear; fully internationalized
  • benchmark-coding-charts.tsx — Chart.js canvas charts: pass-rate bar (per model) + timing stacked bar (warmup/coding/request); exports computePassRates + computeTimings as pure testable helpers
  • use-coding-benchmark.ts — polling, start/status, delete/clearAll; exports resolvePollStatus + buildProgressLog for unit tests
  • app/benchmark/page.tsx — 3 tabs: llmModels / codeModels / codeResults
  • i18n en/pl/de: benchmark.tabs.{llmModels,codeModels,codeResults} + benchmark.coding.{config,console,results,charts}.* (zero hardcoded user-facing strings)
  • tests/use-coding-benchmark.test.ts — 18 unit tests covering hook helpers + chart data adapters
  • Backend tests: +3 tests for security-hardened _run_dir and request_wall_seconds field → 13/13 new lines = 100% coverage
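The chart helpers listed above are described as pure and testable. A minimal sketch of what a computePassRates-style adapter might look like (the real signature lives in benchmark-coding-charts.tsx; the CodingJob shape and status strings here are assumptions):

```typescript
// Hypothetical sketch of the pure pass-rate adapter; field names and
// status values are illustrative, not the component's actual contract.
type CodingJob = { model: string; status: "pass" | "fail" | string };

function computePassRates(jobs: CodingJob[]): Record<string, number> {
  const totals: Record<string, { pass: number; all: number }> = {};
  for (const job of jobs) {
    const bucket = (totals[job.model] ??= { pass: 0, all: 0 });
    bucket.all += 1;
    if (job.status === "pass") bucket.pass += 1;
  }
  const rates: Record<string, number> = {};
  for (const [model, { pass, all }] of Object.entries(totals)) {
    rates[model] = all > 0 ? pass / all : 0;
  }
  return rates;
}
```

Keeping the aggregation pure like this is what makes the 8 chart-adapter unit tests possible without rendering a canvas.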

Out of scope:

  • Backend endpoint/schema changes
  • OpenAPI codegen update (hand-written types in types.ts; regeneration deferred)

Deleted duplicates:

  • coding-benchmark-configurator.tsx
  • coding-benchmark-results.tsx

Quality Gates (Hard Gate)

  • make pr-fast passed
  • CI required checks passed (Forbidden Paths Guard, Architecture drift guard, Backend lite (pytest), Frontend lite (lint), OpenAPI Contract (export + TS codegen), SonarCloud Scan, Quick validator (syntax + CI-lite deps))
  • For new/renamed tests: make test-catalog-sync + make test-groups-sync executed

Validation Report

Commands run

node --import tsx --test tests/use-coding-benchmark.test.ts
node --import tsx --test tests/*.test.ts
npx tsc --noEmit
make test-groups-check
.venv/bin/python -m pytest tests/test_coding_run_service.py tests/test_coding_run_routes.py \
  --cov=venom_core --cov-report=xml:test-results/sonar/python-coverage.xml -q
make check-new-code-coverage-diagnostics
make pr-fast

Results (pass/fail)

Command | Result
tests/use-coding-benchmark.test.ts | PASS — 18/18
tests/*.test.ts (web-next) | PASS — 187/187
tsc --noEmit (changed files) | PASS — 0 new errors
make test-groups-check | PASS
backend tests (42 total) | PASS — 42/42
make check-new-code-coverage-diagnostics | PASS
make pr-fast | PASS ✅

Changed-lines coverage

13/13 = 100.0% (threshold 80%)

New backend tests cover: _run_dir invalid UUID guard (line 217), path-traversal relative_to check (lines 222–223), and request_wall_seconds artifact field (line 155).
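The guards those tests cover (invalid run_id rejection, path-traversal check) are implemented in Python in venom_core/services/benchmark_coding.py; the same idea can be sketched in TypeScript (names here are illustrative, not the service's API):

```typescript
import { resolve, sep } from "node:path";

// Illustrative _run_dir-style guard: only UUID run ids, and the resolved
// directory must stay inside the base directory (no path traversal).
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function runDir(baseDir: string, runId: string): string {
  if (!UUID_RE.test(runId)) {
    throw new Error(`invalid run_id: ${runId}`);
  }
  const base = resolve(baseDir);
  const candidate = resolve(base, runId);
  // Defense in depth: the UUID check already blocks "..", but verify the
  // resolved path anyway (mirrors the Python relative_to check).
  if (candidate !== base && !candidate.startsWith(base + sep)) {
    throw new Error("path traversal detected");
  }
  return candidate;
}
```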

Risks / Limitations / Skips

  • Chart.js via canvas, no react-chartjs-2: dynamic import("chart.js/auto") inside useEffect avoids SSR issues and keeps the existing chart.js package as the only charting dependency.
  • Code Results charts show current run only: Charts render codingRun.jobs from polling state. History entries don't carry full job arrays; full per-run chart drill-down is a future enhancement.
  • first_sieve_task hardcoded to "" in configurator — wiring exposed in type but not in UI; Wave 4 item.
  • OpenAPI types not regenerated: coding schemas are hand-written in types.ts and will drift until make openapi-export && make openapi-codegen-types runs (CI gate catches this).
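The polling state mentioned above is driven by the helpers use-coding-benchmark.ts exports for testing. A sketch of what resolvePollStatus and buildProgressLog might look like (the exported names come from this PR; the signatures and output format are assumptions):

```typescript
// Hypothetical shapes of the hook's exported pure helpers.
type RunState = "pending" | "running" | "completed" | "failed";

function resolvePollStatus(state: RunState): { keepPolling: boolean; done: boolean } {
  switch (state) {
    case "pending":
    case "running":
      return { keepPolling: true, done: false };
    case "completed":
    case "failed":
      return { keepPolling: false, done: true };
  }
}

function buildProgressLog(completed: number, total: number): string {
  const pct = total > 0 ? Math.round((100 * completed) / total) : 0;
  return `progress: ${completed}/${total} jobs (${pct}%)`;
}
```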

Evidence

# web-next unit tests
# tests 187 / suites 52 / pass 187 / fail 0

# backend new-code coverage
Changed lines coverage: 13/13 = 100.0% (required: 80.0%)
PASS: changed-lines coverage meets the threshold.

# hard gate
✅ PR fast check passed.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • huggingface.co
    • Triggering command: /home/REDACTED/work/Venom/Venom/.venv/bin/python /home/REDACTED/work/Venom/Venom/.venv/bin/python -u -c import sys;exec(eval(sys.stdin.readline())) (dns block)
  • unreachable.invalid
    • Triggering command: /home/REDACTED/work/Venom/Venom/.venv/bin/python /home/REDACTED/work/Venom/Venom/.venv/bin/python -u -c import sys;exec(eval(sys.stdin.readline())) (dns block)


Original prompt

This section details the original issue you should resolve

<issue_title>190B — Integration of coding benchmarks with the web-next panel (/benchmark)</issue_title>
<issue_description>## 190B — Integration of coding benchmarks with the web-next panel (/benchmark)

Goal of PR 190B

Let the user run coding benchmarks (sanity/simple/complex/feedback-loop) from the UI, without the CLI.

Runtime scope (explicit)

  1. This stage (190B) covers Ollama coding benchmarks.
  2. Tested models come from the local Ollama stack (/api/tags, ollama list).
  3. Parameters and metrics (warmup/coding/request) are computed for Ollama /api/generate calls.
Current state (as-is)

  1. The /benchmark screen supports LLM model benchmarking (latency/speed/VRAM).
  2. The backend has benchmark endpoints for the current flow (/api/v1/benchmark/*).
  3. Coding benchmark scripts exist (scripts/ollama_bench/*) but are not wired into the panel.

Target scope (to-be): 3 tabs

  1. Tab 1: LLM Models (existing, no regressions).
  2. Tab 2: Code Models (new):
    • model selection (multi-select),
    • task selection (python_sanity, python_simple, python_complex, python_complex_bugfix),
    • run parameters: timeout, model-timeout-overrides, max-rounds, options (temperature/top_p), first-sieve-task, stop-on-failure,
    • run mode: single task or scheduler.
  3. Tab 3: Code Results (new):
    • results table (PASS/FAIL, error, task, model),
    • timing metrics: warmup_seconds, coding_seconds, request_wall_seconds, total_seconds,
    • comparison charts (ranking, timings, pass rate),
    • run history with the ability to refresh/delete an entry.

Backend/API to add

  1. New API namespace, e.g.:
    • POST /api/v1/benchmark/coding/start
    • GET /api/v1/benchmark/coding/{run_id}/status
    • GET /api/v1/benchmark/coding/list?limit=...
    • DELETE /api/v1/benchmark/coding/{run_id}
    • DELETE /api/v1/benchmark/coding/all
  2. The start endpoint launches the scheduler/runner and returns a run_id.
  3. The status endpoint returns:
    • state (pending/running/completed/failed),
    • job progress,
    • partial and final results,
    • artifact paths and timing metrics.
  4. The payload contract is based on the existing parameters of scripts/ollama_bench/scheduler.py.
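A frontend client for these endpoints could be sketched as below. The paths come from the issue; the client shape, the FetchLike abstraction (injected so the functions are testable without a network), and the response fields are assumptions:

```typescript
// Hypothetical client sketch for the proposed coding-benchmark endpoints.
type FetchLike = (
  url: string,
  init?: { method?: string; body?: string; headers?: Record<string, string> },
) => Promise<{ json(): Promise<unknown> }>;

const BASE = "/api/v1/benchmark/coding";

// Start a run; assumes the response carries a run_id field.
async function startRun(fetchFn: FetchLike, payload: object): Promise<string> {
  const res = await fetchFn(`${BASE}/start`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  const data = (await res.json()) as { run_id: string };
  return data.run_id;
}

// Fetch the raw status payload for a run.
async function getStatus(fetchFn: FetchLike, runId: string): Promise<unknown> {
  const res = await fetchFn(`${BASE}/${runId}/status`);
  return res.json();
}
```

Injecting fetch keeps the helpers unit-testable with a stub, the same testability goal the PR's exported pure helpers serve.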

Frontend (web-next): implementation tasks

  1. Extend app/benchmark/page.tsx with tabs + separate sections per tab.
  2. Add a use-coding-benchmark hook analogous to use-benchmark.
  3. Add components:
    • benchmark-coding-configurator.tsx
    • benchmark-coding-console.tsx
    • benchmark-coding-results.tsx
    • benchmark-coding-charts.tsx
  4. Add types in web-next/lib/types.ts.
  5. Add pl/en/de translations for the new labels and messages.
  6. Add UI/hook tests for the new flow.

Charts (jq/JS): technology decision

  1. chart.js is already in the project (web-next/package.json) and is already used (components/cockpit/token-chart.tsx).
  2. Recommendation for PR 190B:
    • use the existing chart.js (no new library),
    • prepare a lightweight adapter layer for benchmark data.
  3. If faster component ergonomics are needed:
    • optionally add react-chartjs-2 (a wrapper around the current Chart.js).
  4. jq (CLI) can pre-aggregate JSON artifacts on the dev/ops side, but it is not a library for rendering charts in the UI.
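The "lightweight adapter layer" recommended above can be a pure function that turns aggregated rates into a Chart.js-style data object, with no rendering involved. A sketch (the object shape follows Chart.js bar-chart conventions; the 0.8 threshold and colors are assumptions):

```typescript
// Hypothetical adapter: pass rates -> Chart.js bar-chart `data` object.
function toPassRateChartData(rates: Record<string, number>) {
  const labels = Object.keys(rates).sort();
  return {
    labels,
    datasets: [
      {
        label: "pass rate",
        data: labels.map((m) => rates[m]),
        // Assumed threshold: green at/above 0.8, amber below.
        backgroundColor: labels.map((m) =>
          rates[m] >= 0.8 ? "#16a34a" : "#d97706",
        ),
      },
    ],
  };
}
```

Because the adapter only builds data, the Chart.js rendering call stays a thin shell and the interesting logic stays unit-testable.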

KPIs and presentation on the Code Results tab

  1. pass_rate per model.
  2. Median times:
    • warmup,
    • coding,
    • request.
  3. Failure counts by category:
    • timeout,
    • empty response,
    • invalid format,
    • test/lint failure.
  4. Charts:
    • bar chart: pass_rate per model,
    • stacked bar: warmup vs coding,
    • table + sort + filters (task/model/status/date).
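The median-time KPI above can be computed with a small pure helper. A sketch (the warmup_seconds field name mirrors the metrics listed in this issue; the helper names are illustrative):

```typescript
// Standard median: average of the two middle values for even-length input.
function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

type TimedJob = { model: string; warmup_seconds: number };

// Group jobs by model and take the median warmup time of each group.
function medianWarmupPerModel(jobs: TimedJob[]): Record<string, number> {
  const byModel: Record<string, number[]> = {};
  for (const j of jobs) (byModel[j.model] ??= []).push(j.warmup_seconds);
  return Object.fromEntries(
    Object.entries(byModel).map(([m, v]) => [m, median(v)]),
  );
}
```

The same grouping pattern extends to coding_seconds and request_wall_seconds.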

DoD for PR 190B

  1. /benchmark has 3 working tabs (LLM Models, Code Models, Code Results).
  2. The user can start a coding benchmark from the UI and track its status.
  3. Results and timing metrics are presented in tables and charts.
  4. Run history is available and refreshable.
  5. No regression of the existing LLM benchmark tab.
  6. i18n parity: all new strings in pl/en/de.
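The i18n-parity requirement is mechanically checkable: flatten each locale object to its dotted key set and diff against a reference locale. A sketch (the locale data in the test is toy data, not the real en/pl/de files):

```typescript
// Nested locale objects: leaves are strings, branches are nested sections.
type Locale = { [k: string]: string | Locale };

// Flatten to dotted keys, e.g. { tabs: { llmModels: "..." } } -> ["tabs.llmModels"].
function flatKeys(obj: Locale, prefix = ""): string[] {
  return Object.entries(obj).flatMap(([k, v]) =>
    typeof v === "string" ? [prefix + k] : flatKeys(v, `${prefix}${k}.`),
  );
}

// Keys present in the reference locale but missing from the candidate.
function missingKeys(reference: Locale, candidate: Locale): string[] {
  const have = new Set(flatKeys(candidate));
  return flatKeys(reference).filter((k) => !have.has(k));
}
```

Running missingKeys(en, pl) and missingKeys(en, de) in a unit test would turn the parity rule into a hard gate.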

Delivery plan in waves (smaller scope per agent session)

  1. Wave 1: backend API + data contracts
    • Scope:
      • /api/v1/benchmark/coding/* endpoints,
      • a service for launching benchmarks and reporting statuses,
      • request/response schemas,
      • wiring in main.py and router_wiring.py,
      • backend tests (service/routes).
    • Out of scope:
      • changes to the /benchmark tab UI,
      • frontend charts.
    • Wave DoD:
      • endpoints work locally,
      • backend tests green,
      • no regression of the existing benchmark API.
  2. Wave 2: frontend Code Models (running tests)
    • Scope:
      • Code Models tab,
      • use-coding-benchmark,
      • configurator + k...


- New schemas: venom_core/api/schemas/benchmark_coding.py
- New service: venom_core/services/benchmark_coding.py
- New routes: venom_core/api/routes/benchmark_coding.py (POST /start, GET /list, GET /{id}/status, DELETE /all, DELETE /{id})
- Wire routes in router_wiring.py and main.py
- Tests: tests/test_benchmark_coding_service.py (24 tests)
- Tests: tests/test_benchmark_coding_routes.py (15 tests)
- Register in test catalog + sync pytest groups

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
Copilot stopped work on behalf of mpieniak01 due to an error March 3, 2026 14:14
mpieniak01 and others added 2 commits March 3, 2026 15:23
- New hook: web-next/hooks/use-coding-benchmark.ts (polls /api/v1/benchmark/coding/*)
- New component: coding-benchmark-configurator.tsx (model/task/loop config)
- New component: coding-benchmark-results.tsx (jobs table + history)
- Update benchmark page with Classic/Coding tab switcher
- Add i18n: benchmark.tabs + benchmark.coding sections in en/pl/de
- Add CodingBenchmark* types to web-next/lib/types.ts

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Integrate coding benchmarks with web-next panel" to "feat(190B): Coding Benchmark Panel — backend service + frontend UI (Wave 1 + Wave 2)" Mar 3, 2026
- New components with required names: benchmark-coding-configurator.tsx,
  benchmark-coding-console.tsx (with progress bar and run context)
- page.tsx: tabs renamed LLM Models / Code Models, imports updated to
  new component names, CodingBenchmarkResults removed (Wave 3 scope)
- use-coding-benchmark.ts: export resolvePollStatus + buildProgressLog
  pure helpers for testing
- i18n: tabs.classic → tabs.llm (LLM Models); tabs.coding → Code Models;
  remove out-of-scope coding.results section in en/pl/de
- tests/use-coding-benchmark.test.ts: 10 tests for hook pure helpers
- tests/test_coding_run_service.py: +3 tests covering lines 155,217,222-223
  (request_wall_seconds, invalid run_id, path traversal guard) → 100% new-code coverage

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
Copilot AI changed the title from "feat(190B): Coding Benchmark Panel — backend service + frontend UI (Wave 1 + Wave 2)" to "feat(190B-wave2): Coding Benchmark Panel — Wave 2 frontend (start/status)" Mar 3, 2026
- page.tsx: 3 tabs (LLM Models / Code Models / Code Results) with BarChart3 icon
- benchmark-coding-results.tsx: new component (port from old coding-benchmark-results.tsx)
  - full i18n for all user-facing strings
  - history + jobs table + delete/clear
- benchmark-coding-charts.tsx: new Chart.js charts
  - computePassRates + computeTimings exported pure helpers
  - pass-rate bar chart per model (green/amber by threshold)
  - timing stacked bar chart (warmup/coding/request per model)
- Delete duplicate: coding-benchmark-configurator.tsx, coding-benchmark-results.tsx
- i18n en/pl/de: tabs.llmModels|codeModels|codeResults + coding.results.* + coding.charts.*
- tests: +8 chart adapter tests (computePassRates, computeTimings) → 18 total PASS

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
Copilot AI changed the title from "feat(190B-wave2): Coding Benchmark Panel — Wave 2 frontend (start/status)" to "feat(190B): Coding Benchmark Panel — Wave 2 + Wave 3 (3-tab /benchmark UI with Chart.js)" Mar 3, 2026
@mpieniak01 mpieniak01 marked this pull request as ready for review March 3, 2026 15:43
Copilot AI review requested due to automatic review settings March 3, 2026 15:43
Contributor

Copilot AI left a comment


Pull request overview

Adds the Wave 2/3 “coding benchmark” frontend to /benchmark (3-tab UI) and introduces a new backend coding-benchmark service + API namespace to start/poll/list/delete runs, including persisted run metadata and job timing enrichment from artifacts.

Changes:

  • Frontend: new /benchmark tabs for Code Models (run config + live console) and Code Results (history + jobs table + Chart.js charts) plus a polling hook and TS types.
  • Backend: new CodingBenchmarkService that launches the existing scheduler script, persists run state, enriches job timing from artifacts, and exposes /api/v1/benchmark/coding/* routes + schemas.
  • Testing/Docs: adds unit/contract tests, registers them in test catalog/groups, and updates agent docs/templates for new-test naming/checklist.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 18 comments.

Show a summary per file
File Description
web-next/app/benchmark/page.tsx Adds tabbed /benchmark UI and wires coding benchmark components/hook.
web-next/hooks/use-coding-benchmark.ts New polling + start/delete/clear hook (and exported helpers).
web-next/components/benchmark/benchmark-coding-configurator.tsx UI to select Ollama models/tasks and run parameters.
web-next/components/benchmark/benchmark-coding-console.tsx Live log viewer + progress bar for coding benchmark runs.
web-next/components/benchmark/benchmark-coding-results.tsx History list + current run jobs table + delete/clear actions.
web-next/components/benchmark/benchmark-coding-charts.tsx Chart.js charts + pure adapters for pass rates/timings.
web-next/lib/types.ts Adds coding benchmark request/response/job/run types for frontend.
web-next/lib/i18n/locales/{en,pl,de}.ts Adds translation keys for new benchmark tabs/coding sections.
web-next/tests/use-coding-benchmark.test.ts Adds unit tests for hook/chart pure helpers.
venom_core/services/benchmark_coding.py Implements coding benchmark run lifecycle, persistence, scheduler launch, artifact enrichment.
venom_core/api/schemas/benchmark_coding.py Pydantic schemas for coding benchmark endpoints.
venom_core/api/routes/benchmark_coding.py FastAPI routes for start/status/list/delete/clear-all coding benchmarks.
venom_core/main.py Initializes and injects CodingBenchmarkService; registers router.
venom_core/bootstrap/router_wiring.py Wires coding benchmark router dependencies.
tests/test_coding_run_service.py Unit tests for CodingBenchmarkService (incl. security guards + artifact timing field).
tests/test_coding_run_routes.py Contract tests for coding benchmark API router.
docs/{AGENTS.md,PL/AGENTS.md} Documents required new-test naming/checklist + i18n rule.
config/testing/test_catalog.json Registers new backend tests in the catalog.
config/pytest-groups/{sonar-new-code,fast,ci-lite}.txt Adds the new backend tests to lanes/groups.
.github/coding-agent-task-template.md Updates template with naming/checklist guidance for new tests.

@mpieniak01
Owner

Review comments from #430 (review)
addressed in
1f8e8b7

@mpieniak01 mpieniak01 merged commit fbe6155 into main Mar 3, 2026
9 checks passed
@mpieniak01 mpieniak01 deleted the copilot/integrate-coding-benchmarks-panel branch March 3, 2026 16:14


Successfully merging this pull request may close these issues.

190B — Integration of coding benchmarks with the web-next panel (/benchmark)
