feat(190B): Coding Benchmark Panel — Wave 2 + Wave 3 (3-tab /benchmark UI with Chart.js)#430
Merged
mpieniak01 merged 10 commits into main on Mar 3, 2026
Conversation
- New schemas: venom_core/api/schemas/benchmark_coding.py
- New service: venom_core/services/benchmark_coding.py
- New routes: venom_core/api/routes/benchmark_coding.py (POST /start, GET /list, GET /{id}/status, DELETE /all, DELETE /{id})
- Wire routes in router_wiring.py and main.py
- Tests: tests/test_benchmark_coding_service.py (24 tests)
- Tests: tests/test_benchmark_coding_routes.py (15 tests)
- Register in test catalog + sync pytest groups
Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
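The five routes above map to a small URL surface. As an illustration, a frontend caller could centralize them in one path helper; this is a sketch (the helper name and shape are assumptions, only the base path and the five endpoints come from the route list above):

```typescript
// Sketch of a frontend path helper for the coding-benchmark routes.
// The object name and function names are hypothetical.
const BASE = "/api/v1/benchmark/coding";

export const codingBenchmarkPaths = {
  start: () => `${BASE}/start`,
  list: (limit?: number) =>
    limit === undefined ? `${BASE}/list` : `${BASE}/list?limit=${limit}`,
  status: (runId: string) => `${BASE}/${encodeURIComponent(runId)}/status`,
  remove: (runId: string) => `${BASE}/${encodeURIComponent(runId)}`,
  removeAll: () => `${BASE}/all`,
};
```

Keeping path construction in one place means the polling hook and the delete/clear actions cannot drift apart on URL spelling.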
Copilot stopped work on behalf of mpieniak01 due to an error (March 3, 2026, 14:14).
- New hook: web-next/hooks/use-coding-benchmark.ts (polls /api/v1/benchmark/coding/*)
- New component: coding-benchmark-configurator.tsx (model/task/loop config)
- New component: coding-benchmark-results.tsx (jobs table + history)
- Update benchmark page with Classic/Coding tab switcher
- Add i18n: benchmark.tabs + benchmark.coding sections in en/pl/de
- Add CodingBenchmark* types to web-next/lib/types.ts

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Integrate coding benchmarks with web-next panel" to "feat(190B): Coding Benchmark Panel — backend service + frontend UI (Wave 1 + Wave 2)" on Mar 3, 2026.
- New components with required names: benchmark-coding-configurator.tsx, benchmark-coding-console.tsx (with progress bar and run context)
- page.tsx: tabs renamed LLM Models / Code Models, imports updated to new component names, CodingBenchmarkResults removed (Wave 3 scope)
- use-coding-benchmark.ts: export resolvePollStatus + buildProgressLog pure helpers for testing
- i18n: tabs.classic → tabs.llm (LLM Models); tabs.coding → Code Models; remove out-of-scope coding.results section in en/pl/de
- tests/use-coding-benchmark.test.ts: 10 tests for hook pure helpers
- tests/test_coding_run_service.py: +3 tests covering lines 155, 217, 222-223 (request_wall_seconds, invalid run_id, path traversal guard) → 100% new-code coverage

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
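resolvePollStatus and buildProgressLog are named in this change as exported pure helpers. The sketch below shows one plausible shape for them; the response and job structures are assumptions for illustration, not the PR's actual types:

```typescript
// Hedged sketch: the real hook polls /api/v1/benchmark/coding/*.
// The status values and job fields below are assumed for illustration.
export type JobStatus = "pending" | "running" | "completed" | "failed";

export interface CodingJob {
  model: string;
  task: string;
  status: JobStatus;
}

export interface StatusResponse {
  status: JobStatus;
  jobs: CodingJob[];
}

// Decide whether the hook should keep polling after this response.
export function resolvePollStatus(res: StatusResponse): "continue" | "stop" {
  return res.status === "completed" || res.status === "failed"
    ? "stop"
    : "continue";
}

// Turn the job list into human-readable progress lines for the console.
export function buildProgressLog(res: StatusResponse): string[] {
  const done = res.jobs.filter((j) => j.status === "completed").length;
  const lines = res.jobs.map((j) => `${j.model} × ${j.task}: ${j.status}`);
  lines.push(`progress: ${done}/${res.jobs.length} jobs`);
  return lines;
}
```

Keeping these as pure functions (no React, no fetch) is what makes the "10 tests for hook pure helpers" practical without rendering components.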
Copilot AI changed the title from "feat(190B): Coding Benchmark Panel — backend service + frontend UI (Wave 1 + Wave 2)" to "feat(190B-wave2): Coding Benchmark Panel — Wave 2 frontend (start/status)" on Mar 3, 2026.
- page.tsx: 3 tabs (LLM Models / Code Models / Code Results) with BarChart3 icon
- benchmark-coding-results.tsx: new component (port from old coding-benchmark-results.tsx); full i18n for all user-facing strings; history + jobs table + delete/clear
- benchmark-coding-charts.tsx: new Chart.js charts; computePassRates + computeTimings exported pure helpers; pass-rate bar chart per model (green/amber by threshold); timing stacked bar chart (warmup/coding/request per model)
- Delete duplicates: coding-benchmark-configurator.tsx, coding-benchmark-results.tsx
- i18n en/pl/de: tabs.llmModels|codeModels|codeResults + coding.results.* + coding.charts.*
- tests: +8 chart adapter tests (computePassRates, computeTimings) → 18 total PASS

Co-authored-by: mpieniak01 <8170413+mpieniak01@users.noreply.github.com>
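As a sketch of what a pure chart adapter like computePassRates could look like: the job shape and the 80% green/amber threshold below are assumptions for illustration; the real adapters live in benchmark-coding-charts.tsx.

```typescript
// Hedged sketch: aggregate per-model pass rates from a flat job list.
// The BenchJob shape is an assumption, not the PR's actual type.
export interface BenchJob {
  model: string;
  passed: boolean;
}

export interface PassRate {
  model: string;
  passRate: number; // 0..1
}

export function computePassRates(jobs: BenchJob[]): PassRate[] {
  const byModel = new Map<string, { passed: number; total: number }>();
  for (const job of jobs) {
    const entry = byModel.get(job.model) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (job.passed) entry.passed += 1;
    byModel.set(job.model, entry);
  }
  return [...byModel.entries()].map(([model, { passed, total }]) => ({
    model,
    passRate: total > 0 ? passed / total : 0,
  }));
}

// Pick a bar colour by threshold, mirroring the green/amber rule above.
// The 0.8 default is an assumed threshold.
export function barColor(passRate: number, threshold = 0.8): string {
  return passRate >= threshold ? "green" : "amber";
}
```

Because the adapter never touches a canvas, the "+8 chart adapter tests" can run in plain Node without a DOM.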
Copilot AI changed the title from "feat(190B-wave2): Coding Benchmark Panel — Wave 2 frontend (start/status)" to "feat(190B): Coding Benchmark Panel — Wave 2 + Wave 3 (3-tab /benchmark UI with Chart.js)" on Mar 3, 2026.
Contributor
Pull request overview
Adds the Wave 2/3 “coding benchmark” frontend to /benchmark (3-tab UI) and introduces a new backend coding-benchmark service + API namespace to start/poll/list/delete runs, including persisted run metadata and job timing enrichment from artifacts.
Changes:
- Frontend: new /benchmark tabs for Code Models (run config + live console) and Code Results (history + jobs table + Chart.js charts), plus a polling hook and TS types.
- Backend: new CodingBenchmarkService that launches the existing scheduler script, persists run state, enriches job timing from artifacts, and exposes /api/v1/benchmark/coding/* routes + schemas.
- Testing/Docs: adds unit/contract tests, registers them in the test catalog/groups, and updates agent docs/templates for new-test naming/checklist.
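The polling hook's core loop, reduced to a framework-free sketch: the interval, retry cap, and terminal statuses below are assumptions; the actual hook is use-coding-benchmark.ts.

```typescript
// Framework-free sketch of a start/poll loop. Terminal states and the
// default interval are assumptions; error handling is omitted for brevity.
export type RunStatus = "pending" | "running" | "completed" | "failed";

const TERMINAL: ReadonlySet<RunStatus> = new Set(["completed", "failed"]);

export async function pollUntilDone(
  fetchStatus: () => Promise<RunStatus>,
  opts: { intervalMs?: number; maxPolls?: number } = {}
): Promise<RunStatus> {
  const { intervalMs = 2000, maxPolls = 1000 } = opts;
  let status: RunStatus = "pending";
  for (let i = 0; i < maxPolls; i++) {
    status = await fetchStatus();
    if (TERMINAL.has(status)) return status; // run finished, stop polling
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return status; // gave up while the run was still pending/running
}
```

Injecting fetchStatus keeps the loop testable with a stub instead of a live /status endpoint.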
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| web-next/app/benchmark/page.tsx | Adds tabbed /benchmark UI and wires coding benchmark components/hook. |
| web-next/hooks/use-coding-benchmark.ts | New polling + start/delete/clear hook (and exported helpers). |
| web-next/components/benchmark/benchmark-coding-configurator.tsx | UI to select Ollama models/tasks and run parameters. |
| web-next/components/benchmark/benchmark-coding-console.tsx | Live log viewer + progress bar for coding benchmark runs. |
| web-next/components/benchmark/benchmark-coding-results.tsx | History list + current run jobs table + delete/clear actions. |
| web-next/components/benchmark/benchmark-coding-charts.tsx | Chart.js charts + pure adapters for pass rates/timings. |
| web-next/lib/types.ts | Adds coding benchmark request/response/job/run types for frontend. |
| web-next/lib/i18n/locales/{en,pl,de}.ts | Adds translation keys for new benchmark tabs/coding sections. |
| web-next/tests/use-coding-benchmark.test.ts | Adds unit tests for hook/chart pure helpers. |
| venom_core/services/benchmark_coding.py | Implements coding benchmark run lifecycle, persistence, scheduler launch, artifact enrichment. |
| venom_core/api/schemas/benchmark_coding.py | Pydantic schemas for coding benchmark endpoints. |
| venom_core/api/routes/benchmark_coding.py | FastAPI routes for start/status/list/delete/clear-all coding benchmarks. |
| venom_core/main.py | Initializes and injects CodingBenchmarkService; registers router. |
| venom_core/bootstrap/router_wiring.py | Wires coding benchmark router dependencies. |
| tests/test_coding_run_service.py | Unit tests for CodingBenchmarkService (incl. security guards + artifact timing field). |
| tests/test_coding_run_routes.py | Contract tests for coding benchmark API router. |
| docs/{AGENTS.md,PL/AGENTS.md} | Documents required new-test naming/checklist + i18n rule. |
| config/testing/test_catalog.json | Registers new backend tests in the catalog. |
| config/pytest-groups/{sonar-new-code,fast,ci-lite}.txt | Adds the new backend tests to lanes/groups. |
| .github/coding-agent-task-template.md | Updates template with naming/checklist guidance for new tests. |
Review thread on web-next/components/benchmark/benchmark-coding-configurator.tsx (outdated; resolved).
Owner
Review remarks from #430 (review)
Adds full coding benchmark frontend on top of the Wave 1 backend.
/benchmark gains two new tabs — Code Models (start/status) and Code Results (history + jobs table + Chart.js charts) — alongside the existing LLM Models tab. Also cleans up the coding-benchmark-* naming duplication left from earlier work.

Summary

Wave 2 + Wave 3 frontend for /benchmark. Introduces a tab-switched UI for the coding benchmark: configurator + live console (Wave 2), and history + jobs table + Chart.js visualizations (Wave 3). Consolidates component naming to the benchmark-coding-* convention and removes duplicate files.

Scope

In scope:
- benchmark-coding-configurator.tsx: Ollama model × task multi-select, loop task, timeout, max_rounds, stop-on-failure
- benchmark-coding-console.tsx: live log stream + progress bar (completed/total jobs)
- benchmark-coding-results.tsx: run history list + jobs table + delete/clear; fully i18n-d
- benchmark-coding-charts.tsx: Chart.js canvas charts (pass-rate bar per model + timing stacked bar warmup/coding/request); exports computePassRates + computeTimings as pure testable helpers
- use-coding-benchmark.ts: polling, start/status, delete/clearAll; exports resolvePollStatus + buildProgressLog for unit tests
- app/benchmark/page.tsx: 3 tabs (llmModels / codeModels / codeResults)
- i18n: benchmark.tabs.{llmModels,codeModels,codeResults} + benchmark.coding.{config,console,results,charts}.* (zero hardcoded user-facing strings)
- tests/use-coding-benchmark.test.ts: 18 unit tests covering hook helpers + chart data adapters
- backend: _run_dir and request_wall_seconds field → 13/13 new lines = 100% coverage

Out of scope:
- … types.ts; regeneration deferred)

Deleted duplicates:
- coding-benchmark-configurator.tsx
- coding-benchmark-results.tsx

Quality Gates (Hard Gate)
- make pr-fast passed (Forbidden Paths Guard, Architecture drift guard, Backend lite (pytest), Frontend lite (lint), OpenAPI Contract (export + TS codegen), SonarCloud Scan, Quick validator (syntax + CI-lite deps))
- make test-catalog-sync + make test-groups-sync executed

Validation Report
Commands run
- node --import tsx --test tests/use-coding-benchmark.test.ts
- node --import tsx --test tests/*.test.ts
- npx tsc --noEmit
- make test-groups-check
- .venv/bin/python -m pytest tests/test_coding_run_service.py tests/test_coding_run_routes.py --cov=venom_core --cov-report=xml:test-results/sonar/python-coverage.xml -q
- make check-new-code-coverage-diagnostics
- make pr-fast

Results (pass/fail)
- tests/use-coding-benchmark.test.ts
- tests/*.test.ts (web-next)
- tsc --noEmit (changed files)
- make test-groups-check
- make check-new-code-coverage-diagnostics
- make pr-fast

Changed-lines coverage
13/13 = 100.0% (threshold 80%)
New backend tests cover:
the _run_dir invalid-UUID guard (line 217), the path-traversal relative_to check (lines 222–223), and the request_wall_seconds artifact field (line 155).

Risks / Limitations / Skips
- react-chartjs-2: dynamic import("chart.js/auto") inside useEffect — avoids SSR, consistent with the existing chart.js dep only.
- Charts read codingRun.jobs from polling state. History entries don't carry full job arrays; full per-run chart drill-down is a future enhancement.
- first_sieve_task hardcoded to "" in configurator — wiring exposed in type but not in UI; Wave 4 item.
- Types added manually to types.ts and will drift until make openapi-export && make openapi-codegen-types runs (CI gate catches this).

Evidence
Warning
Firewall rules blocked the agent from connecting to huggingface.co and unreachable.invalid (DNS block) during the run; the attempts were made from the repository's .venv Python interpreter.
Original prompt
This section details the original issue you should resolve.
<issue_title>190B — Integrate coding benchmarks with the web-next panel (/benchmark)</issue_title>
<issue_description>
## 190B — Integrate coding benchmarks with the web-next panel (/benchmark)

Goal of PR 190B
Enable users to run coding benchmarks (sanity/simple/complex/feedback-loop) on their own from the UI, without using the CLI.
Runtime scope (explicit)
- … (190B) concerns Ollama coding benchmarks.
- … (/api/tags, ollama list).
- … (warmup/coding/request) are measured for /api/generate calls in Ollama.

Current state (as-is)
- /benchmark supports LLM model benchmarking (latency/speed/VRAM).
- … (/api/v1/benchmark/*).
- … (scripts/ollama_bench/*), but they are not wired into the panel.

Target scope (to-be) — 3 tabs
- LLM Models (existing, no regression).
- Code Models (new): (python_sanity, python_simple, python_complex, python_complex_bugfix), timeout, model-timeout-overrides, max-rounds, options (temperature/top_p), first-sieve-task, stop-on-failure.
- Code Results (new): warmup_seconds, coding_seconds, request_wall_seconds, total_seconds.

Backend/API to add
- POST /api/v1/benchmark/coding/start
- GET /api/v1/benchmark/coding/{run_id}/status
- GET /api/v1/benchmark/coding/list?limit=...
- DELETE /api/v1/benchmark/coding/{run_id}
- DELETE /api/v1/benchmark/coding/all
- start launches the scheduler/runner and returns run_id.
- status returns: … (pending/running/completed/failed), … scripts/ollama_bench/scheduler.py.

Frontend (web-next) — implementation tasks
- Extend app/benchmark/page.tsx with tabs + separate sections per tab.
- Hook use-coding-benchmark, analogous to use-benchmark.
- Components: benchmark-coding-configurator.tsx, benchmark-coding-console.tsx, benchmark-coding-results.tsx, benchmark-coding-charts.tsx.
- Types in web-next/lib/types.ts.
- pl/en/de translations for the new labels and messages.

Charts (JQ/JS) — technology decision
- chart.js (web-next/package.json) is already present and in use (components/cockpit/token-chart.tsx).
- Options: … chart.js (without adding a new library), … react-chartjs-2 (a wrapper around the current Chart.js).
- jq (CLI) can be used for pre-aggregating JSON artifacts on the dev/ops side, but it is not a library for rendering charts in the UI.

KPIs and presentation on the Code Results tab
- pass_rate per model.

DoD for PR 190B
- /benchmark has 3 working tabs (LLM Models, Code Models, Code Results).
- … pl/en/de.

Wave-based delivery plan (smaller scope per agent session)
- … /api/v1/benchmark/coding/*, main.py and router_wiring.py, service/routes).
- … /benchmark, Code Models (running tests)
- … Code Models, use-coding-benchmark, /benchmark) #427
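Following the decision to stay on plain chart.js, the pass_rate-per-model KPI could be fed to Chart.js via a config builder like this sketch (the builder name, the colours, and the 80% threshold are assumptions; only the bar-chart config shape follows Chart.js conventions):

```typescript
// Hedged sketch: turn per-model pass rates into a Chart.js bar config.
// Colours and the 0.8 threshold are assumed, mirroring the green/amber rule.
export function toPassRateChartConfig(
  rates: { model: string; passRate: number }[] // passRate in 0..1
) {
  return {
    type: "bar" as const,
    data: {
      labels: rates.map((r) => r.model),
      datasets: [
        {
          label: "pass rate",
          // Render as percentages on the y axis.
          data: rates.map((r) => Math.round(r.passRate * 100)),
          // Green at/above the threshold, amber below.
          backgroundColor: rates.map((r) =>
            r.passRate >= 0.8 ? "#16a34a" : "#f59e0b"
          ),
        },
      ],
    },
    options: { scales: { y: { min: 0, max: 100 } } },
  };
}
```

Usage would be along the lines of `new Chart(canvasCtx, toPassRateChartConfig(rates))` after the dynamic import resolves; building the config as plain data keeps it unit-testable without a canvas.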
/api/v1/benchmark/coding/*,main.pyirouter_wiring.py,service/routes)./benchmark,Code Models(uruchamianie testów)Code Models,use-coding-benchmark,/benchmark) #427✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.