Conversation
- New ab_testing.py: ABVariant, ABPool with config-driven champion/challenger
- DB schema: variant columns on ab_comparisons, new ab_variant_metrics table
- Server: /api/ab/pool, /api/ab/compare (NDJSON streaming), /api/ab/metrics
- Metrics: auto-increment win/loss/tie on preference submission
- Client: pool-aware AB flow, interleaved NDJSON parsing, pool banner UI
- Docs: configuration guide and API reference for new endpoints
- Backward compatible: legacy manual provider-B mode still works when no pool
…move legacy provider-B
- base-config.yaml: render ab_testing config block in Jinja template
- ab_testing.py: parse nested pool.champion/pool.variants config structure, add fallback to config['services']['ab_testing']
- app.py: fix NOT NULL violations (link/context cols use empty string), store user message before AB responses, add ConversationService import for variant metrics, proper FK handling for comparison records
- chat.js: fix request key (message -> last_message), fix history format (use slice(-1) for nested array), fix timestamp units
langchain-ollama <1.1 drops the thinking/reasoning payload from models like qwen3 that use a thinking phase, producing hundreds of empty AIMessageChunks before the actual content arrives. Detect these empty chunks and emit a thinking_start event so the UI Agent Activity indicator stays alive during the thinking phase instead of showing a dead pause.
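A minimal sketch of this detection, assuming a stream of chunks whose `content` may be empty (illustrative names only, not the actual base_react.py implementation):

```python
def annotate_thinking(chunks):
    """Yield (event, payload) pairs, inserting synthetic thinking events.

    Emits one 'thinking_start' when the first empty chunk arrives and a
    'thinking_end' once real content resumes, so the UI activity indicator
    stays alive during the model's thinking phase.
    """
    thinking = False
    for chunk in chunks:
        content = getattr(chunk, "content", chunk)
        if not content:  # langchain-ollama <1.1 drops the reasoning payload
            if not thinking:
                thinking = True
                yield ("thinking_start", None)
            continue  # swallow the empty chunk itself
        if thinking:
            thinking = False
            yield ("thinking_end", None)
        yield ("token", content)

events = list(annotate_thinking(["", "", "", "Hello", " world"]))
```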
_stream_arm now emits properly structured events with tool_call_id, tool_name, tool_args at top level (matching what renderToolStart expects) instead of nesting everything in metadata.

Also:
- Add thinking_start/thinking_end handling to JS A/B event loop
- Add step-timeline container to A/B trace HTML so tool rendering works
- Extract tool calls from both memory (tool_inputs_by_id) and AIMessage fallback
- Extract tool_output content from ToolMessage directly
Both Chat.stream() and _stream_arm() (A/B testing) now use a single PipelineEventFormatter class for converting PipelineOutput into structured JSON events. This eliminates ~200 lines of duplicated tool-extraction logic that had already diverged between the two paths.

Key changes:
- New event_formatter.py with PipelineEventFormatter class
- Deferred tool_start emission (emits with output for stable ordering)
- Progressive merging from tool_calls, additional_kwargs, tool_call_chunks
- Stateful tracking of emitted/pending tool IDs
- app.py: Chat.stream() inner loop reduced from ~260 to ~50 lines
- app.py: _stream_arm() reduced from ~140 to ~20 lines
- chat.js: extracted _renderStreamEvent() shared by both handlers
- Removed dead code block (unreachable backfill after return)

Deployed to submit76; verified that regular streaming (with tools) and A/B comparison streaming both work identically.
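One way to read the "deferred tool_start" idea: record tool calls as they stream in, and only emit the `tool_start` event together with the tool's output. A toy sketch (hypothetical class and field names, not the real PipelineEventFormatter):

```python
class DeferredToolEvents:
    """Toy model of deferred tool_start emission."""

    def __init__(self):
        self.pending = {}     # tool_call_id -> {"tool_name": ..., "tool_args": ...}
        self.emitted = set()  # tool_call_ids already sent to the client

    def on_tool_call(self, call_id, name, args):
        # Record the call but do not emit yet: the output may arrive much
        # later, and emitting both together keeps event ordering stable.
        self.pending[call_id] = {"tool_name": name, "tool_args": args}

    def on_tool_output(self, call_id, output):
        info = self.pending.pop(call_id, {"tool_name": "unknown", "tool_args": {}})
        self.emitted.add(call_id)
        return [
            {"type": "tool_start", "tool_call_id": call_id, **info},
            {"type": "tool_output", "tool_call_id": call_id, "output": output},
        ]

fmt = DeferredToolEvents()
fmt.on_tool_call("c1", "search", {"q": "hi"})
events = fmt.on_tool_output("c1", "3 results")
```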
HIGH severity fixes:
- Extract _error_event() helper (3 sites)
- Extract _ab_comparison_from_row() helper (3 sites)
- Extract _trace_from_row() helper (3 sites)
- Consolidate agent_class resolution to _get_agent_class_from_cfg() (6 sites)

MEDIUM severity fixes:
- Extract _ndjson_response() on FlaskAppWrapper (2 sites)
- Merge _delete_git_repo/_delete_jira_project into _delete_source_documents()
- Shared _readNDJSON() async generator in chat.js (2 sites, fixes buffer drain bug)
- Merge like/dislike into _toggle_reaction() + _with_feedback_lock()
- Dedup format_links score logic via _format_score_str()

Net: -152 lines. All endpoints verified on submit76.
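The buffer-drain bug mentioned for `_readNDJSON` is easy to reproduce: if the stream's last event lacks a trailing newline, a naive reader never yields it. A Python sketch of the fixed pattern (the real helper is an async generator in chat.js; this is just the same logic in Python):

```python
import json

def read_ndjson(chunks):
    """Yield one parsed object per NDJSON line, across arbitrary chunk splits."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    # Drain: a final event without a trailing newline must still be emitted.
    if buffer.strip():
        yield json.loads(buffer)

events = list(read_ndjson(['{"a": 1}\n{"b"', ': 2}\n{"c": 3}']))
```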
…ecution
- Add keep_alive='24h' to ChatOllama model creation so Ollama retains models in VRAM between requests (default was 5 min eviction)
- Add _prewarm_ab_models() to pre-load all AB variant models on startup via background threads calling Ollama /api/generate
- Add timing instrumentation to _stream_arm (thread start, vectorstore ready, first event, and completion timestamps)
- Both arms now start within 0.1-0.3s of each other (was ~19s+ gap due to cold model loading and sequential Ollama inference)
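A sketch of the prewarm approach: Ollama loads a model into memory when `/api/generate` is called with an empty prompt, and `keep_alive` controls how long it stays resident. The helper names and base URL below are placeholders, not the real _prewarm_ab_models():

```python
import json
import threading
import urllib.request

def prewarm_payload(model: str, keep_alive: str = "24h") -> bytes:
    # An empty prompt makes Ollama load the model without generating text.
    return json.dumps({"model": model, "prompt": "", "keep_alive": keep_alive}).encode()

def prewarm_models(models, base_url="http://localhost:11434"):
    """Start one background thread per model so cold loads happen in parallel."""
    def _load(model):
        req = urllib.request.Request(
            f"{base_url}/api/generate",
            data=prewarm_payload(model),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()  # blocks until the model is resident

    threads = [threading.Thread(target=_load, args=(m,), daemon=True) for m in models]
    for t in threads:
        t.start()
    return threads  # caller may join() or let them warm up in the background
```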
…-param signature), unused delete/list methods, dead /api/ab/create endpoint
…or non-admins
- Add _get_request_client_id / _is_admin_request helpers to FlaskAppWrapper
- ab_get_pool: returns is_admin flag, only reveals pool details to admins
- ab_compare_stream: 403 for non-admins
- ab_get_metrics: 403 for non-admins
- ab_submit_preference / ab_get_pending: remain open (voting is ok)
- Frontend: pass client_id to getABPool/getABMetrics
- Frontend: hide entire A/B settings section for non-admins
- Template: wrap A/B toggle in #ab-settings-section, hidden by default
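The gating pattern fits in a few lines (stand-ins for the Flask request plumbing; `ADMINS` is an assumption, and the real check is `_is_admin_request` on FlaskAppWrapper):

```python
ADMINS = {"admin-client"}  # assumption: admin client ids are known server-side

def is_admin_request(client_id: str) -> bool:
    return client_id in ADMINS

def ab_get_pool(client_id: str, pool: dict) -> dict:
    """Everyone learns whether A/B is enabled; only admins see pool details."""
    admin = is_admin_request(client_id)
    resp = {"enabled": bool(pool), "is_admin": admin}
    if admin:
        resp["pool"] = pool  # details only revealed to admins
    return resp
```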
- New clone icon button in agent dropdown (copy icon, between name and edit)
- Clone mode opens editor pre-filled with source agent's name/prompt/tools
- Name auto-set to '<Original> (variant)' with text selected for easy rename
- Saves as new agent (mode: create) — no backend changes needed
- CSS: clone button hover uses primary green color
Replace the useless toggle + read-only banner with an interactive pool editor in Settings → Advanced:
- Lists all available agents with checkboxes
- Click 'Champion' button to designate baseline agent
- 'Save Pool' button calls new POST /api/ab/pool/set endpoint
- 'Disable' button calls POST /api/ab/pool/disable
- Status badge shows Active/Inactive
- Validation: requires 2+ agents selected and one champion

Backend:
- POST /api/ab/pool/set: admin-only, builds ABPool from agent names
- POST /api/ab/pool/disable: admin-only, clears pool
- Both update ab_pool in-memory at runtime (no config file edit needed)

A/B mode now auto-activates when pool is saved (no separate toggle).
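The "2+ agents and one champion" validation could look like this (hypothetical helper; both the editor UI and the admin-only endpoint would apply the same rule):

```python
def validate_pool(agents, champion):
    """Return a list of validation errors; an empty list means the pool is savable."""
    errors = []
    if len(agents) < 2:
        errors.append("select at least 2 agents")
    if champion not in agents:
        errors.append("champion must be one of the selected agents")
    return errors
```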
Option C implementation:
- ab_only field in AgentSpec: agents with ab_only: true in frontmatter are hidden from the main agent dropdown but appear in the pool editor
- Quick variant button (+) on each agent row in pool editor: opens inline panel to create a variant with tool toggles, saved with ab_only: true
- AB badge: agents marked ab_only show a blue 'AB' pill in the pool editor
- Client-side duplicate name check in variant panel (backend already has 409)
- Pool editor now uses allAgents (includes ab_only) for full visibility
# Conflicts:
# src/cli/templates/base-config.yaml
# src/interfaces/chat_app/app.py
# src/interfaces/chat_app/static/chat.js
# src/interfaces/chat_app/templates/index.html
… trace, config nesting
- Add like/dislike/comment feedback buttons to A/B comparison arms
- Start and stop trace timers for A/B arm messages
- Emit ab_arms event early so frontend shows variant names immediately
- Start agent activity trace collapsed by default in normal mode
- Add CSS for variant name labels on A/B arms
- Move ab_testing config under services.chat_app (preferred path)
- Update load_ab_pool() to check services.chat_app.ab_testing first
- Keep legacy services.ab_testing fallback for backward compatibility
Use .get() with sensible defaults for host, hostname, template_folder, static_folder, num_responses_until_feedback — these may not be present in Postgres if config was seeded without CLI-added fields.
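An illustrative sketch of those defensive lookups (the defaults here are placeholders, not the real deployment values):

```python
# e.g. a Postgres-seeded config missing the CLI-added fields
cfg = {"hostname": "0.0.0.0"}

host = cfg.get("host", "localhost")
template_folder = cfg.get("template_folder", "templates")
static_folder = cfg.get("static_folder", "static")
num_responses = cfg.get("num_responses_until_feedback", 0)
```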
Pull request overview
Adds a server-driven champion/challenger A/B testing “pool” to the chat app, including admin configuration UI, a new comparison streaming endpoint, and Postgres-backed per-variant metrics.
Changes:
- Introduces `ABPool`/`ABVariant` utilities and server endpoints for pool config, streaming comparisons, preference submission, and metrics.
- Refactors frontend A/B behavior to stream a single interleaved NDJSON comparison (two arms) and adds an admin-only pool editor + quick variant creation.
- Extends DB schema and SQL to store variant metadata on comparisons and maintain aggregate variant metrics; adds Playwright E2E coverage.
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `tests/ui/workflows/21-ab-testing.spec.ts` | New E2E workflows covering admin gating, pool editor, comparison streaming, and voting. |
| `tests/ui/fixtures.ts` | Adds A/B-related mock data and route helpers for Playwright. |
| `src/utils/sql.py` | Extends A/B comparison queries and adds variant-metrics upsert/select queries. |
| `src/utils/conversation_service.py` | Stores/reads variant metadata on comparisons; adds variant-metrics update/query helpers. |
| `src/utils/ab_testing.py` | New pool/variant config loader, validation, and challenger sampling logic. |
| `src/interfaces/chat_app/templates/index.html` | Replaces old A/B toggle UI with admin-only pool editor section. |
| `src/interfaces/chat_app/static/chat.js` | Adds pool endpoints, NDJSON reader, pool editor UI, A/B stream handling, and "clone as variant" UX. |
| `src/interfaces/chat_app/static/chat.css` | Styles pool editor, quick-variant panel, and new A/B arm layout. |
| `src/interfaces/chat_app/event_formatter.py` | New shared formatter to unify streaming event shaping across normal and A/B streaming. |
| `src/interfaces/chat_app/app.py` | Adds pool/compare/metrics endpoints; implements threaded A/B streaming; integrates formatter; refactors helpers. |
| `src/cli/templates/init.sql` | Adds variant metadata columns to ab_comparisons and creates ab_variant_metrics. |
| `src/cli/templates/base-config.yaml` | Adds template support for ab_testing configuration. |
| `src/bin/service_chat.py` | Makes service startup more robust to missing config keys; resolves template/static paths. |
| `src/archi/providers/local_provider.py` | Changes Ollama model kwargs to keep models alive for longer. |
| `src/archi/pipelines/agents/base_react.py` | Emits implicit thinking events when providers produce empty chunks. |
| `src/archi/pipelines/agents/agent_spec.py` | Adds ab_only frontmatter flag support to agent specs. |
| `docs/docs/configuration.md` | Documents A/B testing pool configuration and variant fields. |
| `docs/docs/api_reference.md` | Documents pool/compare/metrics endpoints (and notes legacy). |
| `configs/submit76/config.yaml` | Adds a deployment config example enabling the A/B pool. |
@haozturk @hassan11196 @JasonMoho The PR has been updated and is ready to be reviewed now. One thing that needs your testing (Hasan, Hassan) is the integration with the RBAC setup, since I don't have the corresponding secrets, config entries, and whatever else is needed to launch this myself. I set up a proxy admin role to test, which worked, but not the real thing (you can also send me what I need to run what you guys have going). The PR description at the top of the page has been updated, including notable and minor fixes/changes since the last update. Let me know what y'all think, whether it works, and what we need to do to get it in.
Hey @haozturk, I've addressed your comments and tested with a full RBAC setup, thanks to help from you and @hassan11196. Here are the details for your convenience:
```jinja
{%- endfor %}
{%- endfor %}
{%- if services.chat_app.ab_testing is defined and services.chat_app.ab_testing.enabled | default(false) %}
ab_testing:
```
Here's one problem I noticed: if I restart the app, the UI-created/set variants and the experiment settings are reset, since the YAML config overwrites them. This will become a problem when we create new variants in the UI and restart the app: we'd have to re-create these variants on every restart.

I can think of the following solution: for the A/B testing config, use the YAML values during bootstrap if the A/B testing config in the DB is empty; otherwise, use the values in the DB. Keep an explicit force-override mode for cases where you intentionally want YAML to overwrite the A/B settings too. This should be false by default.

In other words, YAML remains authoritative for most static config that can't be edited in the UI. The DB becomes authoritative only for UI-managed A/B settings (services.chat_app.ab_testing).

This introduces an extra complication, but it's necessary if we want to manage these settings through the UI. Let me know what you think.
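The suggested precedence could be as small as this (hypothetical helper name; `force_yaml` is the explicit override mode, false by default):

```python
from typing import Optional

def resolve_ab_config(yaml_cfg: dict, db_cfg: Optional[dict], force_yaml: bool = False) -> dict:
    """DB wins for UI-managed A/B settings, unless YAML override is forced."""
    if force_yaml or not db_cfg:
        return dict(yaml_cfg)  # bootstrap (empty DB) or intentional override
    return dict(db_cfg)
```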
```html
<div class="provider-row">
  <label class="settings-field-label" for="model-select-b">Model</label>
  <select id="model-select-b" class="model-select model-select-b"></select>
<p class="settings-description">Move the slider higher to help evaluate more responses, or lower to reduce interruptions. If you leave it alone, the deployment default is used.</p>
```
The word "interruptions" implies a negative disruption. Use something like "or lower for a more standard single-response flow".
Minor: I don't know exactly where it's managed, but when the quicker response finishes, its timer doesn't stop. It made me think "is there more to come?" on that response. I think we should make it stop when the response ends.
docs/docs/configuration.md (Outdated)
| `ab_agents_dir` | string | `/root/archi/ab_agents` | Optional legacy import directory for migrating A/B markdown specs into the DB catalog |
| `sample_rate` | float | `1.0` | Fraction of eligible turns that should run A/B |
| `disclosure_mode` | string | `post_vote_reveal` | One of `blind`, `post_vote_reveal`, `named` |
| `default_trace_mode` | string | `minimal` | One of `minimal`, `normal`, `verbose` |
As far as I understand, the following is the mapping between the config values and what's shown in the UI:
- `minimal`: Hidden
- `normal`: Collapsed
- `verbose`: Expanded

The config values don't make sense to me. I would use the same terms in the config as are shown in the UI.
docs/docs/configuration.md (Outdated)
| `enabled` | boolean | `false` | Enable the experiment pool |
| `ab_agents_dir` | string | `/root/archi/ab_agents` | Optional legacy import directory for migrating A/B markdown specs into the DB catalog |
| `sample_rate` | float | `1.0` | Fraction of eligible turns that should run A/B |
| `disclosure_mode` | string | `post_vote_reveal` | One of `blind`, `post_vote_reveal`, `named` |
Can we document the mapping between what's shown in the UI and the config? Also these names aren't very intuitive. Maybe consider renaming.
@haozturk thanks again for the comments, I agree with them. Any UI changes to the A/B testing configuration now persist across restarts. I also followed your suggestion and added a

Champion/Challenger A/B Testing Pool
Adds deployment-scoped champion/challenger A/B testing so admins can manage an experiment pool, compare a baseline agent against configured variants, and collect user preference data.
What It Does
When A/B testing is enabled and the pool is valid, eligible turns can trigger two parallel agent runs:
- Arm positions (`Response A` vs `Response B`) are randomized per comparison
- Votes are `A`, `B`, or `Tie`

The current implementation is deployment-wide and server-authoritative:
- Configuration lives under `services.chat_app.ab_testing`
- `max_pending_per_conversation` is enforced by the backend and UI

Current Configuration Model
Configure A/B testing under:
Important differences from the original version:
- Config lives under `services.chat_app.ab_testing`
- Variants are identified by `label` plus `agent_spec`, not `name`
- `agent_spec` must be a filename under `ab_agents_dir`, not the shared `agents_dir`

New and Updated API Endpoints
Participant and Runtime Endpoints
- `GET /api/ab/pool`
- `GET /api/ab/decision`
- `POST /api/ab/compare`
- `POST /api/ab/preference` (votes: `a`, `b`, `tie`)
- `GET /api/ab/pending`
- `GET /api/ab/metrics`

Admin Configuration Endpoints
- `POST /api/ab/pool/set`
- `POST /api/ab/pool/settings/set`
- `POST /api/ab/pool/variants/set`
- `POST /api/ab/pool/disable`

Current UI
Admin Management
The original pool editor in the chat settings modal has been replaced by a dedicated admin page:
- `A/B Testing` admin page at `/admin/ab-testing`

The admin page includes:
- `Save Configuration`
- `Save Variants`
- `Disable`

Participant Experience
For sampled turns:
- `Response A` / `Response B` labels on the two arms
- Variant disclosure controlled by `disclosure_mode`
- Trace display controlled by `default_trace_mode`

Pending Comparison Behavior
The implementation now supports queued unresolved comparisons:
- `max_pending_per_conversation` controls how many unresolved comparisons a conversation may hold

Agent Spec Isolation
A/B testing no longer reuses the normal user agent pool.
- Normal agents continue to load from `agents_dir`
- A/B variant specs load from a separate `ab_agents_dir`
- When `ab_agents_dir` is configured in source YAML, deployment staging copies those files into the deployment and rewrites runtime config to `/root/archi/ab_agents`

Admin Access and Auth
The current product path is:
That means:
Database
Current database support includes:
- `ab_comparisons` (variant metadata columns)
- `ab_variant_metrics` (aggregate per-variant metrics)

Backend Changes in the Current Version
- `src/interfaces/chat_app/app.py`
- `src/utils/ab_testing.py`: `ABVariant`, `ABPool`, `ab_agents_dir` resolution
- `src/utils/conversation_service.py`
- `src/utils/sql.py`
- `src/cli/templates/base-config.yaml`: `services.chat_app.ab_testing` rendering, `ab_agents_dir`

Notable Behavioral Differences from the Original PR Draft
- Admin management now lives on a dedicated `/admin/ab-testing` page.
- A/B variants no longer share the normal `agents_dir`.
- `max_pending_per_conversation` now supports multiple unresolved comparisons instead of a single hardcoded pending state.

Test Coverage Status
Automated coverage has been updated to reflect the current implementation rather than the original PR draft. The current test surface includes:
Notable Follow-Up Fixes
After the original feature landed, a number of follow-up fixes and product-shape changes were made:
- `archi restart -c ...` now reseeds deployment config so static-config changes are actually applied on restart
- Variant matching changed from name-based matching to explicit `label` plus `agent_spec`
- A/B specs moved to `ab_agents_dir` instead of sharing the normal `agents_dir`
- Staging of `ab_agents_dir` during deployment generation
- Replaced client-side `Math.random()` logic
- Added `max_pending_per_conversation`