
A/B Testing #480

Open
JasonMoho wants to merge 36 commits into main from dev-ab-testing-pool

Conversation


@JasonMoho JasonMoho commented Feb 27, 2026

Champion/Challenger A/B Testing Pool

Adds deployment-scoped champion/challenger A/B testing so admins can manage an experiment pool, compare a baseline agent against configured variants, and collect user preference data.

What It Does

When A/B testing is enabled and the pool is valid, eligible turns can trigger two parallel agent runs:

  • one arm is always the configured champion
  • the other arm is a randomly sampled challenger from the variant pool
  • arm positions (Response A vs Response B) are randomized per comparison
  • users vote A, B, or Tie
  • per-variant win/loss/tie metrics are tracked in Postgres

The current implementation is deployment-wide and server-authoritative:

  • experiment configuration is stored under services.chat_app.ab_testing
  • runtime state is persisted in Postgres-backed static config, not in the original YAML after deploy
  • sampling decisions are made on the server
  • unresolved comparisons are tracked per conversation, with max_pending_per_conversation enforced by the backend and UI
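The server-authoritative decision path can be sketched as follows (a minimal illustration with hypothetical names; the real logic lives in the chat app backend behind GET /api/ab/decision):

```python
import random

def ab_decision(enabled, pool_valid, sample_rate,
                pending_count, max_pending, rng=random):
    """Decide server-side whether the next turn should run an A/B comparison.

    Hypothetical helper; the actual implementation in
    src/interfaces/chat_app/app.py may differ.
    """
    if not enabled or not pool_valid:
        return False
    if pending_count >= max_pending:
        # The conversation already holds its limit of unresolved comparisons.
        return False
    return rng.random() < sample_rate
```

Keeping this decision on the server means the frontend cannot be tricked (or drift) into over- or under-sampling.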

Current Configuration Model

Configure A/B testing under:

services:
  chat_app:
    ab_testing:
      enabled: true
      ab_agents_dir: /root/archi/ab_agents
      sample_rate: 0.25
      disclosure_mode: post_vote_reveal
      default_trace_mode: minimal
      max_pending_per_conversation: 1
      target_roles: []
      target_permissions: []
      pool:
        champion: baseline
        variants:
          - label: baseline
            agent_spec: baseline-ab.md
          - label: poet
            agent_spec: poet-ab.md
            provider: openrouter
            model: anthropic/claude-3.5-sonnet
            recursion_limit: 30
          - label: concise
            agent_spec: concise-ab.md
            num_documents_to_retrieve: 3

Important differences from the original version:

  • A/B config lives only at services.chat_app.ab_testing
  • A/B variants use label plus agent_spec, not name
  • agent_spec must be a filename under ab_agents_dir
  • A/B agent specs are isolated from normal agents_dir
  • incomplete A/B config no longer blocks startup; it starts inactive and surfaces warnings in the admin UI/runtime
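A minimal sketch of the warning-based validation (field names follow the config model above; the real checks live in src/utils/ab_testing.py and may differ):

```python
def load_pool(ab_cfg):
    """Validate the A/B pool config; return (pool, warnings) instead of
    raising, so incomplete config leaves the experiment inactive with
    warnings surfaced in the admin UI rather than blocking startup."""
    warnings = []
    pool = ab_cfg.get("pool") or {}
    variants = pool.get("variants") or []
    labels = [v.get("label") for v in variants]
    if len(variants) < 2:
        warnings.append("A/B pool needs at least two variants; starting inactive")
    if pool.get("champion") not in labels:
        warnings.append("champion must match a variant label; starting inactive")
    for v in variants:
        if not v.get("agent_spec"):
            warnings.append(f"variant {v.get('label')!r} is missing agent_spec")
    return (None if warnings else pool), warnings
```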

New and Updated API Endpoints

Participant and Runtime Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | `/api/ab/pool` | Returns participant-safe pool info, or the full admin payload for admins |
| GET | `/api/ab/decision` | Server-side decision for whether the next turn should use A/B |
| POST | `/api/ab/compare` | Streams a champion-vs-challenger comparison as NDJSON |
| POST | `/api/ab/preference` | Submits the user vote (`a`, `b`, `tie`) |
| GET | `/api/ab/pending` | Returns unresolved comparisons for a conversation |
| GET | `/api/ab/metrics` | Returns aggregate per-variant metrics |
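The `/api/ab/compare` stream interleaves events for both arms on a single NDJSON connection. A client-side sketch of demultiplexing it (the `arm` field is an assumption about the event schema, not the documented shape):

```python
import json

def split_ndjson_arms(lines):
    """Group interleaved NDJSON comparison events by arm.

    `lines` is any iterable of text lines; each non-empty line is
    assumed to be a JSON object carrying an "arm" key ("a" or "b").
    """
    arms = {"a": [], "b": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip keep-alive blank lines
        event = json.loads(line)
        arms.setdefault(event.get("arm"), []).append(event)
    return arms
```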

Admin Configuration Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/ab/pool/set` | Legacy full-pool save path |
| POST | `/api/ab/pool/settings/set` | Saves experiment settings only |
| POST | `/api/ab/pool/variants/set` | Saves variants only |
| POST | `/api/ab/pool/disable` | Disables A/B testing while retaining saved variant config |

Current UI

Admin Management

The original pool editor in the chat settings modal has been replaced by a dedicated admin page:

  • data page top-right action button: A/B Testing
  • admin route: /admin/ab-testing
  • page is admin-only

The admin page includes:

  • experiment status badge
  • sample rate
  • disclosure mode
  • default agent-activity visibility
  • max pending per conversation
  • champion selector
  • full variant list with:
    • label
    • A/B-only agent spec
    • provider override
    • model override
    • recursion limit override
    • document retrieval override
  • split save actions:
    • Save Configuration
    • Save Variants
  • Disable
  • add/remove variant actions
  • inline creation of new A/B-only agent specs from the agent-spec selector

Participant Experience

For sampled turns:

  • both responses stream side-by-side
  • the comparison uses neutral Response A / Response B labels
  • variant identity is shown according to disclosure_mode
  • agent activity visibility is controlled by default_trace_mode
  • voting UI appears after the comparison is ready

Pending Comparison Behavior

The implementation now supports queued unresolved comparisons:

  • max_pending_per_conversation controls how many unresolved comparisons a conversation may hold
  • the UI restores all pending comparisons on reload
  • input is blocked only when the configured pending limit is reached
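The reload behavior reduces to one small rule (illustrative sketch; the real logic is in chat.js):

```python
def restore_conversation_state(pending_comparisons, max_pending):
    """On page reload: re-render every unresolved comparison, and block
    the input box only when the configured pending limit is reached."""
    return {
        "comparisons_to_render": list(pending_comparisons),
        "input_blocked": len(pending_comparisons) >= max_pending,
    }
```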

Agent Spec Isolation

A/B testing no longer reuses the normal user agent pool.

  • normal user-visible agent specs live in agents_dir
  • A/B-only specs live in ab_agents_dir
  • A/B-created specs are hidden from normal user agent CRUD
  • when ab_agents_dir is configured in source YAML, deployment staging copies those files into the deployment and rewrites runtime config to /root/archi/ab_agents

Admin Access and Auth

The current product path is:

  • auth required
  • admin page and admin A/B endpoints are RBAC-gated
  • temporary basic-auth admin testing overrides were removed

That means:

  • basic auth remains identity-only
  • real A/B admin access should come from the permanent auth/RBAC path

Database

Current database support includes:

  • ab_comparisons
    • stores each comparison
    • variant metadata
    • prompt and response references
    • recorded preference
  • ab_variant_metrics
    • aggregate wins
    • losses
    • ties
    • totals
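The aggregate table lends itself to an upsert per vote. A sketch using SQLite for portability (production uses Postgres, and the real column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ab_variant_metrics (
        variant_label TEXT PRIMARY KEY,
        wins   INTEGER NOT NULL DEFAULT 0,
        losses INTEGER NOT NULL DEFAULT 0,
        ties   INTEGER NOT NULL DEFAULT 0,
        total  INTEGER NOT NULL DEFAULT 0)""")

def record_preference(conn, winner, loser, tie=False):
    """Increment per-variant aggregates for one recorded preference.
    On a tie, both variants get a tie; otherwise winner/loser get
    a win/loss respectively. Every vote bumps both totals."""
    for label, col in ((winner, "ties" if tie else "wins"),
                       (loser, "ties" if tie else "losses")):
        conn.execute(f"""
            INSERT INTO ab_variant_metrics (variant_label, {col}, total)
            VALUES (?, 1, 1)
            ON CONFLICT(variant_label)
            DO UPDATE SET {col} = {col} + 1, total = total + 1""", (label,))

record_preference(conn, "poet", "baseline")            # poet wins
record_preference(conn, "poet", "baseline", tie=True)  # tie
```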

Backend Changes in the Current Version

  • src/interfaces/chat_app/app.py
    • dedicated A/B admin page route
    • admin and participant A/B APIs
    • server-side A/B decision logic
    • split settings/variants save paths
    • queued pending-comparison handling
    • runtime config refresh from persisted static config
  • src/utils/ab_testing.py
    • ABVariant
    • ABPool
    • validation and pool loading
    • isolated ab_agents_dir resolution
    • warning-based inactive startup behavior
  • src/utils/conversation_service.py
    • A/B comparison creation
    • preference submission
    • pending comparison list/count support
    • metrics updates
  • src/utils/sql.py
    • SQL for A/B comparison and metrics flows
  • src/cli/templates/base-config.yaml
    • current services.chat_app.ab_testing rendering
    • ab_agents_dir
    • split current pool fields

Notable Behavioral Differences from the Original PR Draft

  • Admin management is no longer inside the chat settings modal; it is on a dedicated /admin/ab-testing page.
  • The system no longer uses shared agents_dir for A/B variants.
  • The admin page saves deployment-wide config through Postgres-backed static config, not in-memory only.
  • Sampling is server-side, not frontend-only.
  • max_pending_per_conversation now supports multiple unresolved comparisons instead of a single hardcoded pending state.
  • Incomplete A/B config warns and stays inactive instead of failing deployment startup.
  • The temporary basic-auth admin bridge used during development/testing has been removed.

Test Coverage Status

Automated coverage has been updated to reflect the current implementation rather than the original PR draft. The current test surface includes:

  • updated Playwright fixtures for current admin payloads and isolated A/B agent pools
  • updated A/B workflow coverage for the dedicated admin page and current selectors
  • coverage for queued pending comparisons
  • backend unit coverage for admin payloads, disable behavior, and pending-comparison enforcement
  • regression coverage proving the temporary basic-auth admin bridge was removed without changing permanent auth behavior

Notable Follow-Up Fixes

After the original feature landed, a number of follow-up fixes and product-shape changes were made:

  • moved A/B management from the chat settings modal to a dedicated admin page
  • split admin saves into separate settings and variants save flows
  • fixed runtime refresh so admin saves rehydrate from persisted Postgres config instead of stale cached state
  • made archi restart -c ... reseed deployment config so static-config changes are actually applied on restart
  • updated disable behavior so the inactive state persists and re-renders correctly
  • changed A/B variant config from name-based matching to explicit label plus agent_spec
  • isolated A/B-only agent specs into ab_agents_dir instead of sharing normal agents_dir
  • added staging/rendering support for ab_agents_dir in deployment generation
  • enabled inline creation of new A/B-only agent specs from the admin variant selector
  • moved server-side sampling decisions out of frontend-only Math.random() logic
  • expanded pending-comparison handling from a single unresolved comparison to queue-aware enforcement using max_pending_per_conversation
  • fixed A/B compare/runtime bugs around A/B agent directory resolution and variant initialization
  • cleaned up A/B arm headers and trace presentation to better match the current chat theme
  • fixed agent-dropdown delete confirmation behavior so the menu stays open while confirming
  • removed the temporary basic-auth admin-testing bridge once dedicated admin-page work was complete

- New ab_testing.py: ABVariant, ABPool with config-driven champion/challenger
- DB schema: variant columns on ab_comparisons, new ab_variant_metrics table
- Server: /api/ab/pool, /api/ab/compare (NDJSON streaming), /api/ab/metrics
- Metrics: auto-increment win/loss/tie on preference submission
- Client: pool-aware AB flow, interleaved NDJSON parsing, pool banner UI
- Docs: configuration guide and API reference for new endpoints
- Backward compatible: legacy manual provider-B mode still works when no pool
- base-config.yaml: render ab_testing config block in Jinja template
- ab_testing.py: parse nested pool.champion/pool.variants config structure,
  add fallback to config['services']['ab_testing']
- app.py: fix NOT NULL violations (link/context cols use empty string),
  store user message before AB responses, add ConversationService import
  for variant metrics, proper FK handling for comparison records
- chat.js: fix request key (message -> last_message), fix history format
  (use slice(-1) for nested array), fix timestamp units
langchain-ollama <1.1 drops the thinking/reasoning payload from
models like qwen3 that use a thinking phase, producing hundreds
of empty AIMessageChunks before the actual content arrives.

Detect these empty chunks and emit a thinking_start event so
the UI Agent Activity indicator stays alive during the thinking
phase instead of showing a dead pause.
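The workaround can be sketched as a small classifier over incoming chunks (names are illustrative, not the actual implementation in base_react.py):

```python
def classify_chunk(content, thinking_active):
    """Map a streamed chunk to a UI event.

    Empty chunks from langchain-ollama < 1.1 stand in for the dropped
    thinking payload: the first one opens the thinking phase, later
    ones are swallowed, and the first non-empty chunk closes it.
    """
    if not content:
        return None if thinking_active else "thinking_start"
    return "thinking_end" if thinking_active else "content"
```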
_stream_arm now emits properly structured events with tool_call_id,
tool_name, tool_args at top level (matching what renderToolStart expects)
instead of nesting everything in metadata. Also:

- Add thinking_start/thinking_end handling to JS A/B event loop
- Add step-timeline container to A/B trace HTML so tool rendering works
- Extract tool calls from both memory (tool_inputs_by_id) and AIMessage
  fallback
- Extract tool_output content from ToolMessage directly
Both Chat.stream() and _stream_arm() (A/B testing) now use a single
PipelineEventFormatter class for converting PipelineOutput into
structured JSON events. This eliminates ~200 lines of duplicated
tool-extraction logic that had already diverged between the two paths.

Key changes:
- New event_formatter.py with PipelineEventFormatter class
  - Deferred tool_start emission (emits with output for stable ordering)
  - Progressive merging from tool_calls, additional_kwargs, tool_call_chunks
  - Stateful tracking of emitted/pending tool IDs
- app.py: Chat.stream() inner loop reduced from ~260 to ~50 lines
- app.py: _stream_arm() reduced from ~140 to ~20 lines
- chat.js: extracted _renderStreamEvent() shared by both handlers
- Removed dead code block (unreachable backfill after return)

Deployed to submit76, verified regular streaming (with tools) and
A/B comparison streaming both work identically.
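The core idea of the shared formatter, stripped to a sketch (class name, method names, and event shapes here are assumptions; see event_formatter.py for the real class):

```python
class EventFormatterSketch:
    """Defer tool_start emission until the tool's output arrives, so the
    event order is stable regardless of how call chunks trickle in."""

    def __init__(self):
        self.pending = {}   # tool_call_id -> progressively merged call info
        self.emitted = set()

    def on_tool_call(self, tool_call_id, tool_name=None, tool_args=None):
        # Progressive merging: chunks may each carry partial call info.
        info = self.pending.setdefault(tool_call_id,
                                       {"tool_call_id": tool_call_id})
        if tool_name:
            info["tool_name"] = tool_name
        info.setdefault("tool_args", {}).update(tool_args or {})

    def on_tool_output(self, tool_call_id, output):
        # Emit tool_start together with tool_output for stable ordering.
        info = self.pending.pop(tool_call_id, {"tool_call_id": tool_call_id})
        self.emitted.add(tool_call_id)
        return [dict(info, type="tool_start"),
                {"type": "tool_output", "tool_call_id": tool_call_id,
                 "output": output}]
```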
HIGH severity fixes:
- Extract _error_event() helper (3 sites)
- Extract _ab_comparison_from_row() helper (3 sites)
- Extract _trace_from_row() helper (3 sites)
- Consolidate agent_class resolution to _get_agent_class_from_cfg() (6 sites)

MEDIUM severity fixes:
- Extract _ndjson_response() on FlaskAppWrapper (2 sites)
- Merge _delete_git_repo/_delete_jira_project into _delete_source_documents()
- Shared _readNDJSON() async generator in chat.js (2 sites, fixes buffer drain bug)
- Merge like/dislike into _toggle_reaction() + _with_feedback_lock()
- Dedup format_links score logic via _format_score_str()

Net: -152 lines. All endpoints verified on submit76.
…ecution

- Add keep_alive='24h' to ChatOllama model creation so Ollama retains
  models in VRAM between requests (default was 5 min eviction)
- Add _prewarm_ab_models() to pre-load all AB variant models on startup
  via background threads calling Ollama /api/generate
- Add timing instrumentation to _stream_arm (thread start, vectorstore
  ready, first event, and completion timestamps)
- Both arms now start within 0.1-0.3s of each other (was ~19s+ gap
  due to cold model loading and sequential Ollama inference)
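The prewarm step amounts to firing one background load per variant model; a sketch with a pluggable loader (the real code calls Ollama's /api/generate):

```python
import threading

def prewarm_models(model_names, load_fn):
    """Start one daemon thread per model so all A/B variant models load
    in parallel at startup; returns the threads for optional joining."""
    threads = [threading.Thread(target=load_fn, args=(name,), daemon=True)
               for name in model_names]
    for t in threads:
        t.start()
    return threads
```

Joining is optional at startup; the point is that by the time a comparison runs, both arms find their model already resident in VRAM.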
…-param signature), unused delete/list methods, dead /api/ab/create endpoint
…or non-admins

- Add _get_request_client_id / _is_admin_request helpers to FlaskAppWrapper
- ab_get_pool: returns is_admin flag, only reveals pool details to admins
- ab_compare_stream: 403 for non-admins
- ab_get_metrics: 403 for non-admins
- ab_submit_preference / ab_get_pending: remain open (voting is ok)
- Frontend: pass client_id to getABPool/getABMetrics
- Frontend: hide entire A/B settings section for non-admins
- Template: wrap A/B toggle in #ab-settings-section, hidden by default
- New clone icon button in agent dropdown (copy icon, between name and edit)
- Clone mode opens editor pre-filled with source agent's name/prompt/tools
- Name auto-set to '<Original> (variant)' with text selected for easy rename
- Saves as new agent (mode: create) — no backend changes needed
- CSS: clone button hover uses primary green color
Replace the useless toggle + read-only banner with an interactive pool editor
in Settings → Advanced:
- Lists all available agents with checkboxes
- Click 'Champion' button to designate baseline agent
- 'Save Pool' button calls new POST /api/ab/pool/set endpoint
- 'Disable' button calls POST /api/ab/pool/disable
- Status badge shows Active/Inactive
- Validation: requires 2+ agents selected and one champion

Backend:
- POST /api/ab/pool/set: admin-only, builds ABPool from agent names
- POST /api/ab/pool/disable: admin-only, clears pool
- Both update ab_pool in-memory at runtime (no config file edit needed)

A/B mode now auto-activates when pool is saved (no separate toggle).
Option C implementation:
- ab_only field in AgentSpec: agents with ab_only: true in frontmatter are
  hidden from the main agent dropdown but appear in the pool editor
- Quick variant button (+) on each agent row in pool editor: opens inline
  panel to create a variant with tool toggles, saved with ab_only: true
- AB badge: agents marked ab_only show a blue 'AB' pill in the pool editor
- Client-side duplicate name check in variant panel (backend already has 409)
- Pool editor now uses allAgents (includes ab_only) for full visibility
# Conflicts:
#	src/cli/templates/base-config.yaml
#	src/interfaces/chat_app/app.py
#	src/interfaces/chat_app/static/chat.js
#	src/interfaces/chat_app/templates/index.html
@JasonMoho JasonMoho added the enhancement New feature or request label Feb 27, 2026
@pmlugato pmlugato linked an issue Mar 3, 2026 that may be closed by this pull request
@lucalavezzo lucalavezzo self-requested a review March 3, 2026 15:39

haozturk commented Mar 3, 2026


Thanks Jason. First impressions: a couple of issues I noticed:

  • I don't see where to select the preferred response. The feedback buttons disappeared
  • Timer doesn't work
  • It's not clear which response belongs to which config/model.
  • Agent activity is shown by default. I like it better when it's hidden by default, with the user able to expand it if they like
  • Can we configure the frequency? I don't think we want to do it for every message.

… trace, config nesting

- Add like/dislike/comment feedback buttons to A/B comparison arms
- Start and stop trace timers for A/B arm messages
- Emit ab_arms event early so frontend shows variant names immediately
- Start agent activity trace collapsed by default in normal mode
- Add CSS for variant name labels on A/B arms
- Move ab_testing config under services.chat_app (preferred path)
- Update load_ab_pool() to check services.chat_app.ab_testing first
- Keep legacy services.ab_testing fallback for backward compatibility
Use .get() with sensible defaults for host, hostname, template_folder,
static_folder, num_responses_until_feedback — these may not be present
in Postgres if config was seeded without CLI-added fields.

Copilot AI left a comment


Pull request overview

Adds a server-driven champion/challenger A/B testing “pool” to the chat app, including admin configuration UI, a new comparison streaming endpoint, and Postgres-backed per-variant metrics.

Changes:

  • Introduces ABPool/ABVariant utilities and server endpoints for pool config, streaming comparisons, preference submission, and metrics.
  • Refactors frontend A/B behavior to stream a single interleaved NDJSON comparison (two arms) and adds an admin-only pool editor + quick variant creation.
  • Extends DB schema and SQL to store variant metadata on comparisons and maintain aggregate variant metrics; adds Playwright E2E coverage.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| `tests/ui/workflows/21-ab-testing.spec.ts` | New E2E workflows covering admin gating, pool editor, comparison streaming, and voting. |
| `tests/ui/fixtures.ts` | Adds A/B-related mock data and route helpers for Playwright. |
| `src/utils/sql.py` | Extends A/B comparison queries and adds variant-metrics upsert/select queries. |
| `src/utils/conversation_service.py` | Stores/reads variant metadata on comparisons; adds variant-metrics update/query helpers. |
| `src/utils/ab_testing.py` | New pool/variant config loader, validation, and challenger sampling logic. |
| `src/interfaces/chat_app/templates/index.html` | Replaces old A/B toggle UI with admin-only pool editor section. |
| `src/interfaces/chat_app/static/chat.js` | Adds pool endpoints, NDJSON reader, pool editor UI, A/B stream handling, and "clone as variant" UX. |
| `src/interfaces/chat_app/static/chat.css` | Styles pool editor, quick-variant panel, and new A/B arm layout. |
| `src/interfaces/chat_app/event_formatter.py` | New shared formatter to unify streaming event shaping across normal and A/B streaming. |
| `src/interfaces/chat_app/app.py` | Adds pool/compare/metrics endpoints; implements threaded A/B streaming; integrates formatter; refactors helpers. |
| `src/cli/templates/init.sql` | Adds variant metadata columns to ab_comparisons and creates ab_variant_metrics. |
| `src/cli/templates/base-config.yaml` | Adds template support for ab_testing configuration. |
| `src/bin/service_chat.py` | Makes service startup more robust to missing config keys; resolves template/static paths. |
| `src/archi/providers/local_provider.py` | Changes Ollama model kwargs to keep models alive for longer. |
| `src/archi/pipelines/agents/base_react.py` | Emits implicit thinking events when providers produce empty chunks. |
| `src/archi/pipelines/agents/agent_spec.py` | Adds ab_only frontmatter flag support to agent specs. |
| `docs/docs/configuration.md` | Documents A/B testing pool configuration and variant fields. |
| `docs/docs/api_reference.md` | Documents pool/compare/metrics endpoints (and notes legacy). |
| `configs/submit76/config.yaml` | Adds a deployment config example enabling the A/B pool. |


@pmlugato pmlugato self-assigned this Mar 18, 2026
@pmlugato

@haozturk @hassan11196 @JasonMoho The PR has been updated and is ready to be reviewed now. One thing that needs your testing (Hasan, Hassan) is the integration with the RBAC setup, since I don't have the corresponding secrets, config entries, or whatever else is needed to launch this myself. I set up a proxy admin role to test, which worked, but not the real thing (you can also send me what I need to run what you guys have going).

PR description at the top of the page has been updated, including notable and minor fixes/changes since last update.

Let me know what y'all think and if it works, and what we need to do to get it in.


@haozturk haozturk left a comment


Overall looks very good, thank you @pmlugato ! There's a few changes I requested inline.


pmlugato commented Apr 1, 2026

hey @haozturk , I've addressed your comments, and tested with full RBAC set up thanks to help from you and @hassan11196 . Here are the details for your convenience:

  • Added the following A/B permissions:

    • ab:participate
    • ab:view
    • ab:metrics
    • ab:manage
  • Added a per-user A/B participation slider in Chat Settings:

    • deployment sample_rate remains the default
    • each participating user can override their own rate from 0-1
    • the setting is stored per user in Postgres
  • Kept runtime A/B spec resolution Postgres-only:

    • staged ab_agents_dir files are import inputs only
    • runtime reads from the database only
  • ab:participate

    • can be sampled into A/B comparisons
    • can see and use the personal participation-rate slider
  • ab:view

    • can open the A/B Testing page in read-only mode
  • ab:metrics

    • can open the A/B Testing page
    • can view aggregate comparison metrics
  • ab:manage

    • can change experiment settings
    • can manage variants
    • can create/delete A/B agent specs

target_roles and target_permissions are additional, optional filters on top of ab:participate.

  • If neither is set, all users with ab:participate are eligible.
  • If only target_roles is set, the user must match at least one listed role.
  • If only target_permissions is set, the user must match at least one listed permission.
  • If both are set, the user must satisfy both filters.
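The filter rules above amount to the following (illustrative sketch, not the shipped code):

```python
def is_eligible(user_roles, user_permissions, target_roles, target_permissions):
    """A/B sampling eligibility: ab:participate is always required;
    target_roles / target_permissions are optional additional filters."""
    if "ab:participate" not in user_permissions:
        return False
    if target_roles and not set(user_roles) & set(target_roles):
        return False  # must match at least one listed role
    if target_permissions and not set(user_permissions) & set(target_permissions):
        return False  # must match at least one listed permission
    return True
```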


@haozturk haozturk left a comment


Thanks a lot @pmlugato looks almost ready. See my inline comments. The restart issue looks major, others are minor.

{%- endfor %}
{%- endfor %}
{%- if services.chat_app.ab_testing is defined and services.chat_app.ab_testing.enabled | default(false) %}
ab_testing:

Here's one problem I noticed: if I restart the app, the UI-created variants and the experiment settings are reset, since the YAML config overwrites them. This will become a problem when we create new variants in the UI and restart the app: we'd have to re-create these variants on every restart.

I can think of the following solution: for the A/B testing config, use the YAML values during bootstrap if the A/B testing config in the DB is empty; otherwise, use the values in the DB. Keep an explicit force-override mode for cases where you intentionally want YAML to overwrite A/B too. This should be false by default.

In other words, YAML remains authoritative for most static config that can't be edited in the UI; the DB becomes authoritative only for UI-managed A/B settings (services.chat_app.ab_testing).

This is introducing an extra complication, but it's necessary if we want to manage these settings over the UI. Let me know what you think.
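The proposed precedence is simple to state in code (a sketch of the suggestion; `force_yaml_override` stands in for whatever the flag ends up being called):

```python
def resolve_ab_config(yaml_cfg, db_cfg, force_yaml_override=False):
    """YAML seeds the A/B config only when the DB copy is empty, or when
    the operator explicitly forces an override; otherwise the DB copy
    wins for UI-managed A/B settings."""
    if force_yaml_override or not db_cfg:
        return yaml_cfg
    return db_cfg
```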

<div class="provider-row">
<label class="settings-field-label" for="model-select-b">Model</label>
<select id="model-select-b" class="model-select model-select-b"></select>
<p class="settings-description">Move the slider higher to help evaluate more responses, or lower to reduce interruptions. If you leave it alone, the deployment default is used.</p>

The word "interruptions" implies a negative disruption. Use something like "or lower for a more standard single-response flow".

return data;
},

/**

Minor: Don't know where it's exactly managed, but when the quicker response finishes, its timer doesn't stop. It somehow made me think that "is there more to come?" on that response. I think we should make it stop when the response ends.

| `ab_agents_dir` | string | `/root/archi/ab_agents` | Optional legacy import directory for migrating A/B markdown specs into the DB catalog |
| `sample_rate` | float | `1.0` | Fraction of eligible turns that should run A/B |
| `disclosure_mode` | string | `post_vote_reveal` | One of `blind`, `post_vote_reveal`, `named` |
| `default_trace_mode` | string | `minimal` | One of `minimal`, `normal`, `verbose` |

As far as I understand, the following is the mapping between the config and what's shown in the UI:

  • minimal: Hidden
  • normal: Collapsed
  • verbose: Expanded

What's in the config doesn't make sense to me; I would use the same terms in the config as are shown in the UI.

| `enabled` | boolean | `false` | Enable the experiment pool |
| `ab_agents_dir` | string | `/root/archi/ab_agents` | Optional legacy import directory for migrating A/B markdown specs into the DB catalog |
| `sample_rate` | float | `1.0` | Fraction of eligible turns that should run A/B |
| `disclosure_mode` | string | `post_vote_reveal` | One of `blind`, `post_vote_reveal`, `named` |

Can we document the mapping between what's shown in the UI and the config? Also these names aren't very intuitive. Maybe consider renaming.


pmlugato commented Apr 3, 2026

@haozturk thanks again for the comments, I agree with them. Any UI changes to A/B testing configuration now persist across restarts. I also followed your suggestion and added a force_yaml_override option should it be needed. Timer is fixed, and naming is clearer and consistent between config/UI/docs. Thanks for the good testing, let me know if you find anything else or if we're good to go.



Successfully merging this pull request may close these issues.

update A/B testing
