
A/B Testing #480

Open
JasonMoho wants to merge 36 commits into main from dev-ab-testing-pool

Conversation


@JasonMoho JasonMoho commented Feb 27, 2026

Champion/Challenger A/B Testing Pool

Adds deployment-scoped champion/challenger A/B testing so admins can manage an experiment pool, compare a baseline agent against configured variants, and collect user preference data.

What It Does

When A/B testing is enabled and the pool is valid, eligible turns can trigger two parallel agent runs:

  • one arm is always the configured champion
  • the other arm is a randomly sampled challenger from the variant pool
  • arm positions (Response A vs Response B) are randomized per comparison
  • users vote A, B, or Tie
  • per-variant win/loss/tie metrics are tracked in Postgres

The current implementation is deployment-wide and server-authoritative:

  • experiment configuration is stored under services.chat_app.ab_testing
  • runtime state is persisted in Postgres-backed static config, not in the original YAML after deploy
  • sampling decisions are made on the server
  • unresolved comparisons are tracked per conversation, with max_pending_per_conversation enforced by the backend and UI
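The server-authoritative decision path can be sketched as follows (a minimal illustration with hypothetical names; the real logic lives in the chat app backend behind GET /api/ab/decision):

```python
import random

def ab_decision(enabled, pool_valid, sample_rate,
                pending_count, max_pending, rng=random):
    """Decide server-side whether the next turn should run an A/B comparison.

    Hypothetical helper; the actual implementation in
    src/interfaces/chat_app/app.py may differ.
    """
    if not enabled or not pool_valid:
        return False
    if pending_count >= max_pending:
        # The conversation already holds its limit of unresolved comparisons.
        return False
    return rng.random() < sample_rate
```

Keeping this decision on the server means the frontend cannot be tricked (or drift) into over- or under-sampling.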

Current Configuration Model

Configure A/B testing under:

services:
  chat_app:
    ab_testing:
      enabled: true
      ab_agents_dir: /root/archi/ab_agents
      sample_rate: 0.25
      disclosure_mode: post_vote_reveal
      default_trace_mode: minimal
      max_pending_per_conversation: 1
      target_roles: []
      target_permissions: []
      pool:
        champion: baseline
        variants:
          - label: baseline
            agent_spec: baseline-ab.md
          - label: poet
            agent_spec: poet-ab.md
            provider: openrouter
            model: anthropic/claude-3.5-sonnet
            recursion_limit: 30
          - label: concise
            agent_spec: concise-ab.md
            num_documents_to_retrieve: 3

Important differences from the original version:

  • A/B config lives only at services.chat_app.ab_testing
  • A/B variants use label plus agent_spec, not name
  • agent_spec must be a filename under ab_agents_dir
  • A/B agent specs are isolated from normal agents_dir
  • incomplete A/B config no longer blocks startup; it starts inactive and surfaces warnings in the admin UI/runtime
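A minimal sketch of the warning-based validation (field names follow the config model above; the real checks live in src/utils/ab_testing.py and may differ):

```python
def load_pool(ab_cfg):
    """Validate the A/B pool config; return (pool, warnings) instead of
    raising, so incomplete config leaves the experiment inactive with
    warnings surfaced in the admin UI rather than blocking startup."""
    warnings = []
    pool = ab_cfg.get("pool") or {}
    variants = pool.get("variants") or []
    labels = [v.get("label") for v in variants]
    if len(variants) < 2:
        warnings.append("A/B pool needs at least two variants; starting inactive")
    if pool.get("champion") not in labels:
        warnings.append("champion must match a variant label; starting inactive")
    for v in variants:
        if not v.get("agent_spec"):
            warnings.append(f"variant {v.get('label')!r} is missing agent_spec")
    return (None if warnings else pool), warnings
```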

New and Updated API Endpoints

Participant and Runtime Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | `/api/ab/pool` | Returns participant-safe pool info, or the full admin payload for admins |
| GET | `/api/ab/decision` | Server-side decision for whether the next turn should use A/B |
| POST | `/api/ab/compare` | Streams a champion-vs-challenger comparison as NDJSON |
| POST | `/api/ab/preference` | Submits the user vote (`a`, `b`, `tie`) |
| GET | `/api/ab/pending` | Returns unresolved comparisons for a conversation |
| GET | `/api/ab/metrics` | Returns aggregate per-variant metrics |
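The `/api/ab/compare` stream interleaves events for both arms on a single NDJSON connection. A client-side sketch of demultiplexing it (the `arm` field is an assumption about the event schema, not the documented shape):

```python
import json

def split_ndjson_arms(lines):
    """Group interleaved NDJSON comparison events by arm.

    `lines` is any iterable of text lines; each non-empty line is
    assumed to be a JSON object carrying an "arm" key ("a" or "b").
    """
    arms = {"a": [], "b": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip keep-alive blank lines
        event = json.loads(line)
        arms.setdefault(event.get("arm"), []).append(event)
    return arms
```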

Admin Configuration Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/ab/pool/set` | Legacy full-pool save path |
| POST | `/api/ab/pool/settings/set` | Saves experiment settings only |
| POST | `/api/ab/pool/variants/set` | Saves variants only |
| POST | `/api/ab/pool/disable` | Disables A/B testing while retaining saved variant config |

Current UI

Admin Management

The original pool editor in the chat settings modal has been replaced by a dedicated admin page:

  • data page top-right action button: A/B Testing
  • admin route: /admin/ab-testing
  • page is admin-only

The admin page includes:

  • experiment status badge
  • sample rate
  • disclosure mode
  • default agent-activity visibility
  • max pending per conversation
  • champion selector
  • full variant list with:
    • label
    • A/B-only agent spec
    • provider override
    • model override
    • recursion limit override
    • document retrieval override
  • split save actions:
    • Save Configuration
    • Save Variants
  • Disable
  • add/remove variant actions
  • inline creation of new A/B-only agent specs from the agent-spec selector

Participant Experience

For sampled turns:

  • both responses stream side-by-side
  • the comparison uses neutral Response A / Response B labels
  • variant identity is shown according to disclosure_mode
  • agent activity visibility is controlled by default_trace_mode
  • voting UI appears after the comparison is ready

Pending Comparison Behavior

The implementation now supports queued unresolved comparisons:

  • max_pending_per_conversation controls how many unresolved comparisons a conversation may hold
  • the UI restores all pending comparisons on reload
  • input is blocked only when the configured pending limit is reached
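The reload behavior reduces to one small rule (illustrative sketch; the real logic is in chat.js):

```python
def restore_conversation_state(pending_comparisons, max_pending):
    """On page reload: re-render every unresolved comparison, and block
    the input box only when the configured pending limit is reached."""
    return {
        "comparisons_to_render": list(pending_comparisons),
        "input_blocked": len(pending_comparisons) >= max_pending,
    }
```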

Agent Spec Isolation

A/B testing no longer reuses the normal user agent pool.

  • normal user-visible agent specs live in agents_dir
  • A/B-only specs live in ab_agents_dir
  • A/B-created specs are hidden from normal user agent CRUD
  • when ab_agents_dir is configured in source YAML, deployment staging copies those files into the deployment and rewrites runtime config to /root/archi/ab_agents

Admin Access and Auth

The current product path is:

  • auth required
  • admin page and admin A/B endpoints are RBAC-gated
  • temporary basic-auth admin testing overrides were removed

That means:

  • basic auth remains identity-only
  • real A/B admin access should come from the permanent auth/RBAC path

Database

Current database support includes:

  • ab_comparisons
    • stores each comparison
    • variant metadata
    • prompt and response references
    • recorded preference
  • ab_variant_metrics
    • aggregate wins
    • losses
    • ties
    • totals
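The aggregate table lends itself to an upsert per vote. A sketch using SQLite for portability (production uses Postgres, and the real column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ab_variant_metrics (
        variant_label TEXT PRIMARY KEY,
        wins   INTEGER NOT NULL DEFAULT 0,
        losses INTEGER NOT NULL DEFAULT 0,
        ties   INTEGER NOT NULL DEFAULT 0,
        total  INTEGER NOT NULL DEFAULT 0)""")

def record_preference(conn, winner, loser, tie=False):
    """Increment per-variant aggregates for one recorded preference.
    On a tie, both variants get a tie; otherwise winner/loser get
    a win/loss respectively. Every vote bumps both totals."""
    for label, col in ((winner, "ties" if tie else "wins"),
                       (loser, "ties" if tie else "losses")):
        conn.execute(f"""
            INSERT INTO ab_variant_metrics (variant_label, {col}, total)
            VALUES (?, 1, 1)
            ON CONFLICT(variant_label)
            DO UPDATE SET {col} = {col} + 1, total = total + 1""", (label,))

record_preference(conn, "poet", "baseline")            # poet wins
record_preference(conn, "poet", "baseline", tie=True)  # tie
```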

Backend Changes in the Current Version

  • src/interfaces/chat_app/app.py
    • dedicated A/B admin page route
    • admin and participant A/B APIs
    • server-side A/B decision logic
    • split settings/variants save paths
    • queued pending-comparison handling
    • runtime config refresh from persisted static config
  • src/utils/ab_testing.py
    • ABVariant
    • ABPool
    • validation and pool loading
    • isolated ab_agents_dir resolution
    • warning-based inactive startup behavior
  • src/utils/conversation_service.py
    • A/B comparison creation
    • preference submission
    • pending comparison list/count support
    • metrics updates
  • src/utils/sql.py
    • SQL for A/B comparison and metrics flows
  • src/cli/templates/base-config.yaml
    • current services.chat_app.ab_testing rendering
    • ab_agents_dir
    • split current pool fields

Notable Behavioral Differences from the Original PR Draft

  • Admin management is no longer inside the chat settings modal; it is on a dedicated /admin/ab-testing page.
  • The system no longer uses shared agents_dir for A/B variants.
  • The admin page saves deployment-wide config through Postgres-backed static config, not in-memory only.
  • Sampling is server-side, not frontend-only.
  • max_pending_per_conversation now supports multiple unresolved comparisons instead of a single hardcoded pending state.
  • Incomplete A/B config warns and stays inactive instead of failing deployment startup.
  • The temporary basic-auth admin bridge used during development/testing has been removed.

Test Coverage Status

Automated coverage has been updated to reflect the current implementation rather than the original PR draft. The current test surface includes:

  • updated Playwright fixtures for current admin payloads and isolated A/B agent pools
  • updated A/B workflow coverage for the dedicated admin page and current selectors
  • coverage for queued pending comparisons
  • backend unit coverage for admin payloads, disable behavior, and pending-comparison enforcement
  • regression coverage proving the temporary basic-auth admin bridge was removed without changing permanent auth behavior

Notable Follow-Up Fixes

After the original feature landed, a number of follow-up fixes and product-shape changes were made:

  • moved A/B management from the chat settings modal to a dedicated admin page
  • split admin saves into separate settings and variants save flows
  • fixed runtime refresh so admin saves rehydrate from persisted Postgres config instead of stale cached state
  • made archi restart -c ... reseed deployment config so static-config changes are actually applied on restart
  • updated disable behavior so the inactive state persists and re-renders correctly
  • changed A/B variant config from name-based matching to explicit label plus agent_spec
  • isolated A/B-only agent specs into ab_agents_dir instead of sharing normal agents_dir
  • added staging/rendering support for ab_agents_dir in deployment generation
  • enabled inline creation of new A/B-only agent specs from the admin variant selector
  • moved server-side sampling decisions out of frontend-only Math.random() logic
  • expanded pending-comparison handling from a single unresolved comparison to queue-aware enforcement using max_pending_per_conversation
  • fixed A/B compare/runtime bugs around A/B agent directory resolution and variant initialization
  • cleaned up A/B arm headers and trace presentation to better match the current chat theme
  • fixed agent-dropdown delete confirmation behavior so the menu stays open while confirming
  • removed the temporary basic-auth admin-testing bridge once dedicated admin-page work was complete

- New ab_testing.py: ABVariant, ABPool with config-driven champion/challenger
- DB schema: variant columns on ab_comparisons, new ab_variant_metrics table
- Server: /api/ab/pool, /api/ab/compare (NDJSON streaming), /api/ab/metrics
- Metrics: auto-increment win/loss/tie on preference submission
- Client: pool-aware AB flow, interleaved NDJSON parsing, pool banner UI
- Docs: configuration guide and API reference for new endpoints
- Backward compatible: legacy manual provider-B mode still works when no pool
- base-config.yaml: render ab_testing config block in Jinja template
- ab_testing.py: parse nested pool.champion/pool.variants config structure,
  add fallback to config['services']['ab_testing']
- app.py: fix NOT NULL violations (link/context cols use empty string),
  store user message before AB responses, add ConversationService import
  for variant metrics, proper FK handling for comparison records
- chat.js: fix request key (message -> last_message), fix history format
  (use slice(-1) for nested array), fix timestamp units
langchain-ollama <1.1 drops the thinking/reasoning payload from
models like qwen3 that use a thinking phase, producing hundreds
of empty AIMessageChunks before the actual content arrives.

Detect these empty chunks and emit a thinking_start event so
the UI Agent Activity indicator stays alive during the thinking
phase instead of showing a dead pause.
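The workaround can be sketched as a small classifier over incoming chunks (names are illustrative, not the actual implementation in base_react.py):

```python
def classify_chunk(content, thinking_active):
    """Map a streamed chunk to a UI event.

    Empty chunks from langchain-ollama < 1.1 stand in for the dropped
    thinking payload: the first one opens the thinking phase, later
    ones are swallowed, and the first non-empty chunk closes it.
    """
    if not content:
        return None if thinking_active else "thinking_start"
    return "thinking_end" if thinking_active else "content"
```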
_stream_arm now emits properly structured events with tool_call_id,
tool_name, tool_args at top level (matching what renderToolStart expects)
instead of nesting everything in metadata. Also:

- Add thinking_start/thinking_end handling to JS A/B event loop
- Add step-timeline container to A/B trace HTML so tool rendering works
- Extract tool calls from both memory (tool_inputs_by_id) and AIMessage
  fallback
- Extract tool_output content from ToolMessage directly
Both Chat.stream() and _stream_arm() (A/B testing) now use a single
PipelineEventFormatter class for converting PipelineOutput into
structured JSON events. This eliminates ~200 lines of duplicated
tool-extraction logic that had already diverged between the two paths.

Key changes:
- New event_formatter.py with PipelineEventFormatter class
  - Deferred tool_start emission (emits with output for stable ordering)
  - Progressive merging from tool_calls, additional_kwargs, tool_call_chunks
  - Stateful tracking of emitted/pending tool IDs
- app.py: Chat.stream() inner loop reduced from ~260 to ~50 lines
- app.py: _stream_arm() reduced from ~140 to ~20 lines
- chat.js: extracted _renderStreamEvent() shared by both handlers
- Removed dead code block (unreachable backfill after return)

Deployed to submit76, verified regular streaming (with tools) and
A/B comparison streaming both work identically.
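The core idea of the shared formatter, stripped to a sketch (class name, method names, and event shapes here are assumptions; see event_formatter.py for the real class):

```python
class EventFormatterSketch:
    """Defer tool_start emission until the tool's output arrives, so the
    event order is stable regardless of how call chunks trickle in."""

    def __init__(self):
        self.pending = {}   # tool_call_id -> progressively merged call info
        self.emitted = set()

    def on_tool_call(self, tool_call_id, tool_name=None, tool_args=None):
        # Progressive merging: chunks may each carry partial call info.
        info = self.pending.setdefault(tool_call_id,
                                       {"tool_call_id": tool_call_id})
        if tool_name:
            info["tool_name"] = tool_name
        info.setdefault("tool_args", {}).update(tool_args or {})

    def on_tool_output(self, tool_call_id, output):
        # Emit tool_start together with tool_output for stable ordering.
        info = self.pending.pop(tool_call_id, {"tool_call_id": tool_call_id})
        self.emitted.add(tool_call_id)
        return [dict(info, type="tool_start"),
                {"type": "tool_output", "tool_call_id": tool_call_id,
                 "output": output}]
```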
HIGH severity fixes:
- Extract _error_event() helper (3 sites)
- Extract _ab_comparison_from_row() helper (3 sites)
- Extract _trace_from_row() helper (3 sites)
- Consolidate agent_class resolution to _get_agent_class_from_cfg() (6 sites)

MEDIUM severity fixes:
- Extract _ndjson_response() on FlaskAppWrapper (2 sites)
- Merge _delete_git_repo/_delete_jira_project into _delete_source_documents()
- Shared _readNDJSON() async generator in chat.js (2 sites, fixes buffer drain bug)
- Merge like/dislike into _toggle_reaction() + _with_feedback_lock()
- Dedup format_links score logic via _format_score_str()

Net: -152 lines. All endpoints verified on submit76.
…ecution

- Add keep_alive='24h' to ChatOllama model creation so Ollama retains
  models in VRAM between requests (default was 5 min eviction)
- Add _prewarm_ab_models() to pre-load all AB variant models on startup
  via background threads calling Ollama /api/generate
- Add timing instrumentation to _stream_arm (thread start, vectorstore
  ready, first event, and completion timestamps)
- Both arms now start within 0.1-0.3s of each other (was ~19s+ gap
  due to cold model loading and sequential Ollama inference)
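The prewarm step amounts to firing one background load per variant model; a sketch with a pluggable loader (the real code calls Ollama's /api/generate):

```python
import threading

def prewarm_models(model_names, load_fn):
    """Start one daemon thread per model so all A/B variant models load
    in parallel at startup; returns the threads for optional joining."""
    threads = [threading.Thread(target=load_fn, args=(name,), daemon=True)
               for name in model_names]
    for t in threads:
        t.start()
    return threads
```

Joining is optional at startup; the point is that by the time a comparison runs, both arms find their model already resident in VRAM.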
…-param signature), unused delete/list methods, dead /api/ab/create endpoint
…or non-admins

- Add _get_request_client_id / _is_admin_request helpers to FlaskAppWrapper
- ab_get_pool: returns is_admin flag, only reveals pool details to admins
- ab_compare_stream: 403 for non-admins
- ab_get_metrics: 403 for non-admins
- ab_submit_preference / ab_get_pending: remain open (voting is ok)
- Frontend: pass client_id to getABPool/getABMetrics
- Frontend: hide entire A/B settings section for non-admins
- Template: wrap A/B toggle in #ab-settings-section, hidden by default
- New clone icon button in agent dropdown (copy icon, between name and edit)
- Clone mode opens editor pre-filled with source agent's name/prompt/tools
- Name auto-set to '<Original> (variant)' with text selected for easy rename
- Saves as new agent (mode: create) — no backend changes needed
- CSS: clone button hover uses primary green color
Replace the useless toggle + read-only banner with an interactive pool editor
in Settings → Advanced:
- Lists all available agents with checkboxes
- Click 'Champion' button to designate baseline agent
- 'Save Pool' button calls new POST /api/ab/pool/set endpoint
- 'Disable' button calls POST /api/ab/pool/disable
- Status badge shows Active/Inactive
- Validation: requires 2+ agents selected and one champion

Backend:
- POST /api/ab/pool/set: admin-only, builds ABPool from agent names
- POST /api/ab/pool/disable: admin-only, clears pool
- Both update ab_pool in-memory at runtime (no config file edit needed)

A/B mode now auto-activates when pool is saved (no separate toggle).
Option C implementation:
- ab_only field in AgentSpec: agents with ab_only: true in frontmatter are
  hidden from the main agent dropdown but appear in the pool editor
- Quick variant button (+) on each agent row in pool editor: opens inline
  panel to create a variant with tool toggles, saved with ab_only: true
- AB badge: agents marked ab_only show a blue 'AB' pill in the pool editor
- Client-side duplicate name check in variant panel (backend already has 409)
- Pool editor now uses allAgents (includes ab_only) for full visibility
# Conflicts:
#	src/cli/templates/base-config.yaml
#	src/interfaces/chat_app/app.py
#	src/interfaces/chat_app/static/chat.js
#	src/interfaces/chat_app/templates/index.html
@JasonMoho JasonMoho added the enhancement New feature or request label Feb 27, 2026
@pmlugato pmlugato linked an issue Mar 3, 2026 that may be closed by this pull request
@lucalavezzo lucalavezzo self-requested a review March 3, 2026 15:39

haozturk commented Mar 3, 2026


Thanks Jason. First impressions: a couple of issues I noticed:

  • I don't see where to select the preferred response. The feedback buttons disappeared
  • Timer doesn't work
  • It's not clear which response belongs to which config/model.
  • Agent activity is shown by default. I like it better when it's hidden by default, with the user able to expand it if they like
  • Can we configure the frequency? I don't think we want to do it for every message.

… trace, config nesting

- Add like/dislike/comment feedback buttons to A/B comparison arms
- Start and stop trace timers for A/B arm messages
- Emit ab_arms event early so frontend shows variant names immediately
- Start agent activity trace collapsed by default in normal mode
- Add CSS for variant name labels on A/B arms
- Move ab_testing config under services.chat_app (preferred path)
- Update load_ab_pool() to check services.chat_app.ab_testing first
- Keep legacy services.ab_testing fallback for backward compatibility
Use .get() with sensible defaults for host, hostname, template_folder,
static_folder, num_responses_until_feedback — these may not be present
in Postgres if config was seeded without CLI-added fields.

Copilot AI left a comment


Pull request overview

Adds a server-driven champion/challenger A/B testing “pool” to the chat app, including admin configuration UI, a new comparison streaming endpoint, and Postgres-backed per-variant metrics.

Changes:

  • Introduces ABPool/ABVariant utilities and server endpoints for pool config, streaming comparisons, preference submission, and metrics.
  • Refactors frontend A/B behavior to stream a single interleaved NDJSON comparison (two arms) and adds an admin-only pool editor + quick variant creation.
  • Extends DB schema and SQL to store variant metadata on comparisons and maintain aggregate variant metrics; adds Playwright E2E coverage.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| `tests/ui/workflows/21-ab-testing.spec.ts` | New E2E workflows covering admin gating, pool editor, comparison streaming, and voting. |
| `tests/ui/fixtures.ts` | Adds A/B-related mock data and route helpers for Playwright. |
| `src/utils/sql.py` | Extends A/B comparison queries and adds variant-metrics upsert/select queries. |
| `src/utils/conversation_service.py` | Stores/reads variant metadata on comparisons; adds variant-metrics update/query helpers. |
| `src/utils/ab_testing.py` | New pool/variant config loader, validation, and challenger sampling logic. |
| `src/interfaces/chat_app/templates/index.html` | Replaces old A/B toggle UI with admin-only pool editor section. |
| `src/interfaces/chat_app/static/chat.js` | Adds pool endpoints, NDJSON reader, pool editor UI, A/B stream handling, and "clone as variant" UX. |
| `src/interfaces/chat_app/static/chat.css` | Styles pool editor, quick-variant panel, and new A/B arm layout. |
| `src/interfaces/chat_app/event_formatter.py` | New shared formatter to unify streaming event shaping across normal and A/B streaming. |
| `src/interfaces/chat_app/app.py` | Adds pool/compare/metrics endpoints; implements threaded A/B streaming; integrates formatter; refactors helpers. |
| `src/cli/templates/init.sql` | Adds variant metadata columns to ab_comparisons and creates ab_variant_metrics. |
| `src/cli/templates/base-config.yaml` | Adds template support for ab_testing configuration. |
| `src/bin/service_chat.py` | Makes service startup more robust to missing config keys; resolves template/static paths. |
| `src/archi/providers/local_provider.py` | Changes Ollama model kwargs to keep models alive for longer. |
| `src/archi/pipelines/agents/base_react.py` | Emits implicit thinking events when providers produce empty chunks. |
| `src/archi/pipelines/agents/agent_spec.py` | Adds ab_only frontmatter flag support to agent specs. |
| `docs/docs/configuration.md` | Documents A/B testing pool configuration and variant fields. |
| `docs/docs/api_reference.md` | Documents pool/compare/metrics endpoints (and notes legacy). |
| `configs/submit76/config.yaml` | Adds a deployment config example enabling the A/B pool. |


@pmlugato pmlugato self-assigned this Mar 18, 2026
@pmlugato

@haozturk @hassan11196 @JasonMoho The PR has been updated and is ready to be reviewed now. One thing that needs your testing (Hasan, Hassan) is the integration with the RBAC setup, since I don't have the corresponding secrets, config entries, or whatever else is needed to launch this myself. I set up a proxy admin role to test, which worked, but not the real thing (you can also send me what I need to run what you guys have going).

PR description at the top of the page has been updated, including notable and minor fixes/changes since last update.

Let me know what y'all think and if it works, and what we need to do to get it in.


@haozturk haozturk left a comment


Overall looks very good, thank you @pmlugato ! There's a few changes I requested inline.


pmlugato commented Apr 1, 2026

hey @haozturk , I've addressed your comments, and tested with full RBAC set up thanks to help from you and @hassan11196 . Here are the details for your convenience:

  • Added the following A/B permissions:

    • ab:participate
    • ab:view
    • ab:metrics
    • ab:manage
  • Added a per-user A/B participation slider in Chat Settings:

    • deployment sample_rate remains the default
    • each participating user can override their own rate from 0-1
    • the setting is stored per user in Postgres
  • Kept runtime A/B spec resolution Postgres-only:

    • staged ab_agents_dir files are import inputs only
    • runtime reads from the database only
  • ab:participate

    • can be sampled into A/B comparisons
    • can see and use the personal participation-rate slider
  • ab:view

    • can open the A/B Testing page in read-only mode
  • ab:metrics

    • can open the A/B Testing page
    • can view aggregate comparison metrics
  • ab:manage

    • can change experiment settings
    • can manage variants
    • can create/delete A/B agent specs

target_roles and target_permissions are additional, optional filters on top of ab:participate.

  • If neither is set, all users with ab:participate are eligible.
  • If only target_roles is set, the user must match at least one listed role.
  • If only target_permissions is set, the user must match at least one listed permission.
  • If both are set, the user must satisfy both filters.
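The filter rules above amount to the following (illustrative sketch, not the shipped code):

```python
def is_eligible(user_roles, user_permissions, target_roles, target_permissions):
    """A/B sampling eligibility: ab:participate is always required;
    target_roles / target_permissions are optional additional filters."""
    if "ab:participate" not in user_permissions:
        return False
    if target_roles and not set(user_roles) & set(target_roles):
        return False  # must match at least one listed role
    if target_permissions and not set(user_permissions) & set(target_permissions):
        return False  # must match at least one listed permission
    return True
```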


@haozturk haozturk left a comment


Thanks a lot @pmlugato looks almost ready. See my inline comments. The restart issue looks major, others are minor.

{%- endfor %}
{%- endfor %}
{%- if services.chat_app.ab_testing is defined and services.chat_app.ab_testing.enabled | default(false) %}
ab_testing:

Here's one problem I noticed: if I restart the app, the UI-created variants and the experiment settings are reset, since the YAML config overwrites them. This will become a problem when we create new variants in the UI and restart the app: we'd have to re-create these variants on every restart.

I can think of the following solution: for the A/B testing config, use the YAML values during bootstrap if the A/B testing config in the DB is empty; otherwise, use the values in the DB. Keep an explicit force-override mode for cases where you intentionally want YAML to overwrite A/B too. This should be false by default.

In other words, YAML remains authoritative for most static config that can't be edited in the UI; the DB becomes authoritative only for UI-managed A/B settings (services.chat_app.ab_testing).

This is introducing an extra complication, but it's necessary if we want to manage these settings over the UI. Let me know what you think.
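The proposed precedence is simple to state in code (a sketch of the suggestion; `force_yaml_override` stands in for whatever the flag ends up being called):

```python
def resolve_ab_config(yaml_cfg, db_cfg, force_yaml_override=False):
    """YAML seeds the A/B config only when the DB copy is empty, or when
    the operator explicitly forces an override; otherwise the DB copy
    wins for UI-managed A/B settings."""
    if force_yaml_override or not db_cfg:
        return yaml_cfg
    return db_cfg
```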

<div class="provider-row">
<label class="settings-field-label" for="model-select-b">Model</label>
<select id="model-select-b" class="model-select model-select-b"></select>
<p class="settings-description">Move the slider higher to help evaluate more responses, or lower to reduce interruptions. If you leave it alone, the deployment default is used.</p>

The word "interruptions" implies a negative disruption. Use something like "or lower for a more standard single-response flow".

return data;
},

/**

Minor: Don't know where it's exactly managed, but when the quicker response finishes, its timer doesn't stop. It somehow made me think that "is there more to come?" on that response. I think we should make it stop when the response ends.

| `ab_agents_dir` | string | `/root/archi/ab_agents` | Optional legacy import directory for migrating A/B markdown specs into the DB catalog |
| `sample_rate` | float | `1.0` | Fraction of eligible turns that should run A/B |
| `disclosure_mode` | string | `post_vote_reveal` | One of `blind`, `post_vote_reveal`, `named` |
| `default_trace_mode` | string | `minimal` | One of `minimal`, `normal`, `verbose` |

As far as I understand, the following is the mapping between the config and what's shown in the UI:

  • minimal: Hidden
  • normal: Collapsed
  • verbose: Expanded

What's in the config doesn't make sense to me; I would use the same terms in the config as are shown in the UI.

| `enabled` | boolean | `false` | Enable the experiment pool |
| `ab_agents_dir` | string | `/root/archi/ab_agents` | Optional legacy import directory for migrating A/B markdown specs into the DB catalog |
| `sample_rate` | float | `1.0` | Fraction of eligible turns that should run A/B |
| `disclosure_mode` | string | `post_vote_reveal` | One of `blind`, `post_vote_reveal`, `named` |

Can we document the mapping between what's shown in the UI and the config? Also these names aren't very intuitive. Maybe consider renaming.


pmlugato commented Apr 3, 2026

@haozturk thanks again for the comments, I agree with them. Any UI changes to A/B testing configuration now persist across restarts. I also followed your suggestion and added a force_yaml_override option should it be needed. Timer is fixed, and naming is clearer and consistent between config/UI/docs. Thanks for the good testing, let me know if you find anything else or if we're good to go.



Successfully merging this pull request may close these issues.

update A/B testing
