Make Step 9 missing-token handling robust in agentic e2e fix #708

vishalramvelu wants to merge 7 commits into promptdriven:main
Conversation
Pull request overview
Improves the agentic e2e fix orchestrator’s Step 9 loop-control handling to better tolerate missing/ambiguous control tokens and provider failures, and updates related unit/E2E tests to match the new behavior.
Changes:
- Adds special handling for Step 9 when `_classify_step_output(..., step_num=9)` returns `None`, including a one-time Step 9 retry on successful-but-tokenless outputs.
- Treats Step 9 provider/timeout failures with no token as transient (start next cycle) rather than terminal.
- Updates unit/E2E assertions for expected call counts and cost with the Step 9 retry.
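The decision order described above can be sketched as a small pure function; the names (`resolve_step9`, `step_success`, `step9_retried`) are illustrative placeholders, not the PR's actual identifiers.

```python
def resolve_step9(token, step_success, step9_retried):
    """Sketch of the Step 9 loop-control decision (hypothetical names)."""
    if token is not None:
        return "HANDLE_TOKEN"        # normal path: dispatch the parsed token
    if not step_success:
        return "START_NEXT_CYCLE"    # provider/timeout failure: treat as transient
    if not step9_retried:
        return "RETRY_STEP9_ONCE"    # successful output but no parseable token
    return "STOP_WITH_MESSAGE"       # retry was also tokenless: stop explicitly
```

The point of the sketch is that only the fourth branch terminates the workflow, and only after exactly one retry.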
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `pdd/agentic_e2e_fix_orchestrator.py` | Adds Step 9 None-token branching with transient-failure handling and a single retry path. |
| `tests/test_agentic_e2e_fix_orchestrator.py` | Updates unit test expectations to include a single Step 9 retry (10 calls). |
| `tests/test_e2e_issue_830_workflow_stall.py` | Updates E2E cost expectation to include Step 9 retry cost. |
Comments suppressed due to low confidence (1)
tests/test_e2e_issue_830_workflow_stall.py:277
- The assertion failure message still says the total cost "exceeds single cycle cost", but `one_cycle_cost` now includes a Step 9 retry (i.e., 1 cycle + retry). To avoid confusing failures, update the message (and optionally the variable name) to reflect the new cost model.
```python
# 1 cycle + a single Step 9 retry:
# 8 steps at $0.10 + 2 steps at $0.26 = $1.32
one_cycle_cost = (8 * COST_PER_STEP) + (2 * STEP9_COST)
# Allow small margin for floating point
assert total_cost <= one_cycle_cost + 0.01, (
    f"BUG DETECTED (Issue #830): Total cost ${total_cost:.2f} exceeds single cycle "
    f"cost ${one_cycle_cost:.2f}. Missing loop control token caused unnecessary "
    f"additional cycles, wasting budget."
)
```
gltanaka left a comment
Hey Vishal, thanks for digging into this — the problem is real and I've seen the early terminations you're fixing here. Good catch on distinguishing provider failures from ambiguous outputs. A few things I'd like to see addressed before merging:
Code duplication in the retry path
The retry token-handling block (lines ~1285–1370) is nearly identical to the original token-handling block right above it (ALL_TESTS_PASS verification, MAX_CYCLES_REACHED, CONTINUE_CYCLE). Could you extract the shared logic into a helper function? That way both the initial attempt and retry call the same code path, and we don't have to keep them in sync manually.
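A minimal shape for that helper might look like the following; the function and handler names are hypothetical, not taken from the PR.

```python
def handle_loop_token(token, on_pass, on_max, on_continue):
    """Dispatch a Step 9 loop-control token to shared handlers.

    If both the initial attempt and the retry call this one function,
    the two paths cannot drift apart.
    """
    handlers = {
        "ALL_TESTS_PASS": on_pass,
        "MAX_CYCLES_REACHED": on_max,
        "CONTINUE_CYCLE": on_continue,
    }
    handler = handlers.get(token)
    if handler is None:
        raise ValueError(f"unexpected loop-control token: {token!r}")
    return handler()
```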
--max-turns change is unrelated
The --max-turns 25 addition to Gemini CLI args is a useful change but it's not related to issue #695. Would you mind splitting that into its own PR? Also the comment above it says "15 turns" but the code sets 25 — one of those needs updating.
Silent exception swallowing
The except Exception: pass block around state persistence (around line 1310) could hide real issues. Even a logger.debug(...) there would help with debugging if state persistence ever breaks silently.
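Something along these lines would keep persistence best-effort while leaving a trace; the wrapper name is illustrative.

```python
import logging

logger = logging.getLogger(__name__)

def persist_state(save_fn):
    """Best-effort state persistence that still records failures."""
    try:
        save_fn()
    except Exception:
        # Debug-level is enough: persistence is best-effort, but the
        # failure should not be invisible when debugging.
        logger.debug("Step 9 state persistence failed", exc_info=True)
```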
Consider the simpler approach
Have you considered just treating None as CONTINUE_CYCLE (i.e., let the next cycle handle it)? That would be a much smaller change and still solves the core problem of not terminating early on transient failures. The single-retry behavior is nice but adds a lot of surface area — curious if you've seen cases where the retry actually recovers vs just letting the next full cycle do it.
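For comparison, that simpler alternative is essentially a one-liner (name hypothetical):

```python
def token_or_continue(token):
    """Treat an ambiguous Step 9 result (None) as CONTINUE_CYCLE and
    let the next full cycle resolve it."""
    return token if token is not None else "CONTINUE_CYCLE"
```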
Overall the direction is right, just think it can be tightened up. Nice work on the thorough test coverage for the cost model too — that's a detail most people skip. 👍
fix: resolve prompt discovery bugs in pdd update --repo
Just made these changes. I believe the max-turns change was from another commit from Serhan, but I have removed it. As for the simpler approach, I looked into it but kept the bounded single-retry strategy, since None can represent more than transient provider failures. The current behavior is designed to avoid unnecessary extra cycles and budget burn, and a few local tests I ran worked with a single retry. I'm happy to revisit in a follow-up based on trace evidence.
Hey Vishal, we've been reviewing this PR and cherry-picked the commits into a worktree for testing. The mocked unit tests confirm the control flow fix is correct, and all 75/75 cloud batch tests pass with no regressions, so the code looks solid.

One thing we noticed: neither the PR description nor issue #695 includes evidence of this bug actually firing in a real workflow. The "Manual Testing" section references pytest runs, but not an actual run.

Do you have any production logs or workflow state files showing a run that terminated with "Workflow stopped: no loop control token in Step 9 output"? Even a single real example would help confirm the fix addresses what's actually happening in practice vs. a theoretical code path. Issue #695 still has "Vishal to add details" in the title with an empty body — any context you can add there would help too.
target 3/24
Thanks for the thorough review. I actually don't have production logs or a saved .pdd/e2e-fix-state snapshot from a run that ended with the old generic line "Workflow stopped: no loop control token in Step 9 output." The bug was found through code-path review plus flaky local/agentic runs where Step 9 either failed without a token (timeouts / "All agent providers failed") or returned success text without a parseable control token.

What I can point to today is a controlled reproduction. Issue #695 describes Case 1 (provider/timeout → non-token error string → `_classify_step_output` → `None` → old terminal branch) and Case 2 (successful but non-conforming text → `None` → same branch). Those match the branches we fixed. Verification comes from the mocked unit tests for the Step 9 control flow, plus your 75/75 cloud batch run, to guard against regressions.

I agree the gap is observability in the wild: if we see another mid-workflow stop, capturing the GitHub state comment (or local e2e-fix-state JSON) plus the Step 9 console chunk would be ideal, so if it happens again I will record it. As for issue #695, I already explained it in a comment below, detailing the issue.
Unit tests fail because of the issue #903 tests (unrelated to the Step 9 changes).
gltanaka left a comment
Branch is behind main — missing the convergence prompt requirements added in #919.
The two failing tests (test_orchestrator_prompt_contains_empty_dev_units_requirement, test_orchestrator_prompt_contains_file_hash_comparison_requirement) were added to main after this branch forked. The tests exist in the merge commit CI runs against, but the orchestrator prompt changes that satisfy them are not on this branch.
Fix: Merge main into fix/695 (or rebase) and push.
gltanaka left a comment
Thanks for digging into this, Vishal — the issue is real and the analysis in #695 is solid. The two scenarios you identified (provider failure + missing token) are genuine gaps in the Step 9 handling.
That said, since this PR was originally written, main has changed significantly — the 4-tier control token detection (#865, landed March 18) now covers the most common scenario (Case 2: LLM paraphrases the token). Tiers 2-4 (case-insensitive, semantic regex, LLM classification) catch the vast majority of cases where the model emits something meaningful but not an exact token match.
What's still unaddressed on main (and worth fixing):
- **Step 9 provider/timeout failure (`step_success=False`)** — When Step 9 itself fails (timeout, providers down), the output is an error string like `"Timeout"` that won't match any tier. The `else` branch terminates the workflow. Step 1 already has a timeout retry (lines 1217-1233) and Step 2 has a skip-on-timeout (lines 1249-1254), but Step 9 has no equivalent.
- **Tier 4 LLM classification failing silently** — `classify_step_output` has a bare `except Exception: return None` (agentic_common.py:199). If the Gemini Flash call itself times out or rate-limits, the safety net disappears silently. The orchestrator can't distinguish "tier 4 confidently said NONE" from "tier 4 crashed." This compounds with issue 1 — even a successful Step 9 with a paraphrased token can terminate if tiers 1-3 miss and tier 4 is down.

What I'd suggest for a v2:

- Scope the fix to just these two gaps. The 157-line `_handle_step9_control_token` nested function reimplements logic that already works on main (ALL_TESTS_PASS verification, CONTINUE_CYCLE file-hash check, etc.). That duplication makes the code harder to maintain and risks diverging from the existing branches.
- For issue 1: a simple check — if `step_success is False` and `_step9_token is None`, log a warning and break without setting `final_message` (so the outer loop starts the next cycle). Similar pattern to the Step 1 timeout retry.
- For issue 2: have `classify_step_output` return a distinct sentinel (or raise) on exception vs. "confidently NONE", so the caller can distinguish the two. Then the Step 9 `else` branch can retry or continue the cycle on classification failure, while still terminating on a confident "no token found."
- The prompt changes (API mocking best practices in step 11, convergence requirements) look like they belong to #633 — splitting them into their own PR would keep this one focused.
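The sentinel idea for the classifier could look roughly like this; `CLASSIFICATION_ERROR` and the wrapper are hypothetical, not the module's actual API.

```python
# Unique sentinel: distinct from any token string and from None.
CLASSIFICATION_ERROR = object()

def classify_with_sentinel(classifier, output):
    """Distinguish 'confidently no token' (None) from 'classifier crashed'."""
    try:
        return classifier(output)  # token string, or None when no token found
    except Exception:
        return CLASSIFICATION_ERROR

# Caller sketch:
#   token = classify_with_sentinel(tier4_llm_classify, step9_output)
#   if token is CLASSIFICATION_ERROR: retry or continue the cycle
#   elif token is None: terminate with a clear message
```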
The test updates (adjusting call counts, cost assertions) would naturally follow from the simpler fix.
Appreciate the thorough writeup on the issue and the test coverage here — the core insight is right, just needs to be rebased against what main looks like today.
Made-with: Cursor
Thanks for the review, and I agree with the scope.

Greg to review
gltanaka left a comment
Hey Vishal, nice work on the core improvements here — the CLASSIFICATION_ERROR sentinel, using the actual tier-4 token instead of always defaulting to CONTINUE_CYCLE, and the provider-failure handling are all solid changes that should land.
However, the retry block has two functional bugs caused by the duplicated control flow diverging from the original Step 9 path:
1. Missing _handle_verification_failure() on retry verification failure
The original path (line ~1410) calls _handle_verification_failure(verify_output, import_error_retries, console) which detects import errors and sets up verification_failure_context + last_completed_step = 0 for a special retry. The retry path just does last_completed_step = step_num - 1 and falls through — it loses the import error retry logic entirely.
2. Missing _detect_changed_files() check on retry CONTINUE_CYCLE
The original CONTINUE_CYCLE path (line ~1430) checks _detect_changed_files(cwd, cycle_start_hashes) and stops the workflow if no files changed during the cycle (the infinite loop guard). The retry path's CONTINUE_CYCLE branch unconditionally breaks to the next cycle, skipping this guard.
Both of these are the exact kind of divergence that comes from duplicating a complex control flow block. The cleanest fix would be to extract the Step 9 token-handling logic (ALL_TESTS_PASS verification, MAX_CYCLES_REACHED, CONTINUE_CYCLE + file-hash check) into a helper function that both the initial path and retry path call — this was suggested back on March 20 and would prevent these kinds of bugs structurally.
One more small thing: the except Exception: pass around state persistence (~line 1480 in the diff) should at least have a logger.debug(...) so silent state persistence failures are diagnosable.
Made-with: Cursor

Merge conflicts: pdd/prompts/agentic_e2e_fix_orchestrator_python.prompt
Ok, I refactored Step 9 so the one-shot retry and the first run both go through `_resolve_step9_loop_token` / `_apply_step9_resolved_token`, which fixes the missing `_handle_verification_failure` path and the missing `_detect_changed_files` guard on retry CONTINUE_CYCLE. Best-effort state persistence after the Step 9 retry now logs failures with `logger.debug(..., exc_info=True)` instead of a bare `pass`. I also merged main and resolved `agentic_e2e_fix_orchestrator_python.prompt` by keeping the detailed convergence bullets and the Issue #903 wording.
Summary
Improves the robustness of Step 9 (final verification / loop control) in the agentic e2e fix workflow. When _classify_step_output(..., step_num=9) returns None, the orchestrator now distinguishes between transient provider failures and ambiguous but successful outputs: failed Step 9 calls with no token start the next cycle instead of stopping mid‑workflow, while successful but tokenless Step 9 responses trigger a single retry. If the retry still does not emit a valid loop‑control token, the workflow stops with an explicit message instead of a vague “no loop control token” warning. This makes loop behavior resilient to provider timeouts and minor format drift, and reduces confusing early terminations.
Test Results
Unit tests (Step 9 loop‑control logic): PASS
E2E tests (Issue 830 scenarios): PASS
Full e2e suite for this issue: PASS
Manual Testing
Ran `uv run pytest` to validate:
- Step 9 with no loop-control token results in one cycle plus a single Step 9 retry, without starting Cycle 2.

Ran `uv run pytest tests/test_e2e_issue_830_workflow_stall.py::TestStep9NoneTokenHandling -q` to validate:
- Successful Step 9 with no token, even after retry, stops with a clear "did not emit a loop control token after retry" message and does not start Cycle 2.

Verified that in the "missing token" cost scenario, `total_cost` reflects a single cycle plus one Step 9 retry, not an extra full cycle.
Checklist
Fixes #695