
Make Step 9 missing-token handling robust in agentic e2e fix#708

Open
vishalramvelu wants to merge 7 commits into promptdriven:main from vishalramvelu:fix/695

Conversation


@vishalramvelu vishalramvelu commented Mar 17, 2026

Summary

Improves the robustness of Step 9 (final verification / loop control) in the agentic e2e fix workflow. When _classify_step_output(..., step_num=9) returns None, the orchestrator now distinguishes between transient provider failures and ambiguous but successful outputs: a failed Step 9 call with no token starts the next cycle instead of stopping mid-workflow, while a successful but tokenless Step 9 response triggers a single retry. If the retry still does not emit a valid loop-control token, the workflow stops with an explicit message instead of the vague "no loop control token" warning. This makes loop behavior resilient to provider timeouts and minor format drift, and reduces confusing early terminations.
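In outline, the new decision logic can be sketched as follows. This is a minimal sketch with illustrative names; the real orchestrator tracks considerably more state:

```python
def handle_step9_missing_token(step_success, token, already_retried):
    """Decide loop control when Step 9's token classification returns None.

    Illustrative sketch only; names do not match the real orchestrator.
    Returns one of: a token, "NEXT_CYCLE", "RETRY_STEP9", or "STOP".
    """
    if token is not None:
        return token  # normal path: a valid loop-control token was parsed
    if not step_success:
        # Transient provider failure (timeout, providers down):
        # start the next cycle instead of terminating mid-workflow.
        return "NEXT_CYCLE"
    if not already_retried:
        # Successful output with no parseable token: retry Step 9 once.
        return "RETRY_STEP9"
    # The retry also produced no token: stop with an explicit message.
    return "STOP"
```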

Test Results

Unit tests (Step 9 loop‑control logic): PASS
E2E tests (Issue 830 scenarios): PASS
Full e2e suite for this issue: PASS

Manual Testing

  • Ran uv run pytest.
  • Ran uv run pytest tests/test_e2e_issue_830_workflow_stall.py::TestStep9NoneTokenHandling -q to validate:
      • Step 9 with no loop-control token results in one cycle plus a single Step 9 retry, without starting Cycle 2.
      • A successful Step 9 with no token, even after retry, stops with a clear "did not emit a loop control token after retry" message and does not start Cycle 2.
  • Verified that in the "missing token" cost scenario, total_cost reflects a single cycle plus one Step 9 retry, not an extra full cycle.

Checklist

  • Handled transient Step 9 failures (provider/timeout) by starting the next cycle instead of treating them as terminal.
  • Added a single, explicit retry for successful Step 9 outputs that lack a parseable loop‑control token.
  • Updated unit and E2E tests to exercise the new behaviors and cost model (one cycle + retry, no unnecessary second cycle).
  • Confirmed user‑facing messages are explicit when Step 9 repeatedly fails to emit a valid loop‑control token, avoiding vague “no loop control token” terminations.

Fixes #695


Copilot AI left a comment


Pull request overview

Improves the agentic e2e fix orchestrator’s Step 9 loop-control handling to better tolerate missing/ambiguous control tokens and provider failures, and updates related unit/E2E tests to match the new behavior.

Changes:

  • Adds special handling for Step 9 when _classify_step_output(..., step_num=9) returns None, including a one-time Step 9 retry on successful-but-tokenless outputs.
  • Treats Step 9 provider/timeout failures with no token as transient (start next cycle) rather than terminal.
  • Updates unit/E2E assertions for expected call counts and cost with the Step 9 retry.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File / Description

  • pdd/agentic_e2e_fix_orchestrator.py: Adds Step 9 None-token branching with transient-failure handling and a single retry path.
  • tests/test_agentic_e2e_fix_orchestrator.py: Updates unit test expectations to include a single Step 9 retry (10 calls).
  • tests/test_e2e_issue_830_workflow_stall.py: Updates E2E cost expectation to include Step 9 retry cost.
Comments suppressed due to low confidence (1)

tests/test_e2e_issue_830_workflow_stall.py:277

  • The assertion failure message still says the total cost "exceeds single cycle cost", but one_cycle_cost now includes a Step 9 retry (i.e., 1 cycle + retry). To avoid confusing failures, update the message (and optionally the variable name) to reflect the new cost model.
        # 1 cycle + a single Step 9 retry:
        # 8 steps at $0.10 + 2 steps at $0.26 = $1.32
        one_cycle_cost = (8 * COST_PER_STEP) + (2 * STEP9_COST)
        # Allow small margin for floating point
        assert total_cost <= one_cycle_cost + 0.01, (
            f"BUG DETECTED (Issue #830): Total cost ${total_cost:.2f} exceeds single cycle "
            f"cost ${one_cycle_cost:.2f}. Missing loop control token caused unnecessary "
            f"additional cycles, wasting budget."
        )

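One possible wording for the updated assertion, keeping the variables from the snippet above. The renamed variable and message text are suggestions, not the merged code, and the cost constants and total_cost value are stand-ins:

```python
COST_PER_STEP = 0.10  # assumed per-step cost from the test constants
STEP9_COST = 0.26     # assumed Step 9 cost from the test constants
total_cost = 1.32     # stand-in for the cost reported by the mocked run

# Renamed from one_cycle_cost to reflect the new model: 1 cycle + 1 Step 9 retry.
one_cycle_plus_retry_cost = (8 * COST_PER_STEP) + (2 * STEP9_COST)
# Allow a small margin for floating point.
assert total_cost <= one_cycle_plus_retry_cost + 0.01, (
    f"BUG DETECTED (Issue #830): Total cost ${total_cost:.2f} exceeds the expected "
    f"one-cycle-plus-Step-9-retry cost ${one_cycle_plus_retry_cost:.2f}. Missing loop "
    f"control token caused unnecessary additional cycles, wasting budget."
)
```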


@gltanaka gltanaka left a comment


Hey Vishal, thanks for digging into this — the problem is real and I've seen the early terminations you're fixing here. Good catch on distinguishing provider failures from ambiguous outputs. A few things I'd like to see addressed before merging:

Code duplication in the retry path

The retry token-handling block (lines ~1285–1370) is nearly identical to the original token-handling block right above it (ALL_TESTS_PASS verification, MAX_CYCLES_REACHED, CONTINUE_CYCLE). Could you extract the shared logic into a helper function? That way both the initial attempt and retry call the same code path, and we don't have to keep them in sync manually.

--max-turns change is unrelated

The --max-turns 25 addition to Gemini CLI args is a useful change but it's not related to issue #695. Would you mind splitting that into its own PR? Also the comment above it says "15 turns" but the code sets 25 — one of those needs updating.

Silent exception swallowing

The except Exception: pass block around state persistence (around line 1310) could hide real issues. Even a logger.debug(...) there would help with debugging if state persistence ever breaks silently.
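A minimal version of that change might be the following; the logger name is an assumption:

```python
import logging

logger = logging.getLogger("pdd.agentic_e2e_fix_orchestrator")  # assumed name

def persist_state(save):
    """Best-effort state persistence that still leaves a trace on failure."""
    try:
        save()
    except Exception:
        # Instead of a bare `pass`, record the failure for later debugging.
        logger.debug("Step 9 state persistence failed", exc_info=True)
```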

Consider the simpler approach

Have you considered just treating None as CONTINUE_CYCLE (i.e., let the next cycle handle it)? That would be a much smaller change and still solves the core problem of not terminating early on transient failures. The single-retry behavior is nice but adds a lot of surface area — curious if you've seen cases where the retry actually recovers vs just letting the next full cycle do it.
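The simpler fallback described here would collapse to a one-liner along these lines (function name illustrative):

```python
def resolve_step9_token(token):
    # Treat an unclassifiable Step 9 output as "keep going" rather than terminal.
    return token if token is not None else "CONTINUE_CYCLE"
```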

Overall the direction is right, just think it can be tightened up. Nice work on the thorough test coverage for the cost model too — that's a detail most people skip. 👍

Serhan-Asad pushed a commit that referenced this pull request Mar 20, 2026
fix: resolve prompt discovery bugs in pdd update --repo
@vishalramvelu
Contributor Author


Just made these changes. I believe the max-turns change came from another commit of Serhan's, but I have removed it. As for the simpler approach: I looked into it but kept the bounded single-retry strategy, since None can represent more than transient provider failures. The current behavior is designed to avoid unnecessary extra cycles and budget burn, and a few local tests I ran worked with the single retry. I'm happy to revisit in a follow-up based on trace evidence.

@gltanaka
Contributor

Hey Vishal, we've been reviewing this PR and cherry-picked the commits into a worktree for testing. The mocked unit tests confirm the control flow fix is correct, and all 75/75 cloud batch tests pass with no regressions — so the code looks solid.

One thing we noticed: neither the PR description nor issue #695 includes evidence of this bug actually firing in a real workflow. The "Manual Testing" section references pytest runs, but not an actual pdd fix against a real issue.

Do you have any production logs or workflow state files showing a run that terminated with "Workflow stopped: no loop control token in Step 9 output"? Even a single real example would help confirm the fix addresses what's actually happening in practice vs. a theoretical code path. Issue #695 still has "Vishal to add details" in the title with an empty body — any context you can add there would help too.

@gltanaka
Contributor

target 3/24

@vishalramvelu
Contributor Author


Thanks for the thorough review. I don't have production logs or a saved .pdd/e2e-fix-state snapshot from a run that ended with the old generic line "Workflow stopped: no loop control token in Step 9 output". The bug was found through code-path review plus flaky local/agentic runs where Step 9 either failed without a token (timeouts / "All agent providers failed") or returned success text without a parseable control token. What I can point to today is a controlled reproduction: issue #695 describes Case 1 (provider/timeout → non-token error string → _classify_step_output → None → old terminal branch) and Case 2 (successful but non-conforming text → None → same branch), which match the branches we fixed. There is also verification from the mocked unit tests for the Step 9 control flow, plus your 75/75 cloud batch run, to guard against regressions. I agree the gap is observability in the wild: if we see another mid-workflow stop, capturing the GitHub state comment (or the local e2e-fix-state JSON) plus the Step 9 console chunk would be ideal, so if it happens again I will record it. As for issue #695, I have already added a comment there detailing the issue.

@vishalramvelu
Contributor Author

Unit tests fail because of the issue #903 tests, which are unrelated to the Step 9 changes.


@gltanaka gltanaka left a comment


Branch is behind main — missing the convergence prompt requirements added in #919.

The two failing tests (test_orchestrator_prompt_contains_empty_dev_units_requirement, test_orchestrator_prompt_contains_file_hash_comparison_requirement) were added to main after this branch forked. The tests exist in the merge commit CI runs against, but the orchestrator prompt changes that satisfy them are not on this branch.

Fix: Merge main into fix/695 (or rebase) and push.


@gltanaka gltanaka left a comment


Thanks for digging into this, Vishal — the issue is real and the analysis in #695 is solid. The two scenarios you identified (provider failure + missing token) are genuine gaps in the Step 9 handling.

That said, since this PR was originally written, main has changed significantly — the 4-tier control token detection (#865, landed March 18) now covers the most common scenario (Case 2: LLM paraphrases the token). Tiers 2-4 (case-insensitive, semantic regex, LLM classification) catch the vast majority of cases where the model emits something meaningful but not an exact token match.

What's still unaddressed on main (and worth fixing):

  1. Step 9 provider/timeout failure (step_success=False) — When Step 9 itself fails (timeout, providers down), the output is an error string like "Timeout" that won't match any tier. The else branch terminates the workflow. Step 1 already has a timeout retry (lines 1217-1233) and Step 2 has a skip-on-timeout (lines 1249-1254), but Step 9 has no equivalent.

  2. Tier 4 LLM classification failing silently: classify_step_output has a bare except Exception: return None (agentic_common.py:199). If the Gemini Flash call itself times out or rate-limits, the safety net disappears silently. The orchestrator can't distinguish "tier 4 confidently said NONE" from "tier 4 crashed." This compounds with issue 1: even a successful Step 9 with a paraphrased token can terminate if tiers 1-3 miss and tier 4 is down.

What I'd suggest for a v2:

  • Scope the fix to just these two gaps. The 157-line _handle_step9_control_token nested function reimplements logic that already works on main (ALL_TESTS_PASS verification, CONTINUE_CYCLE file-hash check, etc.). That duplication makes the code harder to maintain and risks diverging from the existing branches.
  • For issue 1: a simple check — if step_success is False and _step9_token is None, log a warning and break without setting final_message (so the outer loop starts the next cycle). Similar pattern to the Step 1 timeout retry.
  • For issue 2: have classify_step_output return a distinct sentinel (or raise) on exception vs. "confidently NONE", so the caller can distinguish the two. Then the Step 9 else branch can retry or continue the cycle on classification failure, while still terminating on a confident "no token found."
  • The prompt changes (API mocking best practices in step 11, convergence requirements) look like they belong to #633 — splitting them into their own PR would keep this one focused.
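The second suggestion (a distinct sentinel for classifier failure) could be sketched like this. The sentinel name and both signatures are assumptions, not the real agentic_common API:

```python
CLASSIFICATION_ERROR = object()  # distinct sentinel: the classifier itself failed

def classify_step_output(output, classifier):
    """Return a token, None for a confident "no token", or CLASSIFICATION_ERROR
    if the tier-4 classifier call itself raised (timeout, rate limit, ...)."""
    try:
        return classifier(output)
    except Exception:
        return CLASSIFICATION_ERROR

def step9_decision(token, step_success):
    # Transient conditions (classifier down, provider failure) continue the loop;
    # a confident "no token" on a successful step is still terminal.
    if token is CLASSIFICATION_ERROR or (token is None and not step_success):
        return "next_cycle"
    if token is None:
        return "stop_no_token"
    return token
```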

The test updates (adjusting call counts, cost assertions) would naturally follow from the simpler fix.

Appreciate the thorough writeup on the issue and the test coverage here — the core insight is right, just needs to be rebased against what main looks like today.

@vishalramvelu
Contributor Author


Thanks for the review; I agree with the scope. I just reworked this PR to address only the two gaps on current main:

  • Step 9 provider/timeout failure: if step_success=False and there is no token, treat it as transient and continue to the next cycle (no terminal stop).
  • Tier-4 classifier failure vs. confident NONE: return a distinct classification-error signal so Step 9 can handle "classifier down" separately from a true "no token."

@gltanaka
Contributor

Greg to review


@gltanaka gltanaka left a comment


Hey Vishal, nice work on the core improvements here — the CLASSIFICATION_ERROR sentinel, using the actual tier-4 token instead of always defaulting to CONTINUE_CYCLE, and the provider-failure handling are all solid changes that should land.

However, the retry block has two functional bugs caused by the duplicated control flow diverging from the original Step 9 path:

1. Missing _handle_verification_failure() on retry verification failure

The original path (line ~1410) calls _handle_verification_failure(verify_output, import_error_retries, console) which detects import errors and sets up verification_failure_context + last_completed_step = 0 for a special retry. The retry path just does last_completed_step = step_num - 1 and falls through — it loses the import error retry logic entirely.

2. Missing _detect_changed_files() check on retry CONTINUE_CYCLE

The original CONTINUE_CYCLE path (line ~1430) checks _detect_changed_files(cwd, cycle_start_hashes) and stops the workflow if no files changed during the cycle (the infinite loop guard). The retry path's CONTINUE_CYCLE branch unconditionally breaks to the next cycle, skipping this guard.

Both of these are the exact kind of divergence that comes from duplicating a complex control flow block. The cleanest fix would be to extract the Step 9 token-handling logic (ALL_TESTS_PASS verification, MAX_CYCLES_REACHED, CONTINUE_CYCLE + file-hash check) into a helper function that both the initial path and retry path call — this was suggested back on March 20 and would prevent these kinds of bugs structurally.

One more small thing: the except Exception: pass around state persistence (~line 1480 in the diff) should at least have a logger.debug(...) so silent state persistence failures are diagnosable.

Made-with: Cursor

@vishalramvelu
Contributor Author


OK, I refactored Step 9 so the one-shot retry and the first run both go through _resolve_step9_loop_token / _apply_step9_resolved_token, which fixes the missing _handle_verification_failure path and the missing _detect_changed_files guard on retry CONTINUE_CYCLE. Best-effort state persistence after the Step 9 retry now logs failures with logger.debug(..., exc_info=True) instead of a bare pass. I also merged main and resolved the conflict in agentic_e2e_fix_orchestrator_python.prompt, keeping the detailed convergence bullets and the Issue #903 wording.



Development

Successfully merging this pull request may close these issues.

pdd fix stops midstream (Vishal to add details)
