Make Step 9 missing-token handling robust in agentic e2e fix #708

vishalramvelu wants to merge 7 commits into promptdriven:main
Conversation
Pull request overview
Improves the agentic e2e fix orchestrator’s Step 9 loop-control handling to better tolerate missing/ambiguous control tokens and provider failures, and updates related unit/E2E tests to match the new behavior.
Changes:
- Adds special handling for Step 9 when `_classify_step_output(..., step_num=9)` returns `None`, including a one-time Step 9 retry on successful-but-tokenless outputs.
- Treats Step 9 provider/timeout failures with no token as transient (start next cycle) rather than terminal.
- Updates unit/E2E assertions for expected call counts and cost with the Step 9 retry.
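The decision order described above can be sketched as a small pure function; the names (`resolve_step9`, `step_success`, `step9_retried`) are illustrative placeholders, not the PR's actual identifiers.

```python
def resolve_step9(token, step_success, step9_retried):
    """Sketch of the Step 9 loop-control decision (hypothetical names)."""
    if token is not None:
        return "HANDLE_TOKEN"        # normal path: dispatch the parsed token
    if not step_success:
        return "START_NEXT_CYCLE"    # provider/timeout failure: treat as transient
    if not step9_retried:
        return "RETRY_STEP9_ONCE"    # successful output but no parseable token
    return "STOP_WITH_MESSAGE"       # retry was also tokenless: stop explicitly
```

The point of the sketch is that only the fourth branch terminates the workflow, and only after exactly one retry.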
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `pdd/agentic_e2e_fix_orchestrator.py` | Adds Step 9 None-token branching with transient-failure handling and a single retry path. |
| `tests/test_agentic_e2e_fix_orchestrator.py` | Updates unit test expectations to include a single Step 9 retry (10 calls). |
| `tests/test_e2e_issue_830_workflow_stall.py` | Updates E2E cost expectation to include Step 9 retry cost. |
Comments suppressed due to low confidence (1)
tests/test_e2e_issue_830_workflow_stall.py:277
- The assertion failure message still says the total cost "exceeds single cycle cost", but `one_cycle_cost` now includes a Step 9 retry (i.e., 1 cycle + retry). To avoid confusing failures, update the message (and optionally the variable name) to reflect the new cost model.
```python
# 1 cycle + a single Step 9 retry:
# 8 steps at $0.10 + 2 steps at $0.26 = $1.32
one_cycle_cost = (8 * COST_PER_STEP) + (2 * STEP9_COST)
# Allow small margin for floating point
assert total_cost <= one_cycle_cost + 0.01, (
    f"BUG DETECTED (Issue #830): Total cost ${total_cost:.2f} exceeds single cycle "
    f"cost ${one_cycle_cost:.2f}. Missing loop control token caused unnecessary "
    f"additional cycles, wasting budget."
)
```
gltanaka left a comment
Hey Vishal, thanks for digging into this — the problem is real and I've seen the early terminations you're fixing here. Good catch on distinguishing provider failures from ambiguous outputs. A few things I'd like to see addressed before merging:
Code duplication in the retry path
The retry token-handling block (lines ~1285–1370) is nearly identical to the original token-handling block right above it (ALL_TESTS_PASS verification, MAX_CYCLES_REACHED, CONTINUE_CYCLE). Could you extract the shared logic into a helper function? That way both the initial attempt and retry call the same code path, and we don't have to keep them in sync manually.
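A minimal shape for that helper might look like the following; the function and handler names are hypothetical, not taken from the PR.

```python
def handle_loop_token(token, on_pass, on_max, on_continue):
    """Dispatch a Step 9 loop-control token to shared handlers.

    If both the initial attempt and the retry call this one function,
    the two paths cannot drift apart.
    """
    handlers = {
        "ALL_TESTS_PASS": on_pass,
        "MAX_CYCLES_REACHED": on_max,
        "CONTINUE_CYCLE": on_continue,
    }
    handler = handlers.get(token)
    if handler is None:
        raise ValueError(f"unexpected loop-control token: {token!r}")
    return handler()
```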
--max-turns change is unrelated
The --max-turns 25 addition to Gemini CLI args is a useful change but it's not related to issue #695. Would you mind splitting that into its own PR? Also the comment above it says "15 turns" but the code sets 25 — one of those needs updating.
Silent exception swallowing
The except Exception: pass block around state persistence (around line 1310) could hide real issues. Even a logger.debug(...) there would help with debugging if state persistence ever breaks silently.
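Something along these lines would keep persistence best-effort while leaving a trace; the wrapper name is illustrative.

```python
import logging

logger = logging.getLogger(__name__)

def persist_state(save_fn):
    """Best-effort state persistence that still records failures."""
    try:
        save_fn()
    except Exception:
        # Debug-level is enough: persistence is best-effort, but the
        # failure should not be invisible when debugging.
        logger.debug("Step 9 state persistence failed", exc_info=True)
```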
Consider the simpler approach
Have you considered just treating None as CONTINUE_CYCLE (i.e., let the next cycle handle it)? That would be a much smaller change and still solves the core problem of not terminating early on transient failures. The single-retry behavior is nice but adds a lot of surface area — curious if you've seen cases where the retry actually recovers vs just letting the next full cycle do it.
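For comparison, that simpler alternative is essentially a one-liner (name hypothetical):

```python
def token_or_continue(token):
    """Treat an ambiguous Step 9 result (None) as CONTINUE_CYCLE and
    let the next full cycle resolve it."""
    return token if token is not None else "CONTINUE_CYCLE"
```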
Overall the direction is right, just think it can be tightened up. Nice work on the thorough test coverage for the cost model too — that's a detail most people skip. 👍
fix: resolve prompt discovery bugs in pdd update --repo
Just made these changes. I believe the max-turns change was from another commit from Serhan, but I have removed it. As for the simpler approach, I looked into it but kept the bounded single-retry strategy, since None can represent more than transient provider failures. The current behavior is designed to avoid unnecessary extra cycles and budget burn, and a few local tests I ran worked with a single retry. I'm happy to revisit in a follow-up based on trace evidence.
Hey Vishal, we've been reviewing this PR and cherry-picked the commits into a worktree for testing. The mocked unit tests confirm the control flow fix is correct, and all 75/75 cloud batch tests pass with no regressions, so the code looks solid.

One thing we noticed: neither the PR description nor issue #695 includes evidence of this bug actually firing in a real workflow. The "Manual Testing" section references pytest runs, but not an actual run.

Do you have any production logs or workflow state files showing a run that terminated with "Workflow stopped: no loop control token in Step 9 output"? Even a single real example would help confirm the fix addresses what's actually happening in practice vs. a theoretical code path. Issue #695 still has "Vishal to add details" in the title with an empty body — any context you can add there would help too.
target 3/24
Thanks for the thorough review. I actually don't have production logs or a saved .pdd/e2e-fix-state snapshot from a run that ended with the old generic line "Workflow stopped: no loop control token in Step 9 output." The bug was found through code-path review plus flaky local/agentic runs where Step 9 either failed without a token (timeouts / "All agent providers failed") or returned success text without a parseable control token.

What I can point to today is a controlled reproduction. Issue #695 describes Case 1 (provider/timeout → non-token error string → `_classify_step_output` → `None` → old terminal branch) and Case 2 (successful but non-conforming text → `None` → same branch). Those match the branches we fixed. Verification comes from the mocked unit tests for the Step 9 control flow, plus your 75/75 cloud batch run, to guard against regressions.

I agree the gap is observability in the wild: if we see another mid-workflow stop, capturing the GitHub state comment (or local e2e-fix-state JSON) plus the Step 9 console chunk would be ideal, so if it happens again I will record it. As for issue #695, I already explained it in a comment below, detailing the issue.
Unit tests fail because of the issue #903 tests (unrelated to the Step 9 changes).
gltanaka left a comment
Branch is behind main — missing the convergence prompt requirements added in #919.
The two failing tests (test_orchestrator_prompt_contains_empty_dev_units_requirement, test_orchestrator_prompt_contains_file_hash_comparison_requirement) were added to main after this branch forked. The tests exist in the merge commit CI runs against, but the orchestrator prompt changes that satisfy them are not on this branch.
Fix: Merge main into fix/695 (or rebase) and push.
gltanaka left a comment
Thanks for digging into this, Vishal — the issue is real and the analysis in #695 is solid. The two scenarios you identified (provider failure + missing token) are genuine gaps in the Step 9 handling.
That said, since this PR was originally written, main has changed significantly — the 4-tier control token detection (#865, landed March 18) now covers the most common scenario (Case 2: LLM paraphrases the token). Tiers 2-4 (case-insensitive, semantic regex, LLM classification) catch the vast majority of cases where the model emits something meaningful but not an exact token match.
What's still unaddressed on main (and worth fixing):
- **Step 9 provider/timeout failure (`step_success=False`)** — When Step 9 itself fails (timeout, providers down), the output is an error string like `"Timeout"` that won't match any tier. The `else` branch terminates the workflow. Step 1 already has a timeout retry (lines 1217-1233) and Step 2 has a skip-on-timeout (lines 1249-1254), but Step 9 has no equivalent.
- **Tier 4 LLM classification failing silently** — `classify_step_output` has a bare `except Exception: return None` (agentic_common.py:199). If the Gemini Flash call itself times out or rate-limits, the safety net disappears silently. The orchestrator can't distinguish "tier 4 confidently said NONE" from "tier 4 crashed." This compounds with issue 1 — even a successful Step 9 with a paraphrased token can terminate if tiers 1-3 miss and tier 4 is down.

What I'd suggest for a v2:

- Scope the fix to just these two gaps. The 157-line `_handle_step9_control_token` nested function reimplements logic that already works on main (ALL_TESTS_PASS verification, CONTINUE_CYCLE file-hash check, etc.). That duplication makes the code harder to maintain and risks diverging from the existing branches.
- For issue 1: a simple check — if `step_success is False` and `_step9_token is None`, log a warning and break without setting `final_message` (so the outer loop starts the next cycle). Similar pattern to the Step 1 timeout retry.
- For issue 2: have `classify_step_output` return a distinct sentinel (or raise) on exception vs. "confidently NONE", so the caller can distinguish the two. Then the Step 9 `else` branch can retry or continue the cycle on classification failure, while still terminating on a confident "no token found."
- The prompt changes (API mocking best practices in step 11, convergence requirements) look like they belong to #633 — splitting them into their own PR would keep this one focused.
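The sentinel idea for the classifier could look roughly like this; `CLASSIFICATION_ERROR` and the wrapper are hypothetical, not the module's actual API.

```python
# Unique sentinel: distinct from any token string and from None.
CLASSIFICATION_ERROR = object()

def classify_with_sentinel(classifier, output):
    """Distinguish 'confidently no token' (None) from 'classifier crashed'."""
    try:
        return classifier(output)  # token string, or None when no token found
    except Exception:
        return CLASSIFICATION_ERROR

# Caller sketch:
#   token = classify_with_sentinel(tier4_llm_classify, step9_output)
#   if token is CLASSIFICATION_ERROR: retry or continue the cycle
#   elif token is None: terminate with a clear message
```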
The test updates (adjusting call counts, cost assertions) would naturally follow from the simpler fix.
Appreciate the thorough writeup on the issue and the test coverage here — the core insight is right, just needs to be rebased against what main looks like today.
Made-with: Cursor
Thanks for the review, and I agree with the scope.

Greg to review
gltanaka left a comment
Hey Vishal, nice work on the core improvements here — the CLASSIFICATION_ERROR sentinel, using the actual tier-4 token instead of always defaulting to CONTINUE_CYCLE, and the provider-failure handling are all solid changes that should land.
However, the retry block has two functional bugs caused by the duplicated control flow diverging from the original Step 9 path:
1. Missing _handle_verification_failure() on retry verification failure
The original path (line ~1410) calls _handle_verification_failure(verify_output, import_error_retries, console) which detects import errors and sets up verification_failure_context + last_completed_step = 0 for a special retry. The retry path just does last_completed_step = step_num - 1 and falls through — it loses the import error retry logic entirely.
2. Missing _detect_changed_files() check on retry CONTINUE_CYCLE
The original CONTINUE_CYCLE path (line ~1430) checks _detect_changed_files(cwd, cycle_start_hashes) and stops the workflow if no files changed during the cycle (the infinite loop guard). The retry path's CONTINUE_CYCLE branch unconditionally breaks to the next cycle, skipping this guard.
Both of these are the exact kind of divergence that comes from duplicating a complex control flow block. The cleanest fix would be to extract the Step 9 token-handling logic (ALL_TESTS_PASS verification, MAX_CYCLES_REACHED, CONTINUE_CYCLE + file-hash check) into a helper function that both the initial path and retry path call — this was suggested back on March 20 and would prevent these kinds of bugs structurally.
One more small thing: the except Exception: pass around state persistence (~line 1480 in the diff) should at least have a logger.debug(...) so silent state persistence failures are diagnosable.
Made-with: Cursor

Merge conflicts: pdd/prompts/agentic_e2e_fix_orchestrator_python.prompt
Ok, I refactored Step 9 so the one-shot retry and the first run both go through `_resolve_step9_loop_token` / `_apply_step9_resolved_token`, which fixes the missing `_handle_verification_failure` path and the missing `_detect_changed_files` guard on retry CONTINUE_CYCLE. Best-effort state persistence after the Step 9 retry now logs failures with `logger.debug(..., exc_info=True)` instead of a bare `pass`. I also merged main and resolved `agentic_e2e_fix_orchestrator_python.prompt` by keeping the detailed convergence bullets and the Issue #903 wording.
Summary
Improves the robustness of Step 9 (final verification / loop control) in the agentic e2e fix workflow. When _classify_step_output(..., step_num=9) returns None, the orchestrator now distinguishes between transient provider failures and ambiguous but successful outputs: failed Step 9 calls with no token start the next cycle instead of stopping mid‑workflow, while successful but tokenless Step 9 responses trigger a single retry. If the retry still does not emit a valid loop‑control token, the workflow stops with an explicit message instead of a vague “no loop control token” warning. This makes loop behavior resilient to provider timeouts and minor format drift, and reduces confusing early terminations.
Test Results
Unit tests (Step 9 loop‑control logic): PASS
E2E tests (Issue 830 scenarios): PASS
Full e2e suite for this issue: PASS
Manual Testing
Ran `uv run pytest` to validate:
- Step 9 with no loop-control token results in one cycle plus a single Step 9 retry, without starting Cycle 2.

Ran `uv run pytest tests/test_e2e_issue_830_workflow_stall.py::TestStep9NoneTokenHandling -q` to validate:
- Successful Step 9 with no token, even after retry, stops with a clear "did not emit a loop control token after retry" message and does not start Cycle 2.

Verified that in the "missing token" cost scenario, `total_cost` reflects a single cycle plus one Step 9 retry, not an extra full cycle.
Checklist
Fixes #695