Improve one-session sync: better quality, speed, and cost#713
gltanaka merged 23 commits into promptdriven:main
Conversation
Pull request overview
This PR updates PDD’s one-session sync “mega-prompt” to be substantially shorter while aiming to improve result quality, speed, and cost, and documents the --one-session CLI option in the README.
Changes:
- Replaced `pdd/prompts/one_session_agent_LLM.prompt` with a redesigned, shorter one-session sync instruction template.
- Documented `--one-session` / `--no-one-session` behavior and defaults in `README.md`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `pdd/prompts/one_session_agent_LLM.prompt` | Rewrites the one-session agent prompt (example → crash-fix → verify → tests) with new goals/criteria and condensed guidance. |
| `README.md` | Adds CLI option documentation and explains one-session mode defaults and usage. |
auto eval (LLM as a judge) target 3/30

Hi Niti! The 14 CI failures you're seeing were already fixed on
That should clear all 14 failures on re-run.

Target: 4/6
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- `workflow_engine_python.prompt` is a specification for a DAG workflow engine that figures out the right order to run tasks in.
- `workflow_engine.py` is an implementation with 3 bugs: rollback order is an alphabetical descending sort, not a reversal; `get_execution_plan` mutates state; and the cache uses `id()` for hashing rather than content.
- `one_session_agent_LLM_improved.prompt` is the version I am iterating on.
- Analyzed a previous run of original one-session sync on the insert_includes module.
- Based on the original prompt's drawbacks in that case study, made the following improvements to the new version:
  - New "Test plan" paragraph: instructs the agent to use Step 3's requirement list as a checklist, and explicitly add entries for parameter forwarding, branch coverage, return shape, and error cases — before writing any tests.
  - Two new bullet points in the test design principles (lines 73-74): parameter forwarding verification and branch coverage.
  - New organization line (line 76): group tests by requirement, keep imports clean.
  - Reworded mock guidance (line 80): "Mock the function's direct dependencies (other modules it calls), not its own internal logic" — replaces the ambiguous "do not mock internal implementation details or functions."
- Added paragraph requiring labeled sections per feature in the example, with verification that each demo actually exercises its claimed behavior (the broken cache demo fix)
- Step 2 success now requires "visible output" not just exit code 0 (catches silent no-op exits)
- Added "wrong boolean gating" and "does each branch fire only when it should?" to the latent bug checklist (catches the elif verbose bug class)
- Changed "write this list out" to "print this numbered list to stdout" (forces externalization of requirement list)
- Added compound condition boundary testing (each sub-condition failing independently)
- Added boolean parameter False-path testing (verbose=False produces no side effects)
- Added step name reference table
- Explicit "Never modify the prompt file — it is read-only" prohibition
- Added {basename} and {language} to the files block
- Added tmp_path rule for filesystem test isolation
- Restated file modification permissions for the test-fix loop
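The compound-condition and False-path checks listed above can be illustrated with a minimal sketch (the `process` function is hypothetical):

```python
import contextlib
import io

def process(items, verbose=False):
    if verbose and items:  # compound gate: both sub-conditions must hold
        print(f"processing {len(items)} items")
    return [i * 2 for i in items]

# verbose=False must produce no side effects (the False-path test):
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    assert process([1, 2], verbose=False) == [2, 4]
assert buf.getvalue() == ""

# Each sub-condition failing independently keeps the branch silent:
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    process([], verbose=True)
assert buf.getvalue() == ""
```

This is the style of boundary test the prompt now asks for: one assertion per sub-condition, each exercised in isolation.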
- For examples: removed the "demonstrates every requirement" success criterion, replaced with "someone can fully understand how to use the module from reading it".
- For tests: added Z3 formal verification: "For each edge case, decide whether it is better tested with Z3 formal verification or a unit test" (PDD convention).
- Test plan written as a comment block at the top of the test file instead of just stdout (matches generate_test_LLM.prompt's "first part of the test file should be the detailed test plan in comments").
- Improve test quality: require value-checking assertions instead of just type/existence checks, and build the test plan from spec requirements before writing tests.
- Prevent common failures: verify imports actually exist, ban silent fallback stubs, always re-run after fixes.
- Clarify fix logic: the spec is the authority for whether to fix code or tests; don't proceed after a crash-fix failure.
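The value-checking rule can be shown in miniature (the `add` function is hypothetical):

```python
def add(a, b):
    return a + b

result = add(2, 3)
assert result is not None        # weak: existence only
assert isinstance(result, int)   # weak: passes for any wrong int
assert result == 5               # strong: pins the actual value
```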
- Ensure test behavior is the same as the original prompt (only create/modify new tests, not old tests).
- Restructure points for clarity.
- Warn against `\n` in string literals (file-writing tools mangle them, causing 5+ fix loops).
- Add regression test entries to the test plan for bugs fixed during Steps 2-3 (the plan already required tests per requirement, but fixed bugs weren't explicitly called out as needing their own dedicated regression test).
- The final gate now checks test plan completeness — every plan entry must have a corresponding test before reporting done.
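The `\n` hazard can be sketched as follows — a hypothetical illustration of what a double-escaping file-writing tool produces:

```python
intended = "line1\nline2"   # a real newline in the string
mangled = "line1\\nline2"   # backslash + 'n' — what double-escaping yields

assert intended.count("\n") == 1   # splits into two lines
assert "\n" not in mangled         # no newline at all
assert "\\n" in mangled            # a literal backslash-n instead
```

Once a file contains the mangled form, the agent's output no longer matches the spec, which is what triggered the repeated fix loops.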
Remove one_session_agent_LLM_improved.prompt, one_session_agent_LLM_original.prompt, workflow_engine_python.prompt, and workflow_engine.py to keep branch focused on issue-681 changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed 97794ce to 6d19a0a
- `tests/test_one_session_eval.py`, plus new scenarios in `tests/fixtures/one_session_eval`
Fixes:
- Run both prompt variants in parallel using ThreadPoolExecutor instead of sequentially (halves wall-clock time per scenario)
- Replace in-place prompt file swapping with PDD_PATH override dirs so parallel runs don't race on a shared file
- Set PDD_SKIP_UPDATE_CHECK=1 and stdin=DEVNULL to prevent the pdd CLI from blocking on interactive upgrade prompts
- Switch judge prompt construction from str.format() to str.replace() and use llm_invoke(messages=) to avoid KeyError on curly braces in Python code, diffs, and error messages (e.g. {token}, {name})
- Send unified diffs + new files to the judge instead of full file contents, so large modules (130KB+) don't get truncated before the judge can see the test files
- Run the LLM judge even when a variant times out, so the output includes analysis of what went wrong
- Increase default timeout from 300s to 480s (720s for large scenarios)
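The `str.format()` → `str.replace()` fix above can be sketched as follows (the payload and template strings are illustrative): `str.format()` treats every brace in the assembled text as a format field, so payload text containing braces raises `KeyError`.

```python
# Payload containing braces, as Python code or error messages often do:
payload = 'raise ValueError(f"missing {token}")'

# If the payload has already been spliced into the prompt, format() misreads
# its braces as fields and raises KeyError('token'):
staged = "Score this diff:\n" + payload + "\nWinner: {winner}"
try:
    staged.format(winner="new")
    failed = False
except KeyError:
    failed = True
assert failed

# str.replace() never interprets braces, so staged assembly is safe:
template = "Score this diff:\n{diff}\nRespond with JSON."
prompt = template.replace("{diff}", payload)
assert "{token}" in prompt
```

This is why the judge-prompt construction was switched to plain substring replacement.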
Prompt improvements:
- Split test-writing instructions into two paths: "if test file exists" preserves existing tests and matches their style; "if no test file" writes from scratch with class-per-requirement structure
- Update scenario descriptions (S1, S3, S4, S7) to clarify that adding new tests is acceptable — only deleting/rewriting existing passing tests should be penalized
- Improve output formatting
- Redefine the LLM judge prompt to ignore cost and speed when determining the winning prompt
- That folder contains test files that are designed to fail; they are purposely written poorly for one-session sync evaluation
- The pytest.ini approach from earlier didn't work for deeply nested paths, so I removed that and used a conftest instead.
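The non-blocking CLI invocation from the fixes above can be sketched like this — the env var name is from this PR, but the command is a placeholder, not the real `pdd` invocation:

```python
import os
import subprocess

# Inherit the environment and add the skip flag so the CLI never pauses
# to offer an upgrade:
env = {**os.environ, "PDD_SKIP_UPDATE_CHECK": "1"}

proc = subprocess.run(
    ["echo", "pdd sync --one-session"],  # placeholder for the real command
    env=env,
    stdin=subprocess.DEVNULL,  # no tty: interactive prompts fail fast
    capture_output=True,
    text=True,
    timeout=60,
)
assert proc.returncode == 0
```

Closing stdin with `DEVNULL` guarantees the subprocess can never block waiting for keyboard input inside a CI run.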
Greg to review

gltanaka left a comment
Comprehensive Review of PR #713
Niti, this is excellent work — the research-driven approach, the thorough evaluation methodology, and the quality of the writeup are all impressive. This is exactly the kind of rigorous prompt engineering contribution we want to see. Thank you for investing the time to do this properly.
I've done an in-depth analysis of the old vs new prompt, tested the new prompt live on two scenarios, and checked alignment against both our internal prompting guide and Anthropic's latest prompting best practices. Here's the full picture.
Live Testing Results
I cherry-picked the PR onto gltanaka/pdd main and ran two scenarios with the new prompt using Claude as the agent (not Gemini):
Scenario A — Expand Tests (S4 equivalent: stack_calculator)
- Starting state: 3 basic tests, no example
- Result: ✅ Success in 2m21s, $0.74
- Generated 26 tests (from 3), all passing
- Test plan written as comment block at top of test file ✅
- Tests grouped by requirement (10 classes covering all 6 spec requirements)
- Strong value-checking assertions (`== 5.0`, `== 64.0`), not just type checks
- Original 3 tests preserved untouched ✅
- Clean example generated with error handling demos
Scenario B — Greenfield (S2 equivalent: unit_converter)
- Starting state: Code only, no tests, no example
- Result: ✅ Success in 1m46s, $0.59
- Generated 26 tests, all passing
- Test plan as comment block mapping each test to spec requirement numbers ✅
- Uses `pytest.approx` for floating-point comparisons in round-trip tests
- Clean example with `sys.path` setup for portability
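A minimal sketch of the two patterns noted above — the test-plan comment block and tolerant float comparison in round-trip tests. Module name and values are hypothetical, and `math.isclose` stands in for `pytest.approx` to keep the sketch stdlib-only:

```python
# TEST PLAN (mapping tests to spec requirements — illustrative):
#   1. Celsius/Fahrenheit round-trip preserves the input -> test_round_trip
#   2. 0 C converts to 32 F                              -> test_freezing
import math

def c_to_f(c):
    return c * 9 / 5 + 32

def f_to_c(f):
    return (f - 32) * 5 / 9

def test_round_trip():
    # exact == is fragile for floats; use a tolerance instead
    assert math.isclose(f_to_c(c_to_f(37.3)), 37.3, rel_tol=1e-9)

def test_freezing():
    assert c_to_f(0) == 32.0

test_round_trip()
test_freezing()
```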
Both completed on the first pass with no crash-fix or test-fix loops. This also provides some evidence that the prompt works across models (Gemini in your eval, Claude in mine).
Existing test suite: 3,439 passed, 20 skipped, 1 failed (pre-existing agentic_sync.py bug on main, unrelated to this PR).
Alignment with Internal Prompting Guide (docs/prompting_guide.md)
| Guide Principle | Old Prompt | New Prompt |
|---|---|---|
| Positive over negative constraints | Poor (many DO NOT/NEVER) | ✅ Good (mostly positive framing) |
| Prompt brevity / ratio | Poor (440 lines) | ✅ Better (139 lines) |
| Don't include implementation details | Poor (11 isolation rules, auth patterns, mocking tables) | ✅ Removed most |
| Behavioral requirements | Poor (procedural recipes) | ✅ Success criteria per step |
| Chain of thought | Weak (unstructured DIAGNOSE) | ✅ Error/Root cause/Fix target |
| Critical instruction positioning | Poor | Moderate |
| Avoid mega-prompts | Violates | Still violates (by design for --one-session) |
| Reiterate critical constraints | Over-does it (3x) | May under-do it (1x) |
Net: Substantially more aligned with the prompting guide than the old prompt.
Alignment with Anthropic's Latest Prompting Best Practices
Strong alignment:
- ✅ "Tell what to do, not what not to do" — positive framing throughout
- ✅ "Be clear and direct" — success criteria lead each step
- ✅ "Add context to improve performance" — explains why (e.g., the `\n` warning explains that file-writing tools mangle escape sequences)
- ✅ "Prefer general instructions over prescriptive steps" — goals + guidelines instead of rigid pseudocode loops
- ✅ "Structure with XML tags" — `<files>`, `<module_specification>`, `<generated_code>`
- ✅ "Avoid overengineering" — emphasis on minimality
Could improve (not blockers):
- No role assignment sentence (Anthropic recommends even a single sentence)
- No few-shot examples of the new output formats (diagnosis blocks, test plan comments)
- Some remaining CRITICAL/MUST/NEVER language that Anthropic says to dial back for 4.6-era models
- The "never report done if tests fail" instruction appears only once — Anthropic's middle-loss research suggests critical constraints benefit from appearing at beginning AND end
These are nice-to-haves for a follow-up, not blockers.
Key Differences Between Old and New Prompt
The most impactful change is the verify step (Step 3). The old prompt focused on comparing program output to the spec — did the example print the right things? The new prompt focuses on verifying the code implements the spec correctly — reading each requirement, checking the corresponding code, and hunting for latent bugs (off-by-one, wrong boolean gating, aliasing vs copying). This is a meaningful quality improvement.
Other highlights:
- Deduplication: import rules stated 3x → 1x; "never report done" stated 3x → 1x
- New failure-prevention rules grounded in real observed failures (`\n` mangling, `try/except ImportError` stubs)
- Regression test requirement: bugs fixed in Steps 2-3 get dedicated regression tests
- Test plan completeness check in Final Gate
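As an illustration of the banned `try/except ImportError` stub pattern (the module name is hypothetical):

```python
# Silent fallback stub: if the real import fails, a do-nothing stand-in is
# installed and downstream tests pass vacuously instead of failing loudly.
try:
    from nonexistent_module import render  # hypothetical dependency
except ImportError:
    def render(text):
        return text  # stub masks the missing dependency

# The import above fails, so the stub is silently used and every call
# "works" — exactly the failure mode the new prompt prohibits:
assert render("hello") == "hello"
```

The new prompt instead requires imports to resolve for real, so a missing dependency surfaces as an immediate, diagnosable error.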
Issue Found During Testing
In the stack_calculator test, the agent wrote the new test file at the project root (test_stack_calculator.py) rather than in tests/test_stack_calculator.py alongside the existing tests. The original 3 tests in tests/ were preserved, but the 26 new tests ended up in a separate file at a different location.
This could be a prompt issue or an orchestrator issue (depends on what {test_path} resolves to). Could you investigate whether:
- The `{test_path}` template variable was correctly set to `tests/test_stack_calculator.py`
- If so, whether the agent ignored it and chose its own path
- Whether this reproduces in your eval scenarios
If it turns out to be a prompt issue, a small clarification like "Write tests at {test_path} (this exact path, do not create a different file)" might help. If it's an orchestrator issue, we can track it separately.
Summary
This is a high-quality contribution. The prompt is better by every measure we checked — shorter, better-aligned with prompting best practices, more effective in live testing, and grounded in real failure analysis. The evaluation methodology (9 scenarios, LLM-as-judge, hard assertions) sets a great standard for future prompt changes.
Looking forward to getting this merged once the test path issue is investigated! 🎉
I updated the prompt file to prevent chances of the bug appearing again (the agent writing files to incorrect locations). It wasn't an orchestrator issue.

I tried to reproduce the bug you experienced with Gemini 3.1 Pro: I ran 3 runs with the prompt file before these wording changes, but none of the 3 runs recreated the issue. However, it is plausible that the bug can reoccur because of the ambiguous old wording; non-deterministic agent behavior and different providers (Gemini vs. Claude) just make it harder to recreate. To avoid this, I added explicit path references at three locations in the prompt where a file-write instruction didn't explicitly specify the target path:

- Otherwise, write a concise, runnable example.
+ Otherwise, write a concise, runnable example at `{example_path}`.

- Add NEW test functions for those gaps.
+ Add new test functions for those gaps to this same file (`{test_path}`).

- Write a new test file from scratch.
+ Write a new test file at `{test_path}`.

I ran the S4 test again after this update, and files were written to the correct locations.
…iles
During PR review testing, the agent wrote new tests to the project root
instead of the specified {test_path}. The prompt instructions for creating
or appending to files didn't always explicitly state the target path.
Three locations updated:
- Step 1 (example): specify {example_path} when creating new examples
- Step 4b (test exists): clarify new tests go in the same file
- Step 4b (test doesn't exist): specify {test_path} for new test files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed a372139 to cfbfe3a
Merged into main.

What we validated before merging:

Thanks again for the thorough work on this, Niti!
Cherry-picked from #713 for local testing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three locations updated to explicitly specify target paths:
- Step 1: "write a concise, runnable example at {example_path}"
- Step 4b (exists): "Add new tests to this same file ({test_path})"
- Step 4b (new): "Write a new test file at {test_path}"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merges #713 (by niti-go). Rewrites the one-session sync mega-prompt from 440 lines to 144 lines with goal-oriented framing, deduplication, and targeted failure-prevention rules. Adds automated evaluation test suite with 9 scenarios (s1-s9).

Key improvements:
- 67% prompt reduction while winning 8/9 eval scenarios
- Success criteria lead each step instead of rigid pseudocode loops
- Test plan required as comment block at top of test file
- Diagnosis (Error/Root cause/Fix target) required before fixing
- Targeted rules for observed failures (`\n` mangling, stub fallbacks)

Tested with Claude (live) and Gemini 3.1 Pro Preview (eval suite). Cloud batch: 75/75 passed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #681
This PR improves the pdd one-session sync mega-prompt (the instructions given to an LLM agent during `pdd sync --one-session`) and adds an automated evaluation test suite. The new prompt is 144 lines compared to the old version's 440 lines (a 67% reduction), while producing higher-quality results across 8 diverse test scenarios.

Improvements
Better test quality
The new prompt consistently generates better quality test suites. It requires a test plan as a comment block at the top of the test file (derived from the prompt requirement list), requires value-checking assertions (not just type/existence checks), and verifies that every requirement is tested at the end.
During evaluation, I found:
- The old prompt rewrote existing tests (renamed `test_push_and_pop` → `test_push_pop`, changed expressions and assertions). The new prompt kept every original test name and assertion intact, only adding new tests alongside them.
- The new prompt uses `pytest.approx()` for floating-point comparisons and checks specific values. The old prompt sometimes used exact float equality (fragile) and fewer assertions per function.

Evaluation Results Summary
The new prompt won overall quality in 8 out of 9 scenarios, evaluated with Gemini 3.1 Pro Preview. It tied in 1 scenario where both prompts failed due to a provider issue (not a prompt issue).
- The new prompt generated tests using `math.isclose`; the old prompt generated 6 with fragile float `==`.

Note on S8: This scenario involves a ~2500-line DAG workflow engine with state machines, threading, retry logic, and cycle detection. Both prompts struggled here for reasons outside prompt control: on some runs, the provider's response was rejected by PDD's false-positive detection before the agent could start; on other runs, the agent did generate tests (reaching 81% code coverage) but the generated tests themselves hung during execution (likely due to thread/event-loop tests that block). I think that for modules of this complexity, rather than prompt changes, we would need to improve the agentic infrastructure itself, such as adding test execution timeouts or detecting false-positive CLI responses.
Goal-Framing and Success Criteria
The old prompt was procedural — "run X, if Y do Z" — which constrains the agent to a single predetermined sequence. Research on LLM prompting shows that LLMs perform better when given desired outcomes rather than rigid procedures, because they retain the flexibility to adapt their approach to the actual situation.
The new prompt introduces each step by framing the goal and providing success criteria before listing any procedure. Also, the crash-fix and test-fix steps now require the agent to emit a written diagnosis (error, root cause, fix target) before attempting a fix, which improves its reasoning before acting.
During evaluation, the agent made better decisions about what to edit:
- When the example crashed (it passed `strict=True` to a function that doesn't accept it), the new prompt made a simple 1-line fix. The old prompt rewrote the entire example file.

Fewer agent failure modes
I analyzed failure patterns, including wasteful loops and code bugs, in old one-session sync runs and added targeted instructions to the new prompt. For example, the new prompt warns against `\n` inside string literals (file-writing tools mangle them) and prohibits `try/except ImportError` stub fallbacks.

In testing:

- The old prompt's runs produced `\n` characters inside string literals that broke when the file-writing tool mangled them. The new prompt warns against this pattern, and the agent avoided it entirely.
- One old-prompt run wrote `{{token}}` instead of `{token}`, causing error messages to show the literal text `{token}` instead of the variable's value. The new prompt didn't introduce this bug.

What Makes the New Prompt Better
The new prompt is grounded in research on LLM prompt techniques.
How I Iterated on the Prompt File
I started by researching best prompting guidelines for agents. Then I analyzed previous commits from when I was developing the smarter auto-deps feature (PR #63 in pdd_cap). I had used the old one-session sync on the `insert_includes` and `summarize_directory` modules, and I studied the results of those runs to find where it could be improved.

I then ran my new and old prompt versions against the test scenarios below, analyzed where the new version needed improvement, and continued iterating and re-testing.
Testing and Evaluation
I designed 9 diverse syncing scenarios such that the code, examples, and tests are 'out of sync' in different ways. Scenarios 1-6 are small, focused modules. Scenarios 7-9 test against large modules (~2500 lines each) to verify the prompt scales to real-world complexity. I tested both the new and original prompt on each scenario with Gemini 3.1 Pro Preview.
Scenario 1: Everything is already in sync
- `text_stats` module that computes word count, sentence count, and reading time.

Scenario 2: Generate examples and tests from scratch
- `unit_converter` module that converts between Celsius/Fahrenheit, km/miles, and kg/pounds.

Scenario 3: Fix a code bug
- `shopping_cart` module that manages items and applies a 10% discount for subtotals over $100.

Scenario 4: Expand test coverage
- `stack_calculator` module that evaluates math expressions in Reverse Polish Notation.

Scenario 5: Fix a crashing example
- `password_validator` module that checks length, uppercase, digits, special characters and rates strength.

Scenario 6: Fix wrong tests
- `date_range` module that manages date ranges with half-open intervals and supports overlap/containment checks.

Scenario 7: Large code bug
- `llm_client` module (~2500 lines) for calling LLM APIs with retry logic, cost tracking, and model selection. Based on PDD's actual `llm_invoke` module. `total_cost = attempt_cost` should be `total_cost += attempt_cost`, causing costs from retry attempts to be discarded. Existing tests catch this bug. Sync needs to find and fix the one line.

Scenario 8: Large expand test coverage
- `workflow_engine` module (~2500 lines) implementing a DAG-based task executor with state machines, retry logic, cycle detection, budget management, and parallel execution.

Scenario 9: Large wrong tests
- `config_manager` module (~2500 lines) for hierarchical configuration with type validation, env var loading, and JSON import/export. Validation happens on `get()`/`save()`, not on `set()`. The code is correct, but 8 tests wrongly assume eager validation. Sync needs to fix the tests, not the code.

For each scenario and prompt version, I recorded the full `pdd sync` output, the final code, example, and test files produced, the pytest results after sync completed, the example script output after sync completed, and the full Gemini session logs (every thought, tool call, and retry loop).

Then, I compared each output against the reference files to see exactly what the agent changed, whether the changes were appropriate, and whether the agent correctly identified what needed fixing vs. what should be left alone.
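The Scenario 7 cost-accumulation bug can be sketched in isolation — a simplified stand-in for the retry loop inside the `llm_client` fixture:

```python
def total_with_bug(attempt_costs):
    total_cost = 0.0
    for attempt_cost in attempt_costs:
        total_cost = attempt_cost   # bug: overwrites costs of prior attempts
    return total_cost

def total_fixed(attempt_costs):
    total_cost = 0.0
    for attempt_cost in attempt_costs:
        total_cost += attempt_cost  # fix: accumulate across retries
    return total_cost

# With two retry attempts, the buggy version reports only the last cost:
assert total_with_bug([1.0, 2.0]) == 2.0
assert total_fixed([1.0, 2.0]) == 3.0
```

A single character (`=` vs `+=`) separates the two, which is what makes this a good needle-in-a-haystack test for a ~2500-line module.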
I have attached all 6 scenario test results and the full comparison analysis to this PR in `/eval` and `evaluation_report.md`.

eval.zip
Automated Evaluation Test Suite
This PR adds `tests/test_one_session_eval.py`, a test suite that automates prompt comparison. It:

- Runs both prompt variants in parallel with `ThreadPoolExecutor` — each variant gets its own isolated project directory and prompt template (via `PDD_PATH` override), so they don't interfere with each other.
- Collects the resulting files, `pytest --cov` results, and the agent's stdout. It scores on test quality, example quality, minimality, correctness, and overall winner.

The test suite is designed for iterating on the prompt:

- Edit `pdd/prompts/one_session_agent_LLM.prompt`
- Run `PDD_RUN_AGENTIC_TESTS=1 pytest tests/test_one_session_eval.py -k test_s4 -vs` to test a specific scenario

This works with any LLM provider — the scenarios are provider-agnostic, and the suite can be re-run as models evolve to verify the prompt still performs well.
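The parallel variant execution can be sketched as follows — `run_variant` is a placeholder for invoking `pdd sync` in an isolated project directory, not the suite's actual helper:

```python
from concurrent.futures import ThreadPoolExecutor

def run_variant(name):
    # Placeholder for: run `pdd sync --one-session` in an isolated project
    # dir, with a PDD_PATH override pointing at this variant's prompt file.
    return (name, "ok")

# Both variants run concurrently, halving wall-clock time per scenario:
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(run_variant, ("old", "new")))

assert results == {"old": "ok", "new": "ok"}
```

Because each thread works in its own directory and prompt copy, there is no shared file for the two runs to race on.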