Improve one-session sync: better quality, speed, and cost#713
gltanaka merged 23 commits into promptdriven:main
Conversation
Pull request overview
This PR updates PDD’s one-session sync “mega-prompt” to be substantially shorter while aiming to improve result quality, speed, and cost, and documents the --one-session CLI option in the README.
Changes:
- Replaced `pdd/prompts/one_session_agent_LLM.prompt` with a redesigned, shorter one-session sync instruction template.
- Documented `--one-session` / `--no-one-session` behavior and defaults in `README.md`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `pdd/prompts/one_session_agent_LLM.prompt` | Rewrites the one-session agent prompt (example → crash-fix → verify → tests) with new goals/criteria and condensed guidance. |
| `README.md` | Adds CLI option documentation and explains one-session mode defaults and usage. |
auto eval (LLM as a judge) target 3/30

Hi Niti! The 14 CI failures you're seeing were already fixed on
That should clear all 14 failures on re-run.

Target: 4/6
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- `workflow_engine_python.prompt` is a specification for a DAG workflow engine that figures out the right order to run tasks in.
- `workflow_engine.py` is an implementation with 3 bugs: rollback order is an alphabetical descending sort, not a reversal; `get_execution_plan` mutates state; and the cache uses `id()` for hashing rather than content.
- `one_session_agent_LLM_improved.prompt` is the version I am iterating on.
- Analyzed a previous run of original one-session sync on the insert_includes module.
- Based on the original prompt's drawbacks in that case study, made the following improvements to the new version:
  - New "Test plan" paragraph: instructs the agent to use Step 3's requirement list as a checklist, and explicitly add entries for parameter forwarding, branch coverage, return shape, and error cases — before writing any tests.
  - Two new bullet points in the test design principles (lines 73-74): parameter forwarding verification and branch coverage.
  - New organization line (line 76): group tests by requirement, keep imports clean.
  - Reworded mock guidance (line 80): "Mock the function's direct dependencies (other modules it calls), not its own internal logic" — replaces the ambiguous "do not mock internal implementation details or functions."
- Added paragraph requiring labeled sections per feature in the example, with verification that each demo actually exercises its claimed behavior (the broken cache demo fix)
- Step 2 success now requires "visible output" not just exit code 0 (catches silent no-op exits)
- Added "wrong boolean gating" and "does each branch fire only when it should?" to the latent bug checklist (catches the elif verbose bug class)
- Changed "write this list out" to "print this numbered list to stdout" (forces externalization of requirement list)
- Added compound condition boundary testing (each sub-condition failing independently)
- Added boolean parameter False-path testing (verbose=False produces no side effects)
- Added step name reference table
- Explicit "Never modify the prompt file — it is read-only" prohibition
- Added {basename} and {language} to the files block
- Added tmp_path rule for filesystem test isolation
- Restated file modification permissions for the test-fix loop
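The compound-condition and False-path checks listed above can be illustrated with a minimal sketch (the `process` function is hypothetical):

```python
import contextlib
import io

def process(items, verbose=False):
    if verbose and items:  # compound gate: both sub-conditions must hold
        print(f"processing {len(items)} items")
    return [i * 2 for i in items]

# verbose=False must produce no side effects (the False-path test):
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    assert process([1, 2], verbose=False) == [2, 4]
assert buf.getvalue() == ""

# Each sub-condition failing independently keeps the branch silent:
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    process([], verbose=True)
assert buf.getvalue() == ""
```

This is the style of boundary test the prompt now asks for: one assertion per sub-condition, each exercised in isolation.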
- For examples: removed the "demonstrates every requirement" success criterion, replaced with "someone can fully understand how to use the module from reading it".
- For tests: added Z3 formal verification: "For each edge case, decide whether it is better tested with Z3 formal verification or a unit test" (PDD convention).
- Test plan written as a comment block at the top of the test file instead of just stdout (matches generate_test_LLM.prompt's "first part of the test file should be the detailed test plan in comments").
- Improve test quality: require value-checking assertions instead of just type/existence checks, and build the test plan from spec requirements before writing tests.
- Prevent common failures: verify imports actually exist, ban silent fallback stubs, always re-run after fixes.
- Clarify fix logic: the spec is the authority for whether to fix code or tests; don't proceed after a crash-fix failure.
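The value-checking rule can be shown in miniature (the `add` function is hypothetical):

```python
def add(a, b):
    return a + b

result = add(2, 3)
assert result is not None        # weak: existence only
assert isinstance(result, int)   # weak: passes for any wrong int
assert result == 5               # strong: pins the actual value
```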
- Ensure test behavior is the same as the original prompt (only create/modify new tests, not old tests).
- Restructure points for clarity.
- Warn against `\n` in string literals (file-writing tools mangle them, causing 5+ fix loops).
- Add regression test entries to the test plan for bugs fixed during Steps 2-3 (the plan already required tests per requirement, but fixed bugs weren't explicitly called out as needing their own dedicated regression test).
- The final gate now checks test plan completeness — every plan entry must have a corresponding test before reporting done.
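The `\n` hazard can be sketched as follows — a hypothetical illustration of what a double-escaping file-writing tool produces:

```python
intended = "line1\nline2"   # a real newline in the string
mangled = "line1\\nline2"   # backslash + 'n' — what double-escaping yields

assert intended.count("\n") == 1   # splits into two lines
assert "\n" not in mangled         # no newline at all
assert "\\n" in mangled            # a literal backslash-n instead
```

Once a file contains the mangled form, the agent's output no longer matches the spec, which is what triggered the repeated fix loops.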
Remove one_session_agent_LLM_improved.prompt, one_session_agent_LLM_original.prompt, workflow_engine_python.prompt, and workflow_engine.py to keep branch focused on issue-681 changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed 97794ce to 6d19a0a
- `tests/test_one_session_eval.py`, plus new scenarios in `tests/fixtures/one_session_eval`
Fixes:
- Run both prompt variants in parallel using ThreadPoolExecutor instead of sequentially (halves wall-clock time per scenario)
- Replace in-place prompt file swapping with PDD_PATH override dirs so parallel runs don't race on a shared file
- Set PDD_SKIP_UPDATE_CHECK=1 and stdin=DEVNULL to prevent the pdd CLI from blocking on interactive upgrade prompts
- Switch judge prompt construction from str.format() to str.replace() and use llm_invoke(messages=) to avoid KeyError on curly braces in Python code, diffs, and error messages (e.g. {token}, {name})
- Send unified diffs + new files to the judge instead of full file contents, so large modules (130KB+) don't get truncated before the judge can see the test files
- Run the LLM judge even when a variant times out, so the output includes analysis of what went wrong
- Increase default timeout from 300s to 480s (720s for large scenarios)
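The `str.format()` → `str.replace()` fix above can be sketched as follows (the payload and template strings are illustrative): `str.format()` treats every brace in the assembled text as a format field, so payload text containing braces raises `KeyError`.

```python
# Payload containing braces, as Python code or error messages often do:
payload = 'raise ValueError(f"missing {token}")'

# If the payload has already been spliced into the prompt, format() misreads
# its braces as fields and raises KeyError('token'):
staged = "Score this diff:\n" + payload + "\nWinner: {winner}"
try:
    staged.format(winner="new")
    failed = False
except KeyError:
    failed = True
assert failed

# str.replace() never interprets braces, so staged assembly is safe:
template = "Score this diff:\n{diff}\nRespond with JSON."
prompt = template.replace("{diff}", payload)
assert "{token}" in prompt
```

This is why the judge-prompt construction was switched to plain substring replacement.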
Prompt improvements:
- Split test-writing instructions into two paths: "if test file exists" preserves existing tests and matches their style; "if no test file" writes from scratch with class-per-requirement structure
- Update scenario descriptions (S1, S3, S4, S7) to clarify that adding new tests is acceptable — only deleting/rewriting existing passing tests should be penalized
- Improve output formatting
- Redefine the LLM judge prompt to ignore cost and speed when determining the winning prompt
- That folder contains test files that are designed to fail; they are purposely written poorly for one-session sync evaluation
- The pytest.ini approach from earlier didn't work for deeply nested paths, so I removed that and used a conftest instead.
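The non-blocking CLI invocation from the fixes above can be sketched like this — the env var name is from this PR, but the command is a placeholder, not the real `pdd` invocation:

```python
import os
import subprocess

# Inherit the environment and add the skip flag so the CLI never pauses
# to offer an upgrade:
env = {**os.environ, "PDD_SKIP_UPDATE_CHECK": "1"}

proc = subprocess.run(
    ["echo", "pdd sync --one-session"],  # placeholder for the real command
    env=env,
    stdin=subprocess.DEVNULL,  # no tty: interactive prompts fail fast
    capture_output=True,
    text=True,
    timeout=60,
)
assert proc.returncode == 0
```

Closing stdin with `DEVNULL` guarantees the subprocess can never block waiting for keyboard input inside a CI run.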
Greg to review

gltanaka left a comment
Comprehensive Review of PR #713
Niti, this is excellent work — the research-driven approach, the thorough evaluation methodology, and the quality of the writeup are all impressive. This is exactly the kind of rigorous prompt engineering contribution we want to see. Thank you for investing the time to do this properly.
I've done an in-depth analysis of the old vs new prompt, tested the new prompt live on two scenarios, and checked alignment against both our internal prompting guide and Anthropic's latest prompting best practices. Here's the full picture.
Live Testing Results
I cherry-picked the PR onto gltanaka/pdd main and ran two scenarios with the new prompt using Claude as the agent (not Gemini):
Scenario A — Expand Tests (S4 equivalent: stack_calculator)
- Starting state: 3 basic tests, no example
- Result: ✅ Success in 2m21s, $0.74
- Generated 26 tests (from 3), all passing
- Test plan written as comment block at top of test file ✅
- Tests grouped by requirement (10 classes covering all 6 spec requirements)
- Strong value-checking assertions (`== 5.0`, `== 64.0`), not just type checks
- Original 3 tests preserved untouched ✅
- Clean example generated with error handling demos
Scenario B — Greenfield (S2 equivalent: unit_converter)
- Starting state: Code only, no tests, no example
- Result: ✅ Success in 1m46s, $0.59
- Generated 26 tests, all passing
- Test plan as comment block mapping each test to spec requirement numbers ✅
- Uses `pytest.approx` for floating-point comparisons in round-trip tests
- Clean example with `sys.path` setup for portability
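A minimal sketch of the two patterns noted above — the test-plan comment block and tolerant float comparison in round-trip tests. Module name and values are hypothetical, and `math.isclose` stands in for `pytest.approx` to keep the sketch stdlib-only:

```python
# TEST PLAN (mapping tests to spec requirements — illustrative):
#   1. Celsius/Fahrenheit round-trip preserves the input -> test_round_trip
#   2. 0 C converts to 32 F                              -> test_freezing
import math

def c_to_f(c):
    return c * 9 / 5 + 32

def f_to_c(f):
    return (f - 32) * 5 / 9

def test_round_trip():
    # exact == is fragile for floats; use a tolerance instead
    assert math.isclose(f_to_c(c_to_f(37.3)), 37.3, rel_tol=1e-9)

def test_freezing():
    assert c_to_f(0) == 32.0

test_round_trip()
test_freezing()
```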
Both completed on the first pass with no crash-fix or test-fix loops. This also provides some evidence that the prompt works across models (Gemini in your eval, Claude in mine).
Existing test suite: 3,439 passed, 20 skipped, 1 failed (pre-existing agentic_sync.py bug on main, unrelated to this PR).
Alignment with Internal Prompting Guide (docs/prompting_guide.md)
| Guide Principle | Old Prompt | New Prompt |
|---|---|---|
| Positive over negative constraints | Poor (many DO NOT/NEVER) | ✅ Good (mostly positive framing) |
| Prompt brevity / ratio | Poor (440 lines) | ✅ Better (139 lines) |
| Don't include implementation details | Poor (11 isolation rules, auth patterns, mocking tables) | ✅ Removed most |
| Behavioral requirements | Poor (procedural recipes) | ✅ Success criteria per step |
| Chain of thought | Weak (unstructured DIAGNOSE) | ✅ Error/Root cause/Fix target |
| Critical instruction positioning | Poor | Moderate |
| Avoid mega-prompts | Violates | Still violates (by design for --one-session) |
| Reiterate critical constraints | Over-does it (3x) | May under-do it (1x) |
Net: Substantially more aligned with the prompting guide than the old prompt.
Alignment with Anthropic's Latest Prompting Best Practices
Strong alignment:
- ✅ "Tell what to do, not what not to do" — positive framing throughout
- ✅ "Be clear and direct" — success criteria lead each step
- ✅ "Add context to improve performance" — explains why (e.g., the `\n` warning explains that file-writing tools mangle escape sequences)
- ✅ "Prefer general instructions over prescriptive steps" — goals + guidelines instead of rigid pseudocode loops
- ✅ "Structure with XML tags" — `<files>`, `<module_specification>`, `<generated_code>`
- ✅ "Avoid overengineering" — emphasis on minimality
Could improve (not blockers):
- No role assignment sentence (Anthropic recommends even a single sentence)
- No few-shot examples of the new output formats (diagnosis blocks, test plan comments)
- Some remaining CRITICAL/MUST/NEVER language that Anthropic says to dial back for 4.6-era models
- The "never report done if tests fail" instruction appears only once — Anthropic's middle-loss research suggests critical constraints benefit from appearing at beginning AND end
These are nice-to-haves for a follow-up, not blockers.
Key Differences Between Old and New Prompt
The most impactful change is the verify step (Step 3). The old prompt focused on comparing program output to the spec — did the example print the right things? The new prompt focuses on verifying the code implements the spec correctly — reading each requirement, checking the corresponding code, and hunting for latent bugs (off-by-one, wrong boolean gating, aliasing vs copying). This is a meaningful quality improvement.
Other highlights:
- Deduplication: import rules stated 3x → 1x; "never report done" stated 3x → 1x
- New failure-prevention rules grounded in real observed failures (`\n` mangling, `try/except ImportError` stubs)
- Regression test requirement: bugs fixed in Steps 2-3 get dedicated regression tests
- Test plan completeness check in Final Gate
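As an illustration of the banned `try/except ImportError` stub pattern (the module name is hypothetical):

```python
# Silent fallback stub: if the real import fails, a do-nothing stand-in is
# installed and downstream tests pass vacuously instead of failing loudly.
try:
    from nonexistent_module import render  # hypothetical dependency
except ImportError:
    def render(text):
        return text  # stub masks the missing dependency

# The import above fails, so the stub is silently used and every call
# "works" — exactly the failure mode the new prompt prohibits:
assert render("hello") == "hello"
```

The new prompt instead requires imports to resolve for real, so a missing dependency surfaces as an immediate, diagnosable error.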
Issue Found During Testing
In the stack_calculator test, the agent wrote the new test file at the project root (test_stack_calculator.py) rather than in tests/test_stack_calculator.py alongside the existing tests. The original 3 tests in tests/ were preserved, but the 26 new tests ended up in a separate file at a different location.
This could be a prompt issue or an orchestrator issue (depends on what {test_path} resolves to). Could you investigate whether:
- The `{test_path}` template variable was correctly set to `tests/test_stack_calculator.py`
- If so, whether the agent ignored it and chose its own path
- Whether this reproduces in your eval scenarios
If it turns out to be a prompt issue, a small clarification like "Write tests at {test_path} (this exact path, do not create a different file)" might help. If it's an orchestrator issue, we can track it separately.
Summary
This is a high-quality contribution. The prompt is better by every measure we checked — shorter, better-aligned with prompting best practices, more effective in live testing, and grounded in real failure analysis. The evaluation methodology (9 scenarios, LLM-as-judge, hard assertions) sets a great standard for future prompt changes.
Looking forward to getting this merged once the test path issue is investigated! 🎉
I updated the prompt file to prevent chances of the bug appearing again (the agent writing files to incorrect locations). It wasn't an orchestrator issue.

I tried to reproduce the bug you experienced with Gemini 3.1 Pro: I ran 3 runs with the prompt file before these wording changes, but none of the 3 runs recreated the issue. However, it is plausible that the bug can reoccur because of the ambiguous old wording; non-deterministic agent behavior and different providers (Gemini vs. Claude) just make it harder to recreate. To avoid this, I added explicit path references at three locations in the prompt where a file-write instruction didn't explicitly specify the target path:

- Otherwise, write a concise, runnable example.
+ Otherwise, write a concise, runnable example at `{example_path}`.

- Add NEW test functions for those gaps.
+ Add new test functions for those gaps to this same file (`{test_path}`).

- Write a new test file from scratch.
+ Write a new test file at `{test_path}`.

I ran the S4 test again after this update, and files were written to the correct locations.
…iles
During PR review testing, the agent wrote new tests to the project root
instead of the specified {test_path}. The prompt instructions for creating
or appending to files didn't always explicitly state the target path.
Three locations updated:
- Step 1 (example): specify {example_path} when creating new examples
- Step 4b (test exists): clarify new tests go in the same file
- Step 4b (test doesn't exist): specify {test_path} for new test files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed a372139 to cfbfe3a
Merged into main.

What we validated before merging:

Thanks again for the thorough work on this, Niti!
Cherry-picked from #713 for local testing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three locations updated to explicitly specify target paths:
- Step 1: "write a concise, runnable example at {example_path}"
- Step 4b (exists): "Add new tests to this same file ({test_path})"
- Step 4b (new): "Write a new test file at {test_path}"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merges #713 (by niti-go). Rewrites the one-session sync mega-prompt from 440 lines to 144 lines with goal-oriented framing, deduplication, and targeted failure-prevention rules. Adds automated evaluation test suite with 9 scenarios (s1-s9).

Key improvements:
- 67% prompt reduction while winning 8/9 eval scenarios
- Success criteria lead each step instead of rigid pseudocode loops
- Test plan required as comment block at top of test file
- Diagnosis (Error/Root cause/Fix target) required before fixing
- Targeted rules for observed failures (`\n` mangling, stub fallbacks)

Tested with Claude (live) and Gemini 3.1 Pro Preview (eval suite). Cloud batch: 75/75 passed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #681
This PR improves the pdd one-session sync mega-prompt (the instructions given to an LLM agent during `pdd sync --one-session`) and adds an automated evaluation test suite. The new prompt is 144 lines compared to the old version's 440 lines (a 67% reduction), while producing higher-quality results across 8 diverse test scenarios.

Improvements
Better test quality
The new prompt consistently generates better quality test suites. It requires a test plan as a comment block at the top of the test file (derived from the prompt requirement list), requires value-checking assertions (not just type/existence checks), and verifies that every requirement is tested at the end.
During evaluation, I found:
- The old prompt rewrote existing tests (renamed `test_push_and_pop` → `test_push_pop`, changed expressions and assertions). The new prompt kept every original test name and assertion intact, only adding new tests alongside them.
- The new prompt uses `pytest.approx()` for floating-point comparisons and checks specific values. The old prompt sometimes used exact float equality (fragile) and fewer assertions per function.

Evaluation Results Summary
The new prompt won overall quality in 8 out of 9 scenarios, evaluated with Gemini 3.1 Pro Preview. It tied in 1 scenario where both prompts failed due to a provider issue (not a prompt issue).
- The new prompt generated tests using `math.isclose`; the old prompt generated 6 with fragile float `==`.

Note on S8: This scenario involves a ~2500-line DAG workflow engine with state machines, threading, retry logic, and cycle detection. Both prompts struggled here for reasons outside prompt control: on some runs, the provider's response was rejected by PDD's false-positive detection before the agent could start; on other runs, the agent did generate tests (reaching 81% code coverage) but the generated tests themselves hung during execution (likely due to thread/event-loop tests that block). I think that for modules of this complexity, rather than prompt changes, we would need to improve the agentic infrastructure itself, such as adding test execution timeouts or detecting false-positive CLI responses.
Goal-Framing and Success Criteria
The old prompt was procedural — "run X, if Y do Z" — which constrains the agent to a single predetermined sequence. Research on LLM prompting shows that LLMs perform better when given desired outcomes rather than rigid procedures, because they retain the flexibility to adapt their approach to the actual situation.
The new prompt introduces each step by framing the goal and providing success criteria before listing any procedure. Also, the crash-fix and test-fix steps now require the agent to emit a written diagnosis (error, root cause, fix target) before attempting a fix, which improves its reasoning before acting.
During evaluation, the agent made better decisions about what to edit:
- When the example crashed (it passed `strict=True` to a function that doesn't accept it), the new prompt made a simple 1-line fix. The old prompt rewrote the entire example file.

Fewer agent failure modes
I analyzed failure patterns, including wasteful loops and code bugs, in old one-session sync runs and added targeted instructions to the new prompt. For example, the new prompt warns against `\n` inside string literals (file-writing tools mangle them) and prohibits `try/except ImportError` stub fallbacks.

In testing:

- The old prompt's runs produced `\n` characters inside string literals that broke when the file-writing tool mangled them. The new prompt warns against this pattern, and the agent avoided it entirely.
- One old-prompt run wrote `{{token}}` instead of `{token}`, causing error messages to show the literal text `{token}` instead of the variable's value. The new prompt didn't introduce this bug.

What Makes the New Prompt Better
The new prompt is grounded in research on LLM prompt techniques.
How I Iterated on the Prompt File
I started by researching best prompting guidelines for agents. Then I analyzed previous commits from when I was developing the smarter auto-deps feature (PR #63 in pdd_cap). I had used the old one-session sync on the `insert_includes` and `summarize_directory` modules, and I studied the results of those runs to find where it could be improved.

I then ran my new and old prompt versions against the test scenarios below, analyzed where the new version needed improvement, and continued iterating and re-testing.
Testing and Evaluation
I designed 9 diverse syncing scenarios such that the code, examples, and tests are 'out of sync' in different ways. Scenarios 1-6 are small, focused modules. Scenarios 7-9 test against large modules (~2500 lines each) to verify the prompt scales to real-world complexity. I tested both the new and original prompt on each scenario with Gemini 3.1 Pro Preview.
Scenario 1: Everything is already in sync
- `text_stats` module that computes word count, sentence count, and reading time.

Scenario 2: Generate examples and tests from scratch
- `unit_converter` module that converts between Celsius/Fahrenheit, km/miles, and kg/pounds.

Scenario 3: Fix a code bug
- `shopping_cart` module that manages items and applies a 10% discount for subtotals over $100.

Scenario 4: Expand test coverage
- `stack_calculator` module that evaluates math expressions in Reverse Polish Notation.

Scenario 5: Fix a crashing example
- `password_validator` module that checks length, uppercase, digits, special characters and rates strength.

Scenario 6: Fix wrong tests
- `date_range` module that manages date ranges with half-open intervals and supports overlap/containment checks.

Scenario 7: Large code bug
- `llm_client` module (~2500 lines) for calling LLM APIs with retry logic, cost tracking, and model selection. Based on PDD's actual `llm_invoke` module. `total_cost = attempt_cost` should be `total_cost += attempt_cost`, causing costs from retry attempts to be discarded. Existing tests catch this bug. Sync needs to find and fix the one line.

Scenario 8: Large expand test coverage
- `workflow_engine` module (~2500 lines) implementing a DAG-based task executor with state machines, retry logic, cycle detection, budget management, and parallel execution.

Scenario 9: Large wrong tests
- `config_manager` module (~2500 lines) for hierarchical configuration with type validation, env var loading, and JSON import/export. Validation happens on `get()`/`save()`, not on `set()`. The code is correct, but 8 tests wrongly assume eager validation. Sync needs to fix the tests, not the code.

For each scenario and prompt version, I recorded the full `pdd sync` output, the final code, example, and test files produced, the pytest results after sync completed, the example script output after sync completed, and the full Gemini session logs (every thought, tool call, and retry loop).

Then, I compared each output against the reference files to see exactly what the agent changed, whether the changes were appropriate, and whether the agent correctly identified what needed fixing vs. what should be left alone.
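The Scenario 7 cost-accumulation bug can be sketched in isolation — a simplified stand-in for the retry loop inside the `llm_client` fixture:

```python
def total_with_bug(attempt_costs):
    total_cost = 0.0
    for attempt_cost in attempt_costs:
        total_cost = attempt_cost   # bug: overwrites costs of prior attempts
    return total_cost

def total_fixed(attempt_costs):
    total_cost = 0.0
    for attempt_cost in attempt_costs:
        total_cost += attempt_cost  # fix: accumulate across retries
    return total_cost

# With two retry attempts, the buggy version reports only the last cost:
assert total_with_bug([1.0, 2.0]) == 2.0
assert total_fixed([1.0, 2.0]) == 3.0
```

A single character (`=` vs `+=`) separates the two, which is what makes this a good needle-in-a-haystack test for a ~2500-line module.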
I have attached all 6 scenario test results and the full comparison analysis to this PR in `/eval` and `evaluation_report.md`.

eval.zip
Automated Evaluation Test Suite
This PR adds `tests/test_one_session_eval.py`, a test suite that automates prompt comparison. It:

- Runs both prompt variants in parallel with `ThreadPoolExecutor` — each variant gets its own isolated project directory and prompt template (via `PDD_PATH` override), so they don't interfere with each other.
- Collects the resulting files, `pytest --cov` results, and the agent's stdout. It scores on test quality, example quality, minimality, correctness, and overall winner.

The test suite is designed for iterating on the prompt:

- Edit `pdd/prompts/one_session_agent_LLM.prompt`
- Run `PDD_RUN_AGENTIC_TESTS=1 pytest tests/test_one_session_eval.py -k test_s4 -vs` to test a specific scenario

This works with any LLM provider — the scenarios are provider-agnostic, and the suite can be re-run as models evolve to verify the prompt still performs well.
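The parallel variant execution can be sketched as follows — `run_variant` is a placeholder for invoking `pdd sync` in an isolated project directory, not the suite's actual helper:

```python
from concurrent.futures import ThreadPoolExecutor

def run_variant(name):
    # Placeholder for: run `pdd sync --one-session` in an isolated project
    # dir, with a PDD_PATH override pointing at this variant's prompt file.
    return (name, "ok")

# Both variants run concurrently, halving wall-clock time per scenario:
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(run_variant, ("old", "new")))

assert results == {"old": "ok", "new": "ok"}
```

Because each thread works in its own directory and prompt copy, there is no shared file for the two runs to race on.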