
Improve one-session sync: better quality, speed, and cost#713

Merged
gltanaka merged 23 commits into promptdriven:main from niti-go:change/issue-681
Apr 9, 2026

Conversation

@niti-go
Contributor

@niti-go niti-go commented Mar 23, 2026

Closes #681

This PR improves the pdd one-session sync mega-prompt (the instructions given to an LLM agent during pdd sync --one-session) and adds an automated evaluation test suite. The new prompt is 144 lines compared to the old version's 440 lines (a 67% reduction), while producing higher-quality results across 8 diverse test scenarios.

Improvements

Better test quality

The new prompt consistently generates better quality test suites. It requires a test plan as a comment block at the top of the test file (derived from the prompt requirement list), requires value-checking assertions (not just type/existence checks), and verifies that every requirement is tested at the end.
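That convention can be sketched as follows (a minimal illustration; the module and requirement names are invented, not from the PR's fixtures):

```python
# Test plan (comment block at the top of the test file, one entry per
# spec requirement -- names below are illustrative only):
#   R1: celsius_to_fahrenheit(0) == 32.0      -> test_freezing_point
#   R2: celsius_to_fahrenheit(100) == 212.0   -> test_boiling_point

def celsius_to_fahrenheit(c):
    # Inlined here so the sketch is self-contained; in a real run this
    # would be imported from the module under test.
    return c * 9 / 5 + 32

def test_freezing_point():
    # Value-checking assertion, not just an isinstance()/is-not-None check.
    assert celsius_to_fahrenheit(0) == 32.0

def test_boiling_point():
    assert celsius_to_fahrenheit(100) == 212.0
```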

During evaluation, I found:

  • More tests: In a scenario where the agent needed to expand a minimal 3-test file, the new prompt produced 17 tests covering every requirement, while the old produced 10 tests with coverage gaps in subtraction, multiplication, invalid tokens, and empty expressions.
  • Better coverage: In another scenario generating tests from scratch for a password validator, the new prompt generated 16 tests (covering each validation check individually, all strength levels, edge cases like empty passwords) vs. the old prompt's 7 tests.
  • Preserves existing tests: The old prompt renamed and rewrote all 3 original tests in one scenario (e.g. test_push_and_pop → test_push_pop, with changed expressions and assertions). The new prompt kept every original test name and assertion intact, only adding new tests alongside them.
  • Follows PDD conventions: The new prompt writes a test plan as a comment block at the top of the test file. The old prompt does not.
  • Stronger assertions: The new prompt uses pytest.approx() for floating-point comparisons and checks specific values. The old prompt sometimes used exact float equality (fragile) and fewer assertions per function.
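The assertion-style difference can be illustrated like this (the conversion function is invented for the sketch):

```python
import pytest

def km_to_miles(km):
    # Illustrative conversion helper, not the PR's actual module.
    return km * 0.621371

# Fragile (old prompt's style): exact float equality can fail on rounding.
#   assert km_to_miles(5) == 3.106855

# Robust (new prompt's style): tolerance-based comparison.
def test_km_to_miles():
    assert km_to_miles(5) == pytest.approx(3.106855)
```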

Evaluation Results Summary

The new prompt won on overall quality in 8 of 9 scenarios, as judged with Gemini 3.1 Pro Preview. The remaining scenario was a tie in which both prompts failed due to a provider issue (not a prompt issue).

| Scenario | Winner | Key Result |
|---|---|---|
| S1: All in sync | New | New made zero changes; old rewrote tests (26→18) |
| S2: Generate from scratch | New | New generated 12 tests with math.isclose; old generated 6 with fragile float == |
| S3: Fix a code bug | New | New fixed 1 line; old also rewrote tests (18→13) and example |
| S4: Expand test coverage | New | New preserved 3 original tests + added 19; old rewrote them |
| S5: Fix crashing example | New | New made 1-line fix + generated 16 tests; old rewrote entire example |
| S6: Fix wrong tests | New | New fixed all 6 wrong tests + added 1 new; old fixed 6 but added none |
| S7: Large code bug (2500 LOC) | New | New fixed bug + preserved tests; old timed out at 720s |
| S8: Large expand tests (2500 LOC) | Tie | Both failed (see note below) |
| S9: Large wrong tests (2500 LOC) | New | New fixed 8 wrong tests + added 76 new; old deleted tests (59→22) |

Note on S8: This scenario involves a ~2500-line DAG workflow engine with state machines, threading, retry logic, and cycle detection. Both prompts struggled here for reasons outside prompt control: on some runs, the provider's response was rejected by PDD's false-positive detection before the agent could start; on other runs, the agent did generate tests (reaching 81% code coverage) but the generated tests themselves hung during execution (likely due to thread/event-loop tests that block). I think that for modules of this complexity, prompt changes alone won't help; we would need to improve the agentic infrastructure itself, for example by adding test execution timeouts or detecting false-positive CLI responses.

Goal-Framing and Success Criterion

The old prompt was procedural — "run X, if Y do Z" — which constrains the agent to a single predetermined sequence. Research on LLM prompting shows that LLMs perform better when given desired outcomes rather than rigid procedures, because they retain the flexibility to adapt their approach to the actual situation.

The new prompt introduces each step by framing the goal and stating success criteria before listing any procedure. The crash-fix and test-fix steps also now require the agent to emit a written diagnosis (error, root cause, fix target) before attempting a fix, which pushes the agent to reason before acting.

During evaluation, the agent made better decisions about what to edit:

  • When everything was already in sync, the new prompt left the example file unchanged (correct). The old prompt unnecessarily rewrote it with different sample text.
  • When an example crashed due to a bad argument (strict=True on a function that doesn't accept it), the new prompt made a simple 1-line fix. The old prompt rewrote the entire example file.

Fewer agent failure modes

I analyzed failure patterns including wasteful loops and code bugs in old one-session sync runs to add targeted instructions to the new prompt. For example, the new prompt warns against \n inside string literals (file-writing tools mangle them) and prohibits try/except ImportError stub fallbacks.

In testing:

  • No crash-fix loops: The old prompt's agent got stuck in a 5+ iteration loop trying to fix \n characters inside string literals that broke when the file-writing tool mangled them. The new prompt warns against this pattern, and the agent avoided it entirely.
  • No code bugs introduced: The old prompt's agent introduced an f-string escaping bug in one scenario ({{token}} instead of {token}, causing error messages to show the literal text {token} instead of the variable's value). The new prompt didn't introduce this bug.
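That f-string bug is easy to reproduce in isolation (a minimal sketch, not the PR's actual code):

```python
token = "@@"

# Buggy: doubled braces are an *escape* inside f-strings, so the literal
# text "{token}" ends up in the message instead of the variable's value.
msg_bad = f"Invalid token: {{token}}"

# Correct: single braces interpolate the variable.
msg_good = f"Invalid token: {token}"

print(msg_bad)   # Invalid token: {token}
print(msg_good)  # Invalid token: @@
```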

What Makes the New Prompt Better

The new prompt is grounded in research on LLM prompt techniques.

  • Fewer instructions: The old prompt repeated itself a lot. For example, import rules were stated 3+ times. Research shows that LLM accuracy drops linearly as you add more instructions.
  • Goals instead of procedures: The old prompt was step-by-step recipes — "Run X, if Y do Z." The new prompt leads each step by framing the goal and stating success criteria, then gives guidelines. This lets the agent adapt instead of rigidly following a script.
  • Positive over negative instructions: The old prompt had many "Do NOT" and "NEVER" rules. The new prompt reframes these where possible.
  • Fewer wasted tokens: The old prompt spent tokens on things that don't help the agent, like documentation-style headings and telling the agent to "read" files that are already in context. The new prompt removes all of this.

How I Iterated on the Prompt File

I started by researching best-practice prompting guidelines for agents. Then, I analyzed previous commits from when I was developing the smarter auto-deps feature (PR #63 in pdd_cap). I had used the old one-session sync on the insert_includes and summarize_directory modules, and I studied the results of those runs to find where it could be improved.

I then ran my new and old prompt versions against the test scenarios below, analyzed where the new version needed improvement, and continued iterating and re-testing.

Testing and Evaluation

I designed 9 diverse syncing scenarios such that the code, examples, and tests are 'out of sync' in different ways. Scenarios 1-6 are small, focused modules. Scenarios 7-9 test against large modules (~2500 lines each) to verify the prompt scales to real-world complexity. I tested both the new and original prompt on each scenario with Gemini 3.1 Pro Preview.

Scenario 1: Everything is already in sync

  • A text_stats module that computes word count, sentence count, and reading time.
  • The prompt file, code, example, and 26 passing tests are already in sync at the start. Running sync should make minimal changes.

Scenario 2: Generate examples and tests from scratch

  • A unit_converter module that converts between Celsius/Fahrenheit, km/miles, and kg/pounds.
  • The code and prompt file exist and are already in sync. Sync needs to generate the example and test files.

Scenario 3: Fix a code bug

  • A shopping_cart module that manages items and applies a 10% discount for subtotals over $100.
  • The test file has some failing tests that catch a real bug in the code. Sync needs to fix the code to pass these tests.
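The discount rule at the heart of this scenario can be sketched as follows (illustrative only, not the fixture's actual code):

```python
def cart_total(subtotal):
    # A 10% discount applies only when the subtotal exceeds $100;
    # a subtotal of exactly $100 gets no discount.
    if subtotal > 100:
        return round(subtotal * 0.9, 2)
    return subtotal

print(cart_total(150))  # 135.0
print(cart_total(100))  # 100
```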

Scenario 4: Expand test coverage

  • A stack_calculator module that evaluates math expressions in Reverse Polish Notation.
  • The code is correct but only 3 basic tests exist. Sync needs to add comprehensive tests and they should pass at the end.
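The Reverse Polish Notation behavior the expanded tests must cover can be sketched as (an illustrative evaluator, not the fixture module itself):

```python
def eval_rpn(expression):
    # Minimal RPN evaluator: operands push onto a stack, operators pop two.
    stack = []
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    for token in expression.split():
        if token in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[token](a, b))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("invalid expression")
    return stack[0]

print(eval_rpn("3 4 +"))            # 7.0
print(eval_rpn("5 1 2 + 4 * + 3 -"))  # 14.0
```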

Scenario 5: Fix a crashing example

  • A password_validator module that checks length, uppercase, digits, and special characters, and rates strength.
  • The example exists but crashes due to a bad function argument. No tests exist. Sync needs to fix the example and generate tests.

Scenario 6: Fix wrong tests

  • A date_range module that manages date ranges with half-open intervals and supports overlap/containment checks.
  • The code is already correct and aligned with the prompt, but the tests fail because they don't match the prompt. Sync needs to fix the tests, not the code.
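The half-open-interval semantics the wrong tests violated can be sketched as (a hypothetical helper, not the fixture's API):

```python
from datetime import date

def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals [start, end): an interval excludes its end date,
    # so ranges that merely touch at a boundary do not overlap.
    return a_start < b_end and b_start < a_end

jan = (date(2026, 1, 1), date(2026, 2, 1))
feb = (date(2026, 2, 1), date(2026, 3, 1))
print(overlaps(*jan, *feb))  # False: [Jan 1, Feb 1) and [Feb 1, Mar 1) only touch
```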

Scenario 7: Large code bug

  • An llm_client module (~2500 lines) for calling LLM APIs with retry logic, cost tracking, and model selection. Based on PDD's actual llm_invoke module.
  • The code has a single-character bug: total_cost = attempt_cost should be total_cost += attempt_cost, causing costs from retry attempts to be discarded. Existing tests catch this bug. Sync needs to find and fix the one line.
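The bug class is simple to illustrate (a sketch with invented names; the real module is ~2500 lines):

```python
def total_retry_cost(attempt_costs, buggy=False):
    # Accumulate API cost across retry attempts.
    total_cost = 0.0
    for attempt_cost in attempt_costs:
        if buggy:
            total_cost = attempt_cost   # '=' keeps only the last attempt's cost
        else:
            total_cost += attempt_cost  # '+=' accumulates all attempts
    return total_cost

print(total_retry_cost([1.0, 2.0, 0.5]))              # 3.5
print(total_retry_cost([1.0, 2.0, 0.5], buggy=True))  # 0.5
```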

Scenario 8: Large expand test coverage

  • A workflow_engine module (~2500 lines) implementing a DAG-based task executor with state machines, retry logic, cycle detection, budget management, and parallel execution.
  • Only 5 basic tests exist covering ~20% of requirements. Sync needs to add comprehensive tests.

Scenario 9: Large wrong tests

  • A config_manager module (~2500 lines) for hierarchical configuration with type validation, env var loading, and JSON import/export.
  • The spec says lazy validation (validate on get()/save(), not on set()). The code is correct, but 8 tests wrongly assume eager validation. Sync needs to fix the tests, not the code.
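The lazy-vs-eager distinction the wrong tests missed can be sketched as (a hypothetical minimal class, not the fixture's API):

```python
class LazyConfig:
    """Lazy validation: set() accepts any value; get() validates on read."""

    def __init__(self, schema):
        self._schema = schema   # key -> expected type
        self._values = {}

    def set(self, key, value):
        self._values[key] = value   # no validation here (lazy)

    def get(self, key):
        value = self._values[key]
        expected = self._schema[key]
        if not isinstance(value, expected):
            raise TypeError(f"{key}: expected {expected.__name__}")
        return value

cfg = LazyConfig({"port": int})
cfg.set("port", "8080")   # succeeds -- an eager-validation test would wrongly expect a raise here
# cfg.get("port")         # this is where the TypeError is actually raised
```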

For each scenario and prompt version, I recorded the full pdd sync output, the final code, example, and test files produced, the pytest results after sync completed, the example script output after sync completed, and the full Gemini session logs (every thought, tool call, and retry loop).

Then, I compared each output against the reference files to see exactly what the agent changed, whether the changes were appropriate, and whether the agent correctly identified what needed fixing vs. what should be left alone.

I have attached all 6 scenario test results and the full comparison analysis to this PR in /eval and evaluation_report.md.
eval.zip

Automated Evaluation Test Suite

This PR adds tests/test_one_session_eval.py, a test suite that automates prompt comparison. It:

  1. Runs both prompt versions in parallel against each scenario using ThreadPoolExecutor — each variant gets its own isolated project directory and prompt template (via PDD_PATH override), so they don't interfere with each other.
  2. Uses an LLM-as-judge to compare the outputs side by side. The judge receives unified diffs (what each agent changed), the full content of newly created files, actual code coverage percentages from pytest --cov, and the agent's stdout. It scores on test quality, example quality, minimality, correctness, and overall winner.
  3. Hard assertions catch objective failures: sync must exit 0, specific bugs must be fixed, test counts must meet thresholds, correct files must not be modified.
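The parallel-run pattern in step 1 above can be sketched as follows (run_variant is a hypothetical stand-in for launching pdd sync in an isolated project directory):

```python
from concurrent.futures import ThreadPoolExecutor

def run_variant(variant):
    # Stand-in for: create an isolated project dir + PDD_PATH override,
    # launch `pdd sync --one-session`, and collect its outputs.
    return {"variant": variant, "exit_code": 0}

# Both prompt variants run concurrently, halving wall-clock time per scenario.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {v: pool.submit(run_variant, v) for v in ("new", "old")}
    results = {v: f.result() for v, f in futures.items()}

print(sorted(results))  # ['new', 'old']
```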

The test suite is designed for iterating on the prompt:

  • Edit pdd/prompts/one_session_agent_LLM.prompt
  • Run PDD_RUN_AGENTIC_TESTS=1 pytest tests/test_one_session_eval.py -k test_s4 -vs to test a specific scenario
  • The output shows which version won on each dimension, what the agent did wrong, and specific suggestions for prompt improvements

This works with any LLM provider — the scenarios are provider-agnostic, and the suite can be re-run as models evolve to verify the prompt still performs well.

Contributor

Copilot AI left a comment


Pull request overview

This PR updates PDD’s one-session sync “mega-prompt” to be substantially shorter while aiming to improve result quality, speed, and cost, and documents the --one-session CLI option in the README.

Changes:

  • Replaced pdd/prompts/one_session_agent_LLM.prompt with a redesigned, shorter one-session sync instruction template.
  • Documented --one-session / --no-one-session behavior and defaults in README.md.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| pdd/prompts/one_session_agent_LLM.prompt | Rewrites the one-session agent prompt (example → crash-fix → verify → tests) with new goals/criteria and condensed guidance. |
| README.md | Adds CLI option documentation and explains one-session mode defaults and usage. |


Comment thread pdd/prompts/one_session_agent_LLM.prompt
Comment thread pdd/prompts/one_session_agent_LLM.prompt Outdated
@gltanaka
Contributor

auto eval (LLM as a judge)
gemini 3.0+
original testcase from prior one session
larger test cases

target 3/30

@gltanaka
Contributor

Hi Niti! The 14 CI failures you're seeing were already fixed on main in commit 94da15e (#942). Your branch just needs to pick up that fix. Please rebase onto main:

git fetch origin
git rebase origin/main
git push --force-with-lease

That should clear all 14 failures on re-run.

@niti-go
Contributor Author

niti-go commented Mar 29, 2026

Target: 4/6

niti-go and others added 13 commits April 3, 2026 14:54
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- workflow_engine_python.prompt is a specification for a DAG workflow engine that figures out the right order to run tasks in
- workflow_engine.py is an implementation with 3 bugs: rollback order is alphabetical descending sort, not a reversal; get_execution_plan mutates state, and cache uses id() for hashing rather than content.
- one_session_agent_LLM_improved.prompt is the version I am iterating on.
- Analyzed a previous run of original one-session sync on the insert_includes module.
- Based on the original prompt's drawbacks in that case study, made the following improvements to the new version:
- New "Test plan" paragraph: Instructs the agent to use Step 3's requirement list as a checklist, and explicitly add entries for parameter forwarding, branch coverage, return shape, and error cases — before writing any tests.
- Two new bullet points in the test design principle (lines 73-74): parameter forwarding verification and branch coverage.
- New organization line (line 76): Group tests by requirement, keep imports clean.
- Reworded mock guidance (line 80): "Mock the function's direct dependencies (other modules it calls), not its own internal logic" — replaces the ambiguous "do not mock internal implementation details or functions."
- Added paragraph requiring labeled sections per feature in the example, with verification that each demo actually exercises its claimed behavior (the broken cache demo fix)
- Step 2 success now requires "visible output" not just exit code 0 (catches silent no-op exits)
- Added "wrong boolean gating" and "does each branch fire only when it should?" to the latent bug checklist (catches the elif verbose bug class)
- Changed "write this list out" to "print this numbered list to stdout" (forces externalization of requirement list)
- Added compound condition boundary testing (each sub-condition failing independently)
- Added boolean parameter False-path testing (verbose=False produces no side effects)
- Added step name reference table
- Explicit "Never modify the prompt file — it is read-only" prohibition
- Added {basename} and {language} to the files block
- Added tmp_path rule for filesystem test isolation
- Restated file modification permissions for the test-fix loop
- For examples: Removed "demonstrates every requirement" success criterion → replaced with "someone can fully understand how to use the module from reading it"
- For tests: Added Z3 formal verification: "For each edge case, decide whether it is better tested with Z3 formal verification or a unit test" (PDD convention)
- Test plan written as comment block at top of test file instead of just stdout (matches generate_test_LLM.prompt's "first part of the test file should be the detailed test plan in comments")
- Improve test quality: require value-checking assertions instead of just type/existence checks, and build test plan from spec requirements before writing tests
- Prevent common failures: verify imports actually exist, ban silent fallback stubs, always re-run after fixes
- Clarify fix logic: spec is the authority for whether to fix code or tests, don't proceed after crash-fix failure
- Ensure test behavior is the same as the original prompt (only create/modify new tests, not old tests)
- Restructure points for clarity
- Warn against \n in string literals (file-writing tools mangle them, causing 5+ fix loops)
- Add regression test entries to test plan for bugs fixed during Steps 2-3 (plan already required tests per requirement, but fixed bugs weren't explicitly called out as needing their own dedicated regression test)
- Final gate now checks test plan completeness — every plan entry must have a corresponding test before reporting done
Remove one_session_agent_LLM_improved.prompt,
one_session_agent_LLM_original.prompt, workflow_engine_python.prompt,
and workflow_engine.py to keep branch focused on issue-681 changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@niti-go niti-go force-pushed the change/issue-681 branch from 97794ce to 6d19a0a on April 3, 2026 18:54
niti-go added 9 commits April 5, 2026 00:40
- tests/test_one_session_eval.py
- new scenarios in tests/fixtures/one_session_eval
Fixes:
- Run both prompt variants in parallel using ThreadPoolExecutor instead of sequentially (halves wall-clock time per scenario)
- Replace in-place prompt file swapping with PDD_PATH override dirs so parallel runs don't race on a shared file
- Set PDD_SKIP_UPDATE_CHECK=1 and stdin=DEVNULL to prevent the pdd CLI from blocking on interactive upgrade prompts
- Switch judge prompt construction from str.format() to str.replace() and use llm_invoke(messages=) to avoid KeyError on curly braces in Python code, diffs, and error messages (e.g. {token}, {name})
- Send unified diffs + new files to the judge instead of full file contents, so large modules (130KB+) don't get truncated before the judge can see the test files
- Run the LLM judge even when a variant times out, so the output includes analysis of what went wrong
- Increase default timeout from 300s to 480s (720s for large scenarios)
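The str.format() pitfall behind that fix can be reproduced in a few lines (a sketch; the variable names are invented):

```python
# Agent output embedded into a judge prompt often contains literal braces.
diff_text = 'msg = f"Invalid token: {token}"'

# str.format() treats every brace in the template as a placeholder, so a
# template that embeds the diff text raises KeyError: 'token'.
try:
    ("Judge this diff:\n" + diff_text).format()
except KeyError as exc:
    print(exc)  # 'token'

# str.replace() on a sentinel leaves stray braces untouched.
template = "Judge this diff:\n__DIFF__"
prompt = template.replace("__DIFF__", diff_text)
```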

Prompt improvements:
- Split test-writing instructions into two paths: "if test file exists" preserves existing tests and matches their style; "if no test file" writes from scratch with class-per-requirement structure
- Update scenario descriptions (S1, S3, S4, S7) to clarify that adding new tests is acceptable — only deleting/rewriting existing passing tests should be penalized
- Improve output formatting
- Redefine LLM judge prompt to ignore cost and speed when determining the winner prompt
- That folder contains test files that are designed to fail; they are purposely written poorly for one-session sync evaluation
- The pytest.ini approach from earlier didn't work for deeply nested paths, so I removed that and used a conftest instead.
@gltanaka
Contributor

gltanaka commented Apr 6, 2026

Greg to review

Contributor

@gltanaka gltanaka left a comment


Comprehensive Review of PR #713

Niti, this is excellent work — the research-driven approach, the thorough evaluation methodology, and the quality of the writeup are all impressive. This is exactly the kind of rigorous prompt engineering contribution we want to see. Thank you for investing the time to do this properly.

I've done an in-depth analysis of the old vs new prompt, tested the new prompt live on two scenarios, and checked alignment against both our internal prompting guide and Anthropic's latest prompting best practices. Here's the full picture.


Live Testing Results

I cherry-picked the PR onto gltanaka/pdd main and ran two scenarios with the new prompt using Claude as the agent (not Gemini):

Scenario A — Expand Tests (S4 equivalent: stack_calculator)

  • Starting state: 3 basic tests, no example
  • Result: ✅ Success in 2m21s, $0.74
  • Generated 26 tests (from 3), all passing
  • Test plan written as comment block at top of test file ✅
  • Tests grouped by requirement (10 classes covering all 6 spec requirements)
  • Strong value-checking assertions (== 5.0, == 64.0), not just type checks
  • Original 3 tests preserved untouched ✅
  • Clean example generated with error handling demos

Scenario B — Greenfield (S2 equivalent: unit_converter)

  • Starting state: Code only, no tests, no example
  • Result: ✅ Success in 1m46s, $0.59
  • Generated 26 tests, all passing
  • Test plan as comment block mapping each test to spec requirement numbers ✅
  • Uses pytest.approx for floating-point comparisons in round-trip tests
  • Clean example with sys.path setup for portability

Both completed on the first pass with no crash-fix or test-fix loops. This also provides some evidence that the prompt works across models (Gemini in your eval, Claude in mine).

Existing test suite: 3,439 passed, 20 skipped, 1 failed (pre-existing agentic_sync.py bug on main, unrelated to this PR).


Alignment with Internal Prompting Guide (docs/prompting_guide.md)

| Guide Principle | Old Prompt | New Prompt |
|---|---|---|
| Positive over negative constraints | Poor (many DO NOT/NEVER) | ✅ Good (mostly positive framing) |
| Prompt brevity / ratio | Poor (440 lines) | ✅ Better (139 lines) |
| Don't include implementation details | Poor (11 isolation rules, auth patterns, mocking tables) | ✅ Removed most |
| Behavioral requirements | Poor (procedural recipes) | ✅ Success criteria per step |
| Chain of thought | Weak (unstructured DIAGNOSE) | ✅ Error/Root cause/Fix target |
| Critical instruction positioning | Poor | Moderate |
| Avoid mega-prompts | Violates | Still violates (by design for --one-session) |
| Reiterate critical constraints | Over-does it (3x) | May under-do it (1x) |
Net: Substantially more aligned with the prompting guide than the old prompt.


Alignment with Anthropic's Latest Prompting Best Practices

Strong alignment:

  • ✅ "Tell what to do, not what not to do" — positive framing throughout
  • ✅ "Be clear and direct" — success criteria lead each step
  • ✅ "Add context to improve performance" — explains why (e.g., the \n warning explains file-writing tools mangle escape sequences)
  • ✅ "Prefer general instructions over prescriptive steps" — goals + guidelines instead of rigid pseudocode loops
  • ✅ "Structure with XML tags" — <files>, <module_specification>, <generated_code>
  • ✅ "Avoid overengineering" — emphasis on minimality

Could improve (not blockers):

  • No role assignment sentence (Anthropic recommends even a single sentence)
  • No few-shot examples of the new output formats (diagnosis blocks, test plan comments)
  • Some remaining CRITICAL/MUST/NEVER language that Anthropic says to dial back for 4.6-era models
  • The "never report done if tests fail" instruction appears only once — Anthropic's middle-loss research suggests critical constraints benefit from appearing at beginning AND end

These are nice-to-haves for a follow-up, not blockers.


Key Differences Between Old and New Prompt

The most impactful change is the verify step (Step 3). The old prompt focused on comparing program output to the spec — did the example print the right things? The new prompt focuses on verifying the code implements the spec correctly — reading each requirement, checking the corresponding code, and hunting for latent bugs (off-by-one, wrong boolean gating, aliasing vs copying). This is a meaningful quality improvement.

Other highlights:

  • Deduplication: import rules stated 3x → 1x; "never report done" stated 3x → 1x
  • New failure-prevention rules grounded in real observed failures (\n mangling, try/except ImportError stubs)
  • Regression test requirement: bugs fixed in Steps 2-3 get dedicated regression tests
  • Test plan completeness check in Final Gate

Issue Found During Testing

In the stack_calculator test, the agent wrote the new test file at the project root (test_stack_calculator.py) rather than in tests/test_stack_calculator.py alongside the existing tests. The original 3 tests in tests/ were preserved, but the 26 new tests ended up in a separate file at a different location.

This could be a prompt issue or an orchestrator issue (depends on what {test_path} resolves to). Could you investigate whether:

  1. The {test_path} template variable was correctly set to tests/test_stack_calculator.py
  2. If so, whether the agent ignored it and chose its own path
  3. Whether this reproduces in your eval scenarios

If it turns out to be a prompt issue, a small clarification like "Write tests at {test_path} (this exact path, do not create a different file)" might help. If it's an orchestrator issue, we can track it separately.


Summary

This is a high-quality contribution. The prompt is better by every measure we checked — shorter, better-aligned with prompting best practices, more effective in live testing, and grounded in real failure analysis. The evaluation methodology (9 scenarios, LLM-as-judge, hard assertions) sets a great standard for future prompt changes.

Looking forward to getting this merged once the test path issue is investigated! 🎉

@niti-go
Contributor Author

niti-go commented Apr 9, 2026

I updated the prompt file to prevent the bug (the agent writing files to incorrect locations) from reappearing.

It wasn't an orchestrator issue. The {test_path} template variable is correctly resolved to an absolute path by the orchestrator (construct_paths() → generate_output_paths() → os.path.abspath()). The agent received the correct path but chose its own location when writing.

I tried to reproduce the bug you experienced with Gemini 3.1 Pro. I ran 3 runs with the prompt file before these wording changes. However, none of the 3 runs recreated the issue.

| Run | Prompt Version | Misplaced file? | Result |
|---|---|---|---|
| 1 | Old (before fix) | No | Current 24 tests/100% cov, Original 15 tests/97.9% cov |
| 2 | Old (before fix) | No | Current 17 tests/100% cov/$0.79/86s, Original 16 tests/100% cov/$2.11/174s |
| 3 | Old (before fix) | No | Current 20 tests/100% cov/$1.02/112s, Original 14 tests/100% cov/$1.41/107s |

However, it is plausible that the bug can recur given the ambiguous old wording; non-deterministic agent behavior and different providers (Gemini vs. Claude) just make it harder to reproduce.

To avoid this, I added explicit path references at three locations in the prompt where a file-write instruction didn't explicitly specify the target path:

- Otherwise, write a concise, runnable example.
+ Otherwise, write a concise, runnable example at `{example_path}`.

- Add NEW test functions for those gaps.
+ Add new test functions for those gaps to this same file (`{test_path}`).

- Write a new test file from scratch.
+ Write a new test file at `{test_path}`.

I ran the S4 test again after this update, and files were written to the correct locations.

| Run | Prompt Version | Misplaced file? | Result |
|---|---|---|---|
| 2 | New (after fix) | No | Current 17 tests/100% cov/$0.51/61s, Original 17 tests/100% cov/$2.34/180s |

…iles

During PR review testing, the agent wrote new tests to the project root
instead of the specified {test_path}. The prompt instructions for creating
or appending to files didn't always explicitly state the target path.

Three locations updated:
- Step 1 (example): specify {example_path} when creating new examples
- Step 4b (test exists): clarify new tests go in the same file
- Step 4b (test doesn't exist): specify {test_path} for new test files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@niti-go niti-go force-pushed the change/issue-681 branch from a372139 to cfbfe3a on April 9, 2026 02:46
@gltanaka
Contributor

gltanaka commented Apr 9, 2026

Merged into gltanaka/pdd main — will flow to the public repo on the next release.

What we validated before merging

  1. Live testing with Claude (not just Gemini): Ran two scenarios — expand tests (S4 equivalent) and greenfield (S2 equivalent). Both succeeded on first pass with strong results (26 tests each, all passing, test plans as comment blocks, value-checking assertions).

  2. Existing test suite: 3,439 passed, 20 skipped, 1 failed (pre-existing issue on main, unrelated to this PR).

  3. Cloud batch tests: 75/75 passed (pytest, regression, sync_regression, cloud_regression, vitest).

  4. Rebased onto latest origin/main (44 commits ahead) — no conflicts.

  5. Verified the file path fix (commit cfbfe3ab) addresses the test placement issue found during review.

Thanks again for the thorough work on this, Niti!

@gltanaka gltanaka merged commit 30c9605 into promptdriven:main Apr 9, 2026
1 check passed
Serhan-Asad pushed a commit that referenced this pull request Apr 15, 2026
Cherry-picked from #713 for local testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Serhan-Asad pushed a commit that referenced this pull request Apr 15, 2026
Three locations updated to explicitly specify target paths:
- Step 1: "write a concise, runnable example at {example_path}"
- Step 4b (exists): "Add new tests to this same file ({test_path})"
- Step 4b (new): "Write a new test file at {test_path}"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Serhan-Asad pushed a commit that referenced this pull request Apr 15, 2026
Merges #713 (by niti-go). Rewrites the one-session sync
mega-prompt from 440 lines to 144 lines with goal-oriented framing,
deduplication, and targeted failure-prevention rules. Adds automated
evaluation test suite with 9 scenarios (s1-s9).

Key improvements:
- 67% prompt reduction while winning 8/9 eval scenarios
- Success criteria lead each step instead of rigid pseudocode loops
- Test plan required as comment block at top of test file
- Diagnosis (Error/Root cause/Fix target) required before fixing
- Targeted rules for observed failures (\\n mangling, stub fallbacks)

Tested with Claude (live) and Gemini 3.1 Pro Preview (eval suite).
Cloud batch: 75/75 passed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

--one-session command produces example files of poor quality

3 participants