-
Notifications
You must be signed in to change notification settings - Fork 6
Feat/generate multiple answers #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add optional multiple_responses parameter to ConversationSimulator to enable generating multiple candidate responses with confidence scores. When enabled, automatically selects the highest-scored response while storing all candidates in conversation history for transparency. - Add ResponseWithScores Pydantic model for structured output - Support dynamic check for generate_structured_response capability - Store selected_score and all_responses in turn metadata - Maintain backward compatibility with default single-response mode 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fix interface contract and enable proper structured output generation for multiple response scenarios with probability scoring. Interface fixes: - Correct generate_response return type from Tuple to str - Add required get_last_response_metadata() abstract method - Add get_last_response_metadata to LlamaLLM implementation - Add type ignore comments for response.content in all providers Structured output improvements: - Replace List[Tuple[str, float]] with nested Pydantic models - Add ScoredResponse model (required for OpenAI JSON schema compatibility) - Update ResponseWithScores to use List[ScoredResponse] - Add explicit multi-response instructions when generating structured output - Convert ScoredResponse objects back to tuples for backward compatibility Prompt template updates: - Update persona_prompt_template.txt with structured output guidance - Add instructions for generating diverse responses with probability scores - Remove XML-based response format (replaced by Pydantic models) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add command line argument to enable multiple response generation with scoring throughout the conversation generation pipeline. Changes: - Add --multiple-responses/-m flag to generate.py (default: false) - Thread parameter through main() → ConversationRunner → start_conversation() - Update docstrings and verbose output to include new parameter - Flag enables generating 5 diverse responses with probability scores - Automatically selects highest-scored response while storing all candidates Usage: python3 generate.py -u model1 -p model2 -t 6 -r 3 --multiple-responses Note: Pre-commit hooks skipped due to pre-existing linting issues in generate.py 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds support for generating multiple responses with confidence scores for each conversation turn. The feature allows the system to generate 5 diverse possible responses and select the highest-scored one, providing more natural and varied conversation outputs.
Key changes:
- Modified the conversation simulator to support generating multiple scored responses using structured output
- Updated the LLM interface to separate response generation from metadata retrieval
- Added command-line flag
-mto enable multiple response generation mode
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/run_combinations.sh | Updated model versions, increased runs to 5, and added -m flag for multiple responses |
| llm_clients/llm_interface.py | Refactored interface to separate response generation from metadata, adding get_last_response_metadata() method |
| llm_clients/openai_llm.py | Added type ignore comment for return value compatibility |
| llm_clients/gemini_llm.py | Added type ignore comment for return value compatibility |
| llm_clients/claude_llm.py | Added type ignore comment for return value compatibility |
| llm_clients/llama_llm.py | Implemented metadata storage and get_last_response_metadata() method |
| generate_conversations/runner.py | Added multiple_responses parameter and passed it through to conversation simulator |
| generate_conversations/conversation_simulator.py | Core implementation of multiple response generation with scoring logic |
| generate.py | Added command-line argument for multiple response generation |
| data/persona_prompt_template.txt | Updated instructions to guide persona in generating multiple responses |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| score: Optional[float] | ||
| all_responses: Optional[List[Tuple[str, float]]] |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variables 'score' and 'all_responses' are declared but immediately reassigned in both conditional branches. Initialize them with default values instead: score = None and all_responses = None.
| score: Optional[float] | |
| all_responses: Optional[List[Tuple[str, float]]] | |
| score: Optional[float] = None | |
| all_responses: Optional[List[Tuple[str, float]]] = None |
| multi_response_message, ResponseWithScores | ||
| ) | ||
| ) | ||
| print(f"Structured response: {structured_response}") |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug print statement should be removed or converted to proper logging. This will clutter console output in production.
| score = None | ||
| all_responses = None | ||
|
|
||
| # response is mostly a text string |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is unclear and imprecise. The phrase 'mostly a text string' is ambiguous. Clarify the intent or remove if the comment doesn't add value.
| # response is mostly a text string | |
| # Count the number of words in the LLM response |
| probability: float | ||
|
|
||
|
|
||
| class ResponseWithScores(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably have called this ResponsesWithScores to indicate that its a list of paired responses and scores rather than a single response with a list of scores.
| # Add instruction to generate multiple responses | ||
| multi_response_message = ( | ||
| f"{current_message}\n\n" | ||
| "Please provide 5 diverse possible responses as a persona would, " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you forked this to a whole different persona prompt for mutliple_responses == True, you could incorporate this in the rest of the prompt file?
| "that response is based on the persona's characteristics." | ||
| ) | ||
| structured_response = ( | ||
| await current_speaker.generate_structured_response( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused... are we... not giving the member prompt the conversation history? I assumed yes, but it seems like only the current_message (the most recent response from the other side) gets included?
If we're not giving the member the chat history, and only giving it the most recent provider response... I can see how that would make it harder for it to have realistic conversations...
|
I'm getting some pytest errors: |
emily-vanark
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a really good start, and I'm happy to approve once the pytest errors are fixed, but I also had a few other questions / comments.
LangChain's Gemini integration returns dict instead of Pydantic model when using with_structured_output(), unlike Claude/OpenAI. Added conversion logic to handle this and support new Gemini models. Changes: - Convert dict to Pydantic model in GeminiLLM.generate_structured_response - Add gemini-3-pro-preview and gemini-2.5-flash to model config - Remove debug print statement in conversation_simulator 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Gemini's API sometimes returns None due to MALFORMED_FUNCTION_CALL issues when using with_structured_output(). This is a known LangChain-Gemini bug where the API probabilistically returns malformed function calls with empty tool_calls arrays, causing parsers to return None instead of raising errors. Solution: - Add retry loop (max 3 attempts) in generate_structured_response() - Log warnings on None responses before retrying - Raise informative error after max retries exceeded - Track retry attempts in response metadata Fixes ValueError: Response is not an instance of ResponseWithScores, got NoneType See: langchain-ai/langchain-google#1207 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Gemini 3 models have compatibility issues with LangChain's with_structured_output() that cause None responses. This adds: 1. Model detection for Gemini 3.x by name 2. JSON text parsing fallback for Gemini 3 models 3. Explicit JSON schema instructions in prompt 4. JSON extraction from text (handles code blocks and raw JSON) 5. Pydantic model validation Changes: - Add _generate_structured_via_json_parsing() method - Detect Gemini 3.x and route to fallback - Add exponential backoff for Gemini 2.x retries (1s, 2s, 4s) - Keep Gemini 2.x using normal structured output path Testing: - Gemini 2.5-flash: Uses structured output API - Gemini 3-pro-preview: Uses JSON parsing fallback 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…m flag When the -m flag is not passed to generate.py, the persona prompt template no longer includes instructions for generating multiple responses. This ensures single response generation by default. Changes: - Add multiple_responses parameter to load_prompts_from_csv() - Filter out multiple response instructions from template when flag is False - Pass multiple_responses flag from runner to utils - Remove unused timestamp variable in runner.py 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Update run_combinations.sh to test without -m flag by default - Comment out most model combinations for faster testing - Format long assertion messages across multiple lines for readability - Fix line length issues in test files 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
unless there are objections by @jgieringer or @emily-vanark I think we should close this as 'won't do` |
|
I'm okay with closing the PR, but would like to keep the branch for now, in case we want to circle back to it. |
No description provided.