Conversation

@Bat-Reality (Collaborator)

Introduces a novel QA generation module based on a self-challenging mechanism, designed to autonomously synthesize high-quality, reasoning-focused question-answer pairs, inspired by the MindGYM paper.

from data_juicer.utils.lazy_loader import LazyLoader
from data_juicer.utils.model_utils import get_model, prepare_model

torch = LazyLoader('torch', 'torch')
Collaborator: torch and vllm can be imported from model_utils.
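For illustration, a minimal sketch of the suggested import, assuming model_utils re-exports lazily loaded torch and vllm handles as the comment implies (not the PR's actual code):

# Sketch only: assumes data_juicer.utils.model_utils exposes lazily loaded
# torch and vllm objects, per the reviewer's suggestion.
from data_juicer.utils.model_utils import get_model, prepare_model, torch, vllm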

OP_NAME = 'generate_challenging_qa_mapper'


def retry_on_error(func, max_retries=5, delay=1):
Collaborator: can use an existing third-party retry library, or move retry_on_error into a shared util.
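As one possible direction, a minimal sketch using the third-party tenacity library in place of a hand-rolled retry_on_error; the decorated function name is hypothetical, and the attempt count mirrors max_retries=5 above:

from tenacity import retry, stop_after_attempt, wait_fixed

# Sketch only: one initial call plus up to 5 retries with a 1-second pause,
# mirroring retry_on_error(func, max_retries=5, delay=1).
@retry(stop=stop_after_attempt(6), wait=wait_fixed(1))
def generate_qa(model, prompt):
    # hypothetical generation call that may raise on malformed output
    ...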

@HYLcool (Collaborator) commented Jun 30, 2025

Please merge the latest main branch and run pre-commit locally.

@lingzhq (Collaborator) commented Nov 19, 2025

/gemini review

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces a new generate_challenging_qa_mapper operator, a significant feature for generating reasoning-focused QA pairs using a multi-turn conversation with a language model. The implementation is well-structured. However, I've identified a few issues that need attention. There's a critical error handling gap that could lead to a crash if the model's output is malformed. Additionally, the GPU configuration is hardcoded, which could cause runtime errors or suboptimal performance in different environments. I've also provided suggestions to improve code clarity and enhance the new test case with assertions.

Comment on lines 179 to 180
qa = self.extract_json(qa[0].outputs[0].text)
qa["thinking"] = multihop[0].outputs[0].text
critical

If extract_json returns None because the model output is malformed, the subsequent line qa["thinking"] = ... will raise a TypeError, causing the process to crash. It's crucial to check if qa is None and handle this case gracefully, for instance, by raising a ValueError to trigger the retry mechanism with a more informative error message.

Suggested change
qa = self.extract_json(qa[0].outputs[0].text)
if qa is None:
    raise ValueError("Failed to extract valid JSON from model output.")
qa["thinking"] = multihop[0].outputs[0].text

"""
super().__init__(*args, **kwargs)
self.hf_model = hf_model
self.model_key = prepare_model(model_type="huggingface", pretrained_model_name_or_path=hf_model)
medium

This line is redundant as self.model_key is immediately overwritten on line 116. It should be removed to avoid confusion and unnecessary model preparation.
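For reference, a minimal sketch of the quoted constructor lines with the redundant preparation removed; the later assignment the comment refers to is assumed to remain unchanged:

super().__init__(*args, **kwargs)
self.hf_model = hf_model
# self.model_key is assumed to be set later by the prepare_model call the
# comment mentions (line 116 of the diff), so no HuggingFace preparation here.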

Comment on lines +30 to +31
result = op.process(deepcopy(sample))
print(f'Output results: {result}')
medium

This test runs the operator but doesn't have any assertions to verify the output. A test should validate the behavior of the code. Please add assertions to check that the returned result dictionary contains the expected keys from the generated QA pair. This will make the test more meaningful and robust.

Suggested change
result = op.process(deepcopy(sample))
self.assertIn('background_document', result)
self.assertIn('reasoning_category', result)
self.assertIn('sub_questions', result)
self.assertIn('relationship_category', result)
self.assertIn('multihop_question', result)
self.assertIn('multihop_answer', result)
self.assertIn('thinking', result)
print(f'Output results: {result}')

@lingzhq (Collaborator) commented Nov 20, 2025

I ran some tests locally and have a few points of feedback.

Errors Found During Testing:

  • Mixed Languages in Output: The model's output sometimes mixes Chinese and English; for example, one returned sample contained a new field named '背景文档' ("background document").

  • No Clear Termination Condition: The generation process doesn't seem to respect the requested number of outputs. I asked for 3 samples, but the log showed at least 50 "Processed prompts" iterations without stopping.

Other Suggestions for Improvement:

  • JSON Parsing Robustness: The process still requires many retries (always hitting the 6-attempt limit) due to frequent JSON parsing failures.
  • Generation Configs: Consider exposing generation configurations as kwargs to the user instead of hardcoding them (see the sketch after this list).
  • Performance: Generation speed is a bit slow, currently averaging around 45 tokens/s for the output.
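To illustrate the generation-config point, a minimal sketch of exposing sampling settings as operator kwargs instead of hardcoding them; the parameter names and the use of vLLM's SamplingParams here are assumptions for illustration, not the PR's actual interface:

from vllm import SamplingParams

# Sketch only: hypothetical operator __init__ that forwards user-supplied
# generation settings rather than fixed values.
def __init__(self, hf_model, temperature=0.7, top_p=0.9, max_tokens=1024, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.sampling_params = SamplingParams(
        temperature=temperature, top_p=top_p, max_tokens=max_tokens)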
