Conversation

@Cathy0908 (Collaborator)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Cathy0908, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the VideoCaptioningFromFramesMapper by introducing more granular control over how video frames are sourced and how generated captions are integrated into the dataset. It provides new configuration options for specifying frame fields, managing text updates, and handling legacy processing methods, aiming to enhance flexibility and prepare for future architectural changes.

Highlights

  • Enhanced Video Frame Processing: The VideoCaptioningFromFramesMapper now supports processing frames directly from a specified frame_field in the sample, offering an alternative to extracting frames from video files. This provides greater flexibility in how video data is handled.
  • Flexible Text Update Strategies: New text_update_strategy and caption_field parameters have been introduced, allowing users to choose how generated captions update the text field. Options include keeping the original text, rewriting it with new captions, or saving captions to a separate field.
  • Refactored Core Logic: The internal logic for generating captions from frames and processing them has been refactored into new private methods (_gen_caption_from_frames and _process_captions), improving modularity and readability.
  • Deprecation Warning for Legacy Behavior: A legacy_split_by_text_token parameter is added, and a warning is issued when it's set to True, indicating that this behavior will be deprecated in future versions in favor of direct 'videos' or 'frames' field usage.
  • New Validation and Test Cases: Additional validation has been added for parameter combinations (e.g., caption_num with keep_candidate_mode, caption_field with text_update_strategy). Corresponding new test cases have been added to cover these new functionalities and ensure correctness.
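A minimal sketch of how the text-update options described in the highlights might behave. The helper name and exact semantics are assumptions based on the PR summary ('keep_origin', 'rewrite', plus saving captions to a separate caption_field); the operator's real implementation may differ:

```python
def apply_text_update_strategy(sample, captions, strategy,
                               caption_field=None):
    # Hypothetical helper mirroring the strategies from the PR summary;
    # not the operator's actual code.
    if strategy == 'keep_origin':
        pass  # leave sample['text'] untouched
    elif strategy == 'rewrite':
        # replace the text with the generated captions
        sample['text'] = ' '.join(captions)
    else:
        raise ValueError(f'unknown text_update_strategy: {strategy}')
    if caption_field is not None:
        # optionally save captions to a separate field
        sample[caption_field] = captions
    return sample
```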




@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces significant updates to the video captioning mapper, adding support for pre-extracted frames and offering more flexible caption handling. The refactoring into helper methods is a good step towards better code organization. However, I've identified several critical issues in the implementation. The legacy code path is broken when used with the new frame_field parameter, and the batch processing logic has flaws that can lead to runtime errors. Additionally, there are some minor issues with documentation and tests. Addressing these issues is crucial for the stability and correctness of this operator.

Comment on lines +410 to +421
for video_key in loaded_video_keys[offset:offset + video_count]:
    if load_video_data:
        inp = videos[video_key]
    else:
        frames = ori_sample[self.frame_field][idx]
        # select frame_num frames from the loaded frames
        if self.frame_num >= len(frames):
            inp = frames
        else:
            # sample frame_num frames evenly across the clip
            indices = np.linspace(0, len(frames) - 1,
                                  self.frame_num, dtype=int)
            inp = [frames[i] for i in indices]

    prompt_texts = None

    inputs = processor(
        text=prompt_texts,
        images=video_frame_videos_chunk,
        return_tensors='pt',
    ).to(model.device)
    with torch.no_grad():
        for i in range(self.caption_num):
            generated_ids = model.generate(**inputs,
                                           max_new_tokens=128,
                                           do_sample=True)
            generated_text = processor.batch_decode(
                generated_ids, skip_special_tokens=True)
            generated_text_candidates_single_chunk[i] += [
                '. '.join([txt.strip() for txt in generated_text])
            ]

# 3. insert a list of generated captions into the positions of
#    subsequent placeholders in the original string
new_generated_text_all_videos = [
    [] for _ in range(self.num_newly_generated_samples)
]
# new_generated_text_all_videos is a helper array; element [i][j]
# denotes the reduced i-th result for the j-th video

# reduce the captions according to the given mode, video by video
for j in range(video_count):
    new_generated_text_per_video = self._reduce_captions(
        chunk,
        [
            captions[j]
            for captions in generated_text_candidates_single_chunk
        ],
    )
    assert self.num_newly_generated_samples == len(
        new_generated_text_per_video)
    for i in range(len(new_generated_text_per_video)):
        new_generated_text_all_videos[i].append(
            new_generated_text_per_video[i])

# insert the captions according to the given mode
place_holders = [SpecialTokens.video] * video_count
for i in range(self.num_newly_generated_samples):
    generated_text_per_chunk = insert_texts_after_placeholders(
        original_string=text_with_only_special_tokens,
        placeholders=place_holders,
        new_texts=new_generated_text_all_videos[i],
    )
    generated_samples[i][self.text_key] += \
        f'{generated_text_per_chunk}{SpecialTokens.eoc}'

offset += video_count
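The even frame-sampling step in the snippet above can be isolated as a small helper (a sketch; the function name is assumed, the np.linspace logic follows the diff):

```python
import numpy as np

def sample_frames(frames, frame_num):
    # Return all frames when frame_num covers the whole clip;
    # otherwise pick frame_num frames evenly spaced across it,
    # mirroring the np.linspace logic in the snippet above.
    if frame_num >= len(frames):
        return frames
    indices = np.linspace(0, len(frames) - 1, frame_num, dtype=int)
    return [frames[i] for i in indices]
```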


critical

The legacy code path (legacy_split_by_text_token=True) has a critical bug when frame_field is used. If frame_field is provided, load_video_data becomes False, and loaded_video_keys is not initialized, leading to a NameError at line 410. Subsequently, idx is used at line 414 without being defined in this scope.

Given this is a legacy path, the safest solution would be to disallow using frame_field in this mode. Please consider adding a check in __init__ to raise a ValueError if both legacy_split_by_text_token and frame_field are set, and update the docstring for legacy_split_by_text_token to reflect this limitation.
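The suggested guard could look like this (a sketch only; the real __init__ takes many more parameters, and the exact message is an assumption):

```python
def validate_legacy_mode(legacy_split_by_text_token, frame_field):
    # Reject the unsupported combination up front, as suggested above:
    # the legacy path only knows how to load frames from video files.
    if legacy_split_by_text_token and frame_field is not None:
        raise ValueError(
            'legacy_split_by_text_token=True is incompatible with '
            'frame_field; remove frame_field or disable the legacy mode.')
```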

Comment on lines +517 to 520
keys = set(
    list(samples_after_generation[0].keys()) + list(keys_after_process))
res_samples = {}
for key in keys:
    res_samples[key] = [s[key] for s in samples_after_generation]

critical

This logic for reconstructing samples is buggy and can lead to errors:

  1. It can raise an IndexError if samples_after_generation is empty.
  2. It will raise a KeyError if a sample (e.g., an original sample) does not contain a new key that was added to other generated samples.

The logic should be refactored to be more robust by iterating over all samples to collect all keys and using .get() for safe access.

Suggested change
keys = set(
    list(samples_after_generation[0].keys()) + list(keys_after_process))
res_samples = {}
for key in keys:
    res_samples[key] = [s[key] for s in samples_after_generation]

if not samples_after_generation:
    return {key: [] for key in samples.keys()}
keys = set()
for s in samples_after_generation:
    keys.update(s.keys())
res_samples = {}
for key in keys:
    res_samples[key] = [s.get(key) for s in samples_after_generation]
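A toy demonstration of why the union-of-keys plus .get() approach is robust (sample contents are made up for illustration): samples missing a key yield None instead of raising KeyError.

```python
# Two generated samples where only the second carries a 'captions' key,
# as can happen when captions are saved to a separate field.
samples_after_generation = [
    {'text': 'a', 'videos': ['v1.mp4']},
    {'text': 'b', 'videos': ['v2.mp4'], 'captions': ['a cat']},
]

# Collect the union of keys across all samples, then access safely.
keys = set()
for s in samples_after_generation:
    keys.update(s.keys())
res_samples = {k: [s.get(k) for s in samples_after_generation]
               for k in keys}
```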

Comment on lines +131 to +134
:param text_update_strategy: strategy to update the text field after caption
generation. Can be one of ['keep_origin', 'rewrite', 'tile'].
'keep_origin': keep the original text unchanged.
'rewrite': rewrite the text field with the generated captions concatenated by special tokens.

medium

The documentation for text_update_strategy mentions 'tile' as a possible value but does not explain what it does. Please add a description for the 'tile' option or remove it from the list if it's not supported.

Comment on lines +301 to +304
captions_list = []
for i in range(self.num_newly_generated_samples):
    captions_list.append(new_generated_text_all_videos[i])
return captions_list

medium

This block of code creates a new list captions_list by appending elements from new_generated_text_all_videos. Since new_generated_text_all_videos is already in the desired list-of-lists format, you can directly return it to make the code more concise and efficient.

Suggested change
captions_list = []
for i in range(self.num_newly_generated_samples):
    captions_list.append(new_generated_text_all_videos[i])
return captions_list

return new_generated_text_all_videos

self.assertNotEqual(result[0]['text'], ds_list[0]['text'])
self.assertTrue(result[0]['text'].startswith(SpecialTokens.video))
self.assertTrue(result[0]['text'].endswith(SpecialTokens.eoc))
self.assertTrue(result[0]['text'].count(SpecialTokens.video), 2)

medium

self.assertTrue is used here with an integer count. While assertTrue(2) evaluates to true, it doesn't actually check if the count is 2. You should use self.assertEqual to assert that the count is exactly 2.

Suggested change
self.assertTrue(result[0]['text'].count(SpecialTokens.video), 2)
self.assertEqual(result[0]['text'].count(SpecialTokens.video), 2)
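To see why the original assertion silently passes: assertTrue treats its second argument as a failure message, not a comparison value, so assertTrue(count, 2) succeeds for any truthy count. A self-contained demonstration (the test class and values are illustrative only):

```python
import unittest

class CountAssertionDemo(unittest.TestCase):
    def test_truthiness_vs_equality(self):
        count = 3  # pretend the text contains 3 video tokens
        # assertTrue only checks truthiness; the 2 is a failure message,
        # so this passes even though count != 2:
        self.assertTrue(count, 2)
        # assertEqual actually compares the values and fails here:
        with self.assertRaises(AssertionError):
            self.assertEqual(count, 2)
```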
