Conversation

@Cathy0908 (Collaborator)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Cathy0908, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the VideoCaptioningFromFramesMapper by introducing more granular control over how video frames are sourced and how generated captions are integrated into the dataset. It provides new configuration options for specifying frame fields, managing text updates, and handling legacy processing methods, aiming to enhance flexibility and prepare for future architectural changes.

Highlights

  • Enhanced Video Frame Processing: The VideoCaptioningFromFramesMapper now supports processing frames directly from a specified frame_field in the sample, offering an alternative to extracting frames from video files. This provides greater flexibility in how video data is handled.
  • Flexible Text Update Strategies: New text_update_strategy and caption_field parameters have been introduced, allowing users to choose how generated captions update the text field. Options include keeping the original text, rewriting it with new captions, or saving captions to a separate field.
  • Refactored Core Logic: The internal logic for generating captions from frames and processing them has been refactored into new private methods (_gen_caption_from_frames and _process_captions), improving modularity and readability.
  • Deprecation Warning for Legacy Behavior: A legacy_split_by_text_token parameter is added, and a warning is issued when it's set to True, indicating that this behavior will be deprecated in future versions in favor of direct 'videos' or 'frames' field usage.
  • New Validation and Test Cases: Additional validation has been added for parameter combinations (e.g., caption_num with keep_candidate_mode, caption_field with text_update_strategy). Corresponding new test cases have been added to cover these new functionalities and ensure correctness.
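A minimal sketch of how the text-update options described in the highlights might behave. The helper name and exact semantics are assumptions based on the PR summary ('keep_origin', 'rewrite', plus saving captions to a separate caption_field); the operator's real implementation may differ:

```python
def apply_text_update_strategy(sample, captions, strategy,
                               caption_field=None):
    # Hypothetical helper mirroring the strategies from the PR summary;
    # not the operator's actual code.
    if strategy == 'keep_origin':
        pass  # leave sample['text'] untouched
    elif strategy == 'rewrite':
        # replace the text with the generated captions
        sample['text'] = ' '.join(captions)
    else:
        raise ValueError(f'unknown text_update_strategy: {strategy}')
    if caption_field is not None:
        # optionally save captions to a separate field
        sample[caption_field] = captions
    return sample
```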




@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces significant updates to the video captioning mapper, adding support for pre-extracted frames and offering more flexible caption handling. The refactoring into helper methods is a good step towards better code organization. However, I've identified several critical issues in the implementation. The legacy code path is broken when used with the new frame_field parameter, and the batch processing logic has flaws that can lead to runtime errors. Additionally, there are some minor issues with documentation and tests. Addressing these issues is crucial for the stability and correctness of this operator.

Comment on lines +410 to +421
for video_key in loaded_video_keys[offset:offset + video_count]:
    if load_video_data:
        inp = videos[video_key]
    else:
        frames = ori_sample[self.frame_field][idx]
        # select frame_num frames from the loaded frames
        if self.frame_num >= len(frames):
            inp = frames
        else:
            # sample frame_num frames evenly across the clip
            indices = np.linspace(0, len(frames) - 1,
                                  self.frame_num, dtype=int)
            inp = [frames[i] for i in indices]

    prompt_texts = None

    inputs = processor(
        text=prompt_texts,
        images=video_frame_videos_chunk,
        return_tensors='pt',
    ).to(model.device)
    with torch.no_grad():
        for i in range(self.caption_num):
            generated_ids = model.generate(**inputs,
                                           max_new_tokens=128,
                                           do_sample=True)
            generated_text = processor.batch_decode(
                generated_ids, skip_special_tokens=True)
            generated_text_candidates_single_chunk[i] += [
                '. '.join([txt.strip() for txt in generated_text])
            ]

# 3. insert a list of generated captions into the positions of
#    subsequent placeholders in the original string
new_generated_text_all_videos = [
    [] for _ in range(self.num_newly_generated_samples)
]
# new_generated_text_all_videos is a helper array; element [i][j]
# denotes the reduced i-th result for the j-th video

# reduce the captions according to the given mode, video by video
for j in range(video_count):
    new_generated_text_per_video = self._reduce_captions(
        chunk,
        [
            captions[j]
            for captions in generated_text_candidates_single_chunk
        ],
    )
    assert self.num_newly_generated_samples == len(
        new_generated_text_per_video)
    for i in range(len(new_generated_text_per_video)):
        new_generated_text_all_videos[i].append(
            new_generated_text_per_video[i])

# insert the captions according to the given mode
place_holders = [SpecialTokens.video] * video_count
for i in range(self.num_newly_generated_samples):
    generated_text_per_chunk = insert_texts_after_placeholders(
        original_string=text_with_only_special_tokens,
        placeholders=place_holders,
        new_texts=new_generated_text_all_videos[i],
    )
    generated_samples[i][self.text_key] += \
        f'{generated_text_per_chunk}{SpecialTokens.eoc}'

offset += video_count
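The even frame-sampling step in the snippet above can be isolated as a small helper (a sketch; the function name is assumed, the np.linspace logic follows the diff):

```python
import numpy as np

def sample_frames(frames, frame_num):
    # Return all frames when frame_num covers the whole clip;
    # otherwise pick frame_num frames evenly spaced across it,
    # mirroring the np.linspace logic in the snippet above.
    if frame_num >= len(frames):
        return frames
    indices = np.linspace(0, len(frames) - 1, frame_num, dtype=int)
    return [frames[i] for i in indices]
```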


critical

The legacy code path (legacy_split_by_text_token=True) has a critical bug when frame_field is used. If frame_field is provided, load_video_data becomes False, and loaded_video_keys is not initialized, leading to a NameError at line 410. Subsequently, idx is used at line 414 without being defined in this scope.

Given this is a legacy path, the safest solution would be to disallow using frame_field in this mode. Please consider adding a check in __init__ to raise a ValueError if both legacy_split_by_text_token and frame_field are set, and update the docstring for legacy_split_by_text_token to reflect this limitation.
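The suggested guard could look like this (a sketch only; the real __init__ takes many more parameters, and the exact message is an assumption):

```python
def validate_legacy_mode(legacy_split_by_text_token, frame_field):
    # Reject the unsupported combination up front, as suggested above:
    # the legacy path only knows how to load frames from video files.
    if legacy_split_by_text_token and frame_field is not None:
        raise ValueError(
            'legacy_split_by_text_token=True is incompatible with '
            'frame_field; remove frame_field or disable the legacy mode.')
```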

Comment on lines +517 to 520
keys = set(
    list(samples_after_generation[0].keys()) + list(keys_after_process))
res_samples = {}
for key in keys:
    res_samples[key] = [s[key] for s in samples_after_generation]

critical

This logic for reconstructing samples is buggy and can lead to errors:

  1. It can raise an IndexError if samples_after_generation is empty.
  2. It will raise a KeyError if a sample (e.g., an original sample) does not contain a new key that was added to other generated samples.

The logic should be refactored to be more robust by iterating over all samples to collect all keys and using .get() for safe access.

Suggested change
keys = set(
    list(samples_after_generation[0].keys()) + list(keys_after_process))
res_samples = {}
for key in keys:
    res_samples[key] = [s[key] for s in samples_after_generation]

if not samples_after_generation:
    return {key: [] for key in samples.keys()}
keys = set()
for s in samples_after_generation:
    keys.update(s.keys())
res_samples = {}
for key in keys:
    res_samples[key] = [s.get(key) for s in samples_after_generation]
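A toy demonstration of why the union-of-keys plus .get() approach is robust (sample contents are made up for illustration): samples missing a key yield None instead of raising KeyError.

```python
# Two generated samples where only the second carries a 'captions' key,
# as can happen when captions are saved to a separate field.
samples_after_generation = [
    {'text': 'a', 'videos': ['v1.mp4']},
    {'text': 'b', 'videos': ['v2.mp4'], 'captions': ['a cat']},
]

# Collect the union of keys across all samples, then access safely.
keys = set()
for s in samples_after_generation:
    keys.update(s.keys())
res_samples = {k: [s.get(k) for s in samples_after_generation]
               for k in keys}
```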

Comment on lines +131 to +134
:param text_update_strategy: strategy to update the text field after caption
generation. Can be one of ['keep_origin', 'rewrite', 'tile'].
'keep_origin': keep the original text unchanged.
'rewrite': rewrite the text field with the generated captions concatenated by special tokens.

medium

The documentation for text_update_strategy mentions 'tile' as a possible value but does not explain what it does. Please add a description for the 'tile' option or remove it from the list if it's not supported.

Comment on lines +301 to +304
captions_list = []
for i in range(self.num_newly_generated_samples):
    captions_list.append(new_generated_text_all_videos[i])
return captions_list

medium

This block of code creates a new list captions_list by appending elements from new_generated_text_all_videos. Since new_generated_text_all_videos is already in the desired list-of-lists format, you can directly return it to make the code more concise and efficient.

Suggested change
captions_list = []
for i in range(self.num_newly_generated_samples):
    captions_list.append(new_generated_text_all_videos[i])
return captions_list

return new_generated_text_all_videos

self.assertNotEqual(result[0]['text'], ds_list[0]['text'])
self.assertTrue(result[0]['text'].startswith(SpecialTokens.video))
self.assertTrue(result[0]['text'].endswith(SpecialTokens.eoc))
self.assertTrue(result[0]['text'].count(SpecialTokens.video), 2)

medium

self.assertTrue is used here with an integer count. While assertTrue(2) evaluates to true, it doesn't actually check if the count is 2. You should use self.assertEqual to assert that the count is exactly 2.

Suggested change
self.assertTrue(result[0]['text'].count(SpecialTokens.video), 2)
self.assertEqual(result[0]['text'].count(SpecialTokens.video), 2)
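To see why the original assertion silently passes: assertTrue treats its second argument as a failure message, not a comparison value, so assertTrue(count, 2) succeeds for any truthy count. A self-contained demonstration (the test class and values are illustrative only):

```python
import unittest

class CountAssertionDemo(unittest.TestCase):
    def test_truthiness_vs_equality(self):
        count = 3  # pretend the text contains 3 video tokens
        # assertTrue only checks truthiness; the 2 is a failure message,
        # so this passes even though count != 2:
        self.assertTrue(count, 2)
        # assertEqual actually compares the values and fails here:
        with self.assertRaises(AssertionError):
            self.assertEqual(count, 2)
```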
