[WIP] update video ops #867
Conversation
Code Review
This pull request introduces significant updates to the video captioning mapper, adding support for pre-extracted frames and offering more flexible caption handling. The refactoring into helper methods is a good step towards better code organization. However, I've identified several critical issues in the implementation. The legacy code path is broken when used with the new frame_field parameter, and the batch processing logic has flaws that can lead to runtime errors. Additionally, there are some minor issues with documentation and tests. Addressing these issues is crucial for the stability and correctness of this operator.
```python
for video_key in loaded_video_keys[offset:offset + video_count]:
    if load_video_data:
        inp = videos[video_key]
    else:
        frames = ori_sample[self.frame_field][idx]
        # select frames_num frames from inp frames
        if self.frame_num >= len(frames):
            inp = frames
        else:
            prompt_texts = None
# ...
inputs = processor(
    text=prompt_texts,
    images=video_frame_videos_chunk,
    return_tensors='pt',
).to(model.device)
with torch.no_grad():
    for i in range(self.caption_num):
        generated_ids = model.generate(**inputs,
                                       max_new_tokens=128,
                                       do_sample=True)
        generated_text = processor.batch_decode(
            generated_ids, skip_special_tokens=True)
        generated_text_candidates_single_chunk[i] += [
            '. '.join([txt.strip() for txt in generated_text])
        ]
# ...
# 3. insert a list of generated captions into the positions of
# subsequent placeholders in the original string
new_generated_text_all_videos = [
    [] for _ in range(self.num_newly_generated_samples)
]
# new_generated_text_all_videos is a helper array, element [i][j]
# denotes the reduced $i$-th result for the $j$-th video
# ...
# reduce the captions according to given mode video by video
for j in range(video_count):
    new_generated_text_per_video = self._reduce_captions(
        chunk,
        [
            captions[j] for captions in
            generated_text_candidates_single_chunk
        ],
    )
    assert self.num_newly_generated_samples == len(
        new_generated_text_per_video)
    for i in range(len(new_generated_text_per_video)):
        new_generated_text_all_videos[i].append(
            new_generated_text_per_video[i])
# ...
# insert the captions according to given mode
place_holders = [SpecialTokens.video] * video_count
for i in range(self.num_newly_generated_samples):
    generated_text_per_chunk = insert_texts_after_placeholders(
        original_string=text_with_only_special_tokens,
        placeholders=place_holders,
        new_texts=new_generated_text_all_videos[i],
    )
    generated_samples[i][
        self.text_key] += f'{generated_text_per_chunk}' \
                          f'{SpecialTokens.eoc}'
# ...
offset += video_count
# ...
if not context:
    indices = np.linspace(0, len(frames)-1, self.frame_num, dtype=int)
    inp = [frames[i] for i in indices]
```
The legacy code path (`legacy_split_by_text_token=True`) has a critical bug when `frame_field` is used. If `frame_field` is provided, `load_video_data` becomes `False`, and `loaded_video_keys` is never initialized, leading to a `NameError` at line 410. Subsequently, `idx` is used at line 414 without being defined in this scope.
Given this is a legacy path, the safest solution would be to disallow using `frame_field` in this mode. Please consider adding a check in `__init__` to raise a `ValueError` if both `legacy_split_by_text_token` and `frame_field` are set, and update the docstring for `legacy_split_by_text_token` to reflect this limitation.
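A minimal sketch of such a guard, assuming the keyword names used in this review (`frame_field`, `legacy_split_by_text_token`); the surrounding signature and attribute assignments are illustrative only, not the operator's actual constructor:

```python
def __init__(self, *args, frame_field=None,
             legacy_split_by_text_token=False, **kwargs):
    super().__init__(*args, **kwargs)
    # The legacy token-splitting path cannot consume pre-extracted frames,
    # so reject the incompatible combination up front.
    if legacy_split_by_text_token and frame_field is not None:
        raise ValueError(
            'frame_field is not supported when '
            'legacy_split_by_text_token=True; disable the legacy mode '
            'or remove frame_field.')
    self.frame_field = frame_field
    self.legacy_split_by_text_token = legacy_split_by_text_token
```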
```python
keys = set(list(samples_after_generation[0].keys()) + list(keys_after_process))
res_samples = {}
for key in keys:
    res_samples[key] = [s[key] for s in samples_after_generation]
```
This logic for reconstructing samples is buggy and can lead to errors:
- It can raise an `IndexError` if `samples_after_generation` is empty.
- It will raise a `KeyError` if a sample (e.g., an original sample) does not contain a new key that was added to other generated samples.

The logic should be refactored to be more robust by iterating over all samples to collect all keys and using `.get()` for safe access (a standalone illustration follows the suggested change below).
```diff
-keys = set(list(samples_after_generation[0].keys()) + list(keys_after_process))
-res_samples = {}
-for key in keys:
-    res_samples[key] = [s[key] for s in samples_after_generation]
+if not samples_after_generation:
+    return {key: [] for key in samples.keys()}
+keys = set()
+for s in samples_after_generation:
+    keys.update(s.keys())
+res_samples = {}
+for key in keys:
+    res_samples[key] = [s.get(key) for s in samples_after_generation]
```
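For illustration only, a small standalone sketch (with made-up sample dicts, not code from this PR) of how the `.get()`-based reconstruction behaves when one sample carries a key the others lack:

```python
samples_after_generation = [
    {'text': 'original text', 'videos': ['v1.mp4']},
    {'text': 'generated caption', 'videos': ['v1.mp4'], 'source': 'caption_op'},
]

# collect the union of keys across all samples
keys = set()
for s in samples_after_generation:
    keys.update(s.keys())

res_samples = {key: [s.get(key) for s in samples_after_generation]
               for key in keys}
# Missing keys are padded with None instead of raising KeyError:
# res_samples['source'] == [None, 'caption_op']
```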
```python
:param text_update_strategy: strategy to update the text field after caption
    generation. Can be one of ['keep_origin', 'rewrite', 'tile'].
    'keep_origin': keep the original text unchanged.
    'rewrite': rewrite the text field with the generated captions concated by special tokens.
```
```python
captions_list = []
for i in range(self.num_newly_generated_samples):
    captions_list.append(new_generated_text_all_videos[i])
return captions_list
```
This block creates a new list `captions_list` by appending elements from `new_generated_text_all_videos`. Since `new_generated_text_all_videos` is already in the desired list-of-lists format, you can return it directly, which is more concise and avoids an unnecessary copy.
```diff
-captions_list = []
-for i in range(self.num_newly_generated_samples):
-    captions_list.append(new_generated_text_all_videos[i])
-return captions_list
+return new_generated_text_all_videos
```
```python
self.assertNotEqual(result[0]['text'], ds_list[0]['text'])
self.assertTrue(result[0]['text'].startswith(SpecialTokens.video))
self.assertTrue(result[0]['text'].endswith(SpecialTokens.eoc))
self.assertTrue(result[0]['text'].count(SpecialTokens.video), 2)
```
`self.assertTrue` is used here with an integer count. `assertTrue` treats its second argument as the failure message, so `assertTrue(count, 2)` passes whenever the count is non-zero and never checks that it equals 2. Use `self.assertEqual` to assert that the count is exactly 2; a small demonstration of the pitfall follows the suggestion below.
```diff
-self.assertTrue(result[0]['text'].count(SpecialTokens.video), 2)
+self.assertEqual(result[0]['text'].count(SpecialTokens.video), 2)
```
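A minimal, self-contained demonstration of the pitfall (a hypothetical test, not code from this PR): the first assertion passes even though the count is wrong, because the `2` is consumed as the failure message.

```python
import unittest


class AssertTruePitfall(unittest.TestCase):

    def test_count(self):
        count = 3  # deliberately not the expected value
        # Passes: assertTrue only checks truthiness; 2 is used as the msg.
        self.assertTrue(count, 2)
        # Fails as intended when the count is wrong.
        self.assertEqual(count, 2)


if __name__ == '__main__':
    unittest.main()
```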