Skip to content

Conversation

@archernsy
Copy link

No description provided.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @archernsy, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the video processing functionalities by integrating two advanced computer vision operators. The DepthAnythingV2Mapper enables the generation of detailed depth maps from video content, providing crucial 3D scene understanding. Concurrently, the SAMAutoMaskMapper introduces a robust automatic object segmentation pipeline, utilizing SAM1 for initial object detection on keyframes and SAM2 for efficient, temporally consistent mask propagation throughout the video. These additions broaden the scope of video data enrichment and analysis within the framework.

Highlights

  • New Depth Estimation Operator: Introduces DepthAnythingV2Mapper for generating detailed depth maps from video frames using the Depth Anything V2 model, enhancing 3D scene understanding capabilities.
  • New Automatic Mask Generation Operator: Adds SAMAutoMaskMapper which leverages SAM1 for initial mask detection on keyframes and SAM2 for efficient, temporally consistent mask propagation throughout video sequences.
  • Integration of Advanced Computer Vision Models: Incorporates complex and state-of-the-art model architectures such as DINOv2, DPT, SAM1, and SAM2, along with their respective utility components and custom CUDA kernels, into the data processing framework.
  • Enhanced Video Processing Capabilities: Significantly expands the library's functionalities to include sophisticated video analysis tasks, providing tools for both depth perception and robust object segmentation.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@yxdyc yxdyc requested review from Cathy0908 and Qirui-jiao and removed request for Copilot December 22, 2025 07:29
@yxdyc yxdyc added the dj:op issues/PRs about some specific OPs label Dec 22, 2025
@archernsy
Copy link
Author

This PR introduces two new operators:

DepthAnythingv2: For single-frame depth extraction from videos.
Autoseg2: A fully automatic video segmentation tool (based on SAM1 and SAM2). It can automatically segment all content in a video, dividing frames into different objects or regions without requiring manual prompts.
All feedback and suggestions are welcome

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two new operators, DepthAnythingV2Mapper and SAMAutoMaskMapper, for depth estimation and automatic video segmentation. The changes are extensive and involve adding a significant amount of model code, seemingly from external repositories. My review focuses on the integration of these models as data_juicer operators. I've identified several critical and high-severity issues related to dependency management (cloning git repos at runtime), hardcoded absolute paths, performance (inefficient model loading), and potential bugs (incorrect sorting). These issues could affect the portability, maintainability, and performance of the new operators and should be addressed.

self,
sam1_checkpoint_path: str = "./checkpoints/sam1/sam_vit_h_4b8939.pth",
sam2_checkpoint_path: str = "./checkpoints/sam2/sam2_hiera_large.pt",
sam2_model_cfg: str = "/mnt/workspace/junyuanxiao/code/AutoSeg-SAM2/sam2_configs/sam2_hiera_l.yaml",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The default value for sam2_model_cfg is a hardcoded absolute path to a specific user's workspace (/mnt/workspace/junyuanxiao/code/AutoSeg-SAM2/sam2_configs/sam2_hiera_l.yaml). This will cause the code to fail on any other machine. This should be changed to a relative path, a path configurable via an environment variable, or downloaded to a cache directory. The logic in _load_models to find this config file also contains several hardcoded paths and is brittle. A more robust way to handle model configurations and dependencies is needed.

Comment on lines +187 to +194
def infer_image(self, raw_image, input_size=518):
image, (h, w) = self.image2tensor(raw_image, input_size)

depth = self.forward(image)

depth = F.interpolate(depth[:, None], (h, w), mode="bilinear", align_corners=True)[0, 0]

return depth.cpu().numpy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

To fix the hardcoded device issue in image2tensor, this method should determine the model's device and pass it down.

Suggested change
def infer_image(self, raw_image, input_size=518):
image, (h, w) = self.image2tensor(raw_image, input_size)
depth = self.forward(image)
depth = F.interpolate(depth[:, None], (h, w), mode="bilinear", align_corners=True)[0, 0]
return depth.cpu().numpy()
def infer_image(self, raw_image, input_size=518):
device = next(self.parameters()).device
image, (h, w) = self.image2tensor(raw_image, input_size, device=device)
depth = self.forward(image)
depth = F.interpolate(depth[:, None], (h, w), mode="bilinear", align_corners=True)[0, 0]
return depth.cpu().numpy()

Comment on lines +167 to +182
from .depth_anything_v2.dpt import DepthAnythingV2

# Load model
depth_anything = DepthAnythingV2(**self.model_configs[self.encoder])
model_path = os.path.join(self.checkpoint_path, f"depth_anything_v2_{self.encoder}.pth")

if not os.path.exists(model_path):
raise FileNotFoundError(f"Model checkpoint not found at {model_path}")

depth_anything.load_state_dict(torch.load(model_path, map_location='cpu'))

device = "cuda" if self.use_cuda() else "cpu"
if rank is not None and self.use_cuda():
device = f"cuda:{rank}"

depth_anything = depth_anything.to(device).eval()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The model is loaded within the process_single method. This is inefficient as it will be reloaded for every sample processed by this worker if the result is not already cached. This can significantly slow down processing. The model should be loaded once per worker, for example in the __init__ method, and stored as an instance attribute (e.g., self.model). A helper method can be used for lazy loading on the first call if you want to avoid loading the model during initialization of all workers.

Comment on lines +196 to +221
def image2tensor(self, raw_image, input_size=518):
transform = Compose([
Resize(
width=input_size,
height=input_size,
resize_target=False,
keep_aspect_ratio=True,
ensure_multiple_of=14,
resize_method='lower_bound',
image_interpolation_method=cv2.INTER_CUBIC,
),
NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
PrepareForNet(),
])

h, w = raw_image.shape[:2]

image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB) / 255.0

image = transform({'image': image})['image']
image = torch.from_numpy(image).unsqueeze(0)

DEVICE = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
image = image.to(DEVICE)

return image, (h, w)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The image2tensor method hardcodes the device selection. This can cause device mismatches if the model is on a different device (e.g., a specific GPU like cuda:1). The device should be passed in as an argument from the caller. This makes the component more modular and prevents potential runtime errors.

    def image2tensor(self, raw_image, input_size=518, device='cpu'):        
        transform = Compose([
            Resize(
                width=input_size,
                height=input_size,
                resize_target=False,
                keep_aspect_ratio=True,
                ensure_multiple_of=14,
                resize_method='lower_bound',
                image_interpolation_method=cv2.INTER_CUBIC,
            ),
            NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
            PrepareForNet(),
        ])
        
        h, w = raw_image.shape[:2]
        
        image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB) / 255.0
        
        image = transform({'image': image})['image']
        image = torch.from_numpy(image).unsqueeze(0)
        
        image = image.to(device)
        
        return image, (h, w)

Comment on lines +96 to +103
if not os.path.exists(depth_anything_v2_repo_path):
import subprocess
subprocess.run([
"git",
"clone",
"https://github.com/DepthAnything/Depth-Anything-V2.git",
depth_anything_v2_repo_path,
], check=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This operator clones a git repository (Depth-Anything-V2) into a cache directory during initialization. This introduces several potential issues:

  1. Side Effects: It's an unexpected side effect for an operator to perform network operations and write to the filesystem outside of its designated output directories.
  2. Dependencies: It creates a dependency on git being installed and network access to GitHub.
  3. Fragility: The clone might fail due to network issues, permissions, or if the repository is moved or deleted.
    It would be more robust to either vendor the necessary code within this repository or treat it as a proper dependency that users must install beforehand, with clear instructions in the documentation.

Comment on lines +195 to +202
# w0, h0 = w0 + 0.1, h0 + 0.1

sqrt_N = math.sqrt(N)
sx, sy = float(w0) / sqrt_N, float(h0) / sqrt_N
patch_pos_embed = nn.functional.interpolate(
patch_pos_embed.reshape(1, int(sqrt_N), int(sqrt_N), dim).permute(0, 3, 1, 2),
scale_factor=(sx, sy),
# (int(w0), int(h0)), # to solve the upsampling shape issue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There are commented-out lines of code here that seem to be leftovers from development or debugging. They should be removed to improve code clarity.

)
elif self.training and self.sample_drop_ratio > 0.0:
x = x + self.drop_path1(attn_residual_func(x))
x = x + self.drop_path1(ffn_residual_func(x)) # FIXME: drop_path2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There's a FIXME comment here, and it seems correct. You are using self.drop_path1 for the FFN residual path, but it should probably be self.drop_path2. Using the same drop path instance might not be intended and could be a bug if drop probabilities are different (though here they are the same). It's better to use the correct one for clarity and correctness.

Suggested change
x = x + self.drop_path1(ffn_residual_func(x)) # FIXME: drop_path2
x = x + self.drop_path2(ffn_residual_func(x))

checkpoint_path: str = "./checkpoints",
encoder: str = "vitl",
input_size: int = 518,
frame_num: int = 10, # 改为 int 类型,默认值改为正整数
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment is in Chinese, while the rest of the docstring and code is in English. This is likely a leftover development note. For consistency and clarity, it should be removed as the docstring already explains the frame_num parameter.

Suggested change
frame_num: int = 10, # 改为 int 类型,默认值改为正整数
frame_num: int = 10,


# Add the repository path to sys.path
import sys
sys.path.append(auto_seg_sam2_repo_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Modifying sys.path at runtime is generally discouraged as it can lead to unpredictable behavior and makes dependency management difficult. It's better to structure the project so that such modifications are not necessary, for example by vendoring the dependency or installing it as a proper package.

Comment on lines +408 to +410
"""
准备符合 SAM2 要求的帧文件(创建临时目录并复制重命名文件)
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docstring for this method is in Chinese, which is inconsistent with the English used in the rest of the file. Please translate it for better maintainability.

Suggested change
"""
准备符合 SAM2 要求的帧文件创建临时目录并复制重命名文件
"""
"""
Prepare frame files in a format required by SAM2 (create a temporary
directory and copy/rename files).
"""

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dj:op issues/PRs about some specific OPs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants