
feat: Edge deployment optimization with GPU-native BEV, modern CUDA ops, and gradient accumulation #12

Open
fthbng77 wants to merge 5 commits into TRV-Lab:master from fthbng77:master

Conversation

@fthbng77

Summary

This PR introduces several improvements aimed at edge deployment optimization
and modern PyTorch compatibility:

  • GPU-native GridDensityBEV module: Replaces CQCA_cfa's CPU-based DBSCAN
    clustering with a GPU-native grid density approach for edge deployment scenarios
  • Modern CUDA ops compatibility: Replaces deprecated THC/THC.h headers with
    ATen/cuda/CUDAContext.h across all CUDA extension files, removing obsolete
    extern THCState declarations for PyTorch 2.x support
  • Gradient accumulation support: Enables training on memory-constrained GPUs
    (8GB VRAM) by reducing batch size to 1 with 4 accumulation steps
  • Cleanup: Removes pre-compiled .so binaries from tracking (should be built
    from source per environment)
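The grid-density idea behind GridDensityBEV can be sketched as follows: instead of running DBSCAN on the CPU, points are binned into a 2D BEV grid entirely with GPU tensor ops, and cells above a point-count threshold are kept as "dense". This is an illustrative sketch only; the function name, grid ranges, and threshold here are assumptions, not the PR's actual API.

```python
import torch

def grid_density_bev(points, grid_size=0.5, x_range=(0.0, 51.2),
                     y_range=(-25.6, 25.6), min_points=3):
    """Hypothetical grid-density BEV occupancy sketch.

    Bins points of shape (N, 3+) into a BEV grid and keeps cells whose
    point count reaches `min_points` -- a GPU-friendly stand-in for
    CPU-based DBSCAN clustering. All defaults are illustrative.
    """
    nx = int((x_range[1] - x_range[0]) / grid_size)
    ny = int((y_range[1] - y_range[0]) / grid_size)

    # Map each point's (x, y) to integer grid indices, clamped to bounds.
    ix = ((points[:, 0] - x_range[0]) / grid_size).long().clamp(0, nx - 1)
    iy = ((points[:, 1] - y_range[0]) / grid_size).long().clamp(0, ny - 1)

    # Count points per cell with a single scatter-add (no CPU round-trip).
    density = torch.zeros(nx * ny, dtype=torch.long, device=points.device)
    density.scatter_add_(0, ix * ny + iy, torch.ones_like(ix))

    return (density >= min_points).view(nx, ny)  # boolean dense-cell mask
```

Because `scatter_add_` runs on whatever device the points live on, the same code path works on CUDA tensors with no host synchronization, which is the property that matters for edge deployment.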

Changes

  • pcdet/models/backbones_image/CQCA_cfa.py — New GridDensityBEV module
  • pcdet/ops/**/src/*.cpp — Updated CUDA extension headers for PyTorch 2.x
  • tools/train_utils/train_utils.py — Gradient accumulation in training loop
  • tools/cfgs/MAFF-Net/MAFF-Net_vod.yaml — Config updates for edge deployment
  • Removed all pre-compiled .so files from version control
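The gradient-accumulation change in `train_one_epoch` follows the standard pattern: scale each micro-batch loss by the accumulation count, call `backward()` to accumulate into `.grad`, and only step the optimizer every `ACCUMULATION_STEPS` iterations. The sketch below uses a toy model; the loop structure (not the PR's exact code) is what it illustrates.

```python
import torch
import torch.nn as nn

ACCUMULATION_STEPS = 4  # BATCH_SIZE_PER_GPU=1 x 4 steps ~ effective batch of 4

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Toy stand-in for a dataloader yielding micro-batches of size 1.
data = [(torch.randn(1, 8), torch.randn(1, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    # Scale so the accumulated sum matches the mean over the effective batch.
    loss = loss_fn(model(x), y) / ACCUMULATION_STEPS
    loss.backward()  # gradients accumulate in param.grad between steps
    if (step + 1) % ACCUMULATION_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by the step count keeps the gradient magnitude (and hence the effective learning rate) consistent with what a true batch of 4 would produce.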

Motivation

  • Enable MAFF-Net deployment on edge devices with limited GPU resources
  • Fix build failures with PyTorch >= 2.0 due to deprecated THC headers
  • Allow training on consumer GPUs (e.g., RTX 4060 8GB) via gradient accumulation

fthbng77 and others added 5 commits February 15, 2026 14:36
- Replace deprecated THC/THC.h with ATen/cuda/CUDAContext.h in all CUDA extension cpp files
- Remove obsolete extern THCState declarations
- Update VoD dataset paths to match local directory structure
- Add .gitignore for build artifacts, data, and IDE files
- Add anchor visualization tool

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reduce BATCH_SIZE_PER_GPU to 1 with ACCUMULATION_STEPS=4 for 8GB VRAM
- Implement gradient accumulation in train_one_epoch
- Remove compiled .so files from git (already in .gitignore)
…fa's CPU-based DBSCAN for edge optimization, updating the image backbone configuration and documentation.
…ataset, leveraging GridDensityBEV, AMP, and a dedicated configuration.
