Growing up in India, I often saw plastic bottles, cans, wrappers, and other recyclables littering the streets. It always felt like a solvable problem: if only technology could lend a hand (literally).
This project was born from that simple idea:
What if a robotic arm could autonomously identify and pick up recyclables, cleaning our environment, one object at a time?
lang2pick is a step toward that: it pairs an open-source arm (the SO-101) with natural language understanding, vision-language-action models, and motion planning to enable real-world pick-and-place tasks.
SO-101 ROS2 is an experiment in building general-purpose robotic manipulators around the SO-101 robotic arm. It enables natural-language-driven pick-and-place operations via a complete software stack:
Example Command:
“Pick up all recyclables and place them in the blue recycling bin”
The system bridges the full pipeline:
Language → Perception → Action Planning → Hardware Execution
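As a rough illustration (not the actual lang2pick API), the sketch below shows how a single command could flow through that pipeline. Every function and type here is a placeholder stub, defined inline so the data flow is visible end to end:

```python
from dataclasses import dataclass

# Illustrative placeholder types -- not the actual lang2pick API.
@dataclass
class Detection:
    label: str       # e.g. "plastic_bottle"
    pose_xyz: tuple  # target position in the robot base frame (m)

@dataclass
class PickPlaceAction:
    target: Detection
    place_location: str  # e.g. "blue_recycling_bin"

def run_vla(command: str, rgbd_frame) -> PickPlaceAction:
    """Stub for the Vision-Language-Action model: grounds the command
    in the camera frame and returns the object + action to perform."""
    return PickPlaceAction(Detection("plastic_bottle", (0.32, -0.05, 0.02)),
                           place_location="blue_recycling_bin")

def plan_and_execute(action: PickPlaceAction) -> bool:
    """Stub for the MoveIt 2 + ros2_control side: plan a collision-free
    trajectory to the target and stream it to the SO-101 arm."""
    print(f"Picking {action.target.label} at {action.target.pose_xyz}, "
          f"placing in {action.place_location}")
    return True

if __name__ == "__main__":
    command = "Pick up all recyclables and place them in the blue recycling bin"
    rgbd_frame = None  # would come from the RGB-D camera driver
    plan_and_execute(run_vla(command, rgbd_frame))
```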
Provide developers with a plug-and-play platform to:
- Fine-tune Vision-Language-Action (VLA) models
- Control any ROS2-compatible robotic arm via ros2_control (see the launch sketch after this list)
- Perform robust pick-and-place tasks in simulation and reality (sim-to-real)
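On the ros2_control side, a hedged sketch of what a ROS 2 Python launch file could look like: it spawns controllers through the controller_manager spawner, assuming the arm's hardware interface is already loaded. The controller names (arm_controller, gripper_controller) are assumptions, not necessarily what this repository uses:

```python
# Hedged launch sketch: spawn ros2_control controllers for the arm.
# Controller names ("arm_controller", "gripper_controller") are assumptions.
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    # Publishes /joint_states from the hardware interface.
    spawn_broadcaster = Node(
        package="controller_manager",
        executable="spawner",
        arguments=["joint_state_broadcaster"],
    )
    # Joint trajectory controller that MoveIt 2 sends trajectories to.
    spawn_arm = Node(
        package="controller_manager",
        executable="spawner",
        arguments=["arm_controller", "--controller-manager", "/controller_manager"],
    )
    spawn_gripper = Node(
        package="controller_manager",
        executable="spawner",
        arguments=["gripper_controller", "--controller-manager", "/controller_manager"],
    )
    return LaunchDescription([spawn_broadcaster, spawn_arm, spawn_gripper])
```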
```mermaid
%%{init: {'theme': 'neutral', 'themeVariables': {
'primaryColor': '#ffffff',
'edgeLabelBackground':'#ffffff',
'fontSize': '14px'
}}}%%
graph TD
A["Natural Language Command"]
F["RGB-D Camera (Perception)"]
B{"Vision Language Action Model"}
C["MoveIt 2 Motion Planner"]
D["ros2_control interface"]
E["SO-101 Arm + Grippers"]
A --> B
F --> B
B -->|"Target Object & Action Tokens"| C
C -->|"Optimized Joint Trajectories"| D
D --> E
%% Styling (consistent look)
style A fill:#e1f5fe,stroke:#333,stroke-width:1px
style B fill:#ffccbc,stroke:#333,stroke-width:1px
style C fill:#fff3e0,stroke:#333,stroke-width:1px
style D fill:#e0f7fa,stroke:#333,stroke-width:1px
style E fill:#c8e6c9,stroke:#333,stroke-width:1px
style F fill:#fce4ec,stroke:#333,stroke-width:1px
```
- Hardware interface for the SO-101 arm
- Connect with MoveIt 2 planner
- Write a modular Python framework for VLM object detection
- Implement a gRPC server to send perception commands to the robot
- Create a ROS 2 ↔ gRPC bridge (sketched after this list)
- Stream the world-frame video using WebRTC
- Build front-end to interact with VLM and display current picking status
- Automate deployment to the cloud
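A hedged sketch of the ROS 2 ↔ gRPC bridge idea: a gRPC servicer receives a perception command and republishes it on a ROS 2 topic. The perception_pb2 / perception_pb2_grpc modules would be generated from a hypothetical perception.proto; the service, message, and topic names are assumptions rather than this repo's real interface:

```python
# Hedged sketch of a ROS 2 <-> gRPC bridge. perception_pb2 / perception_pb2_grpc
# would be generated from a hypothetical perception.proto; service, message,
# and topic names are assumptions, not this repository's actual API.
from concurrent import futures

import grpc
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

import perception_pb2
import perception_pb2_grpc


class BridgeNode(Node):
    """ROS 2 node that republishes commands received over gRPC."""

    def __init__(self):
        super().__init__("grpc_bridge")
        self.pub = self.create_publisher(String, "/lang2pick/command", 10)

    def forward(self, text: str):
        msg = String()
        msg.data = text
        self.pub.publish(msg)


class PerceptionServicer(perception_pb2_grpc.PerceptionServicer):
    def __init__(self, node: BridgeNode):
        self.node = node

    def SendCommand(self, request, context):
        # e.g. request.text == "pick up the plastic bottle"
        self.node.forward(request.text)
        return perception_pb2.CommandAck(accepted=True)


def main():
    rclpy.init()
    node = BridgeNode()
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    perception_pb2_grpc.add_PerceptionServicer_to_server(PerceptionServicer(node), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    try:
        rclpy.spin(node)  # gRPC serves requests on its own thread pool
    finally:
        server.stop(grace=None)
        rclpy.shutdown()


if __name__ == "__main__":
    main()
```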
| Directory | Description |
|---|---|
| ros2_ws/ | ROS2 workspace containing the robot description, MoveIt2 configuration, controller setup, hardware interface nodes, and simulation |
| vla/ | Vision-Language(-Action) module — converts VLA outputs (object/action tokens) into ROS2 commands for MoveIt2 |
| scripts/ | Training and fine-tuning pipeline for the Vision-Language model (using PyTorch and LeRobot) |
| docs/ | Documentation, diagrams, and setup guides for developers and contributors |
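As a rough illustration of the vla/ module's conversion step, the sketch below turns a VLA-predicted grasp position into a geometry_msgs/PoseStamped goal that a MoveIt 2 pick node could consume. The topic name and frame are assumptions, not the repository's actual interface:

```python
# Rough illustration of converting a VLA-predicted grasp position into a
# PoseStamped goal for the planner. Topic and frame names are assumptions.
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped


class VlaToGoal(Node):
    def __init__(self):
        super().__init__("vla_to_goal")
        self.pub = self.create_publisher(PoseStamped, "/lang2pick/grasp_goal", 10)

    def publish_goal(self, xyz, frame_id="base_link"):
        goal = PoseStamped()
        goal.header.frame_id = frame_id
        goal.header.stamp = self.get_clock().now().to_msg()
        goal.pose.position.x, goal.pose.position.y, goal.pose.position.z = xyz
        goal.pose.orientation.w = 1.0  # grasp orientation omitted for brevity
        self.pub.publish(goal)


def main():
    rclpy.init()
    node = VlaToGoal()
    node.publish_goal((0.32, -0.05, 0.02))  # e.g. from the VLA detection
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```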
- ROS2 Humble — Core robotics framework
- MoveIt2 — Inverse kinematics and motion planning
- PyTorch + LeRobot — Vision-Language training & fine-tuning (see the skeleton after this list)
- Gazebo / MuJoCo Sim — Physics simulation and visualization
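For orientation, a minimal fine-tuning skeleton in plain PyTorch (no LeRobot specifics here; the dataset, policy head, and hyperparameters are placeholders, not the repo's actual training setup):

```python
# Minimal fine-tuning skeleton in plain PyTorch. Dataset, model, and
# hyperparameters are placeholders standing in for the real VLA pipeline.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: (image features, action targets) pairs.
features = torch.randn(64, 512)
actions = torch.randn(64, 7)  # e.g. 6-DoF end-effector pose + gripper
loader = DataLoader(TensorDataset(features, actions), batch_size=8, shuffle=True)

# Placeholder policy head standing in for the fine-tuned VLA layers.
policy_head = torch.nn.Sequential(
    torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 7)
)
optimizer = torch.optim.AdamW(policy_head.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(policy_head(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```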
Contributions are welcome! Whether you want to help with ROS2 development, dataset collection, or model training — feel free to open an issue or a PR.
This project is open-source and licensed under the Apache License.
This project builds on the shoulders of open-source giants —
MoveIt2, ROS2, PyTorch, LeRobot, and the amazing open-source robotics community.


