A multi-agent data processing system built on AgentScope and Data-Juicer (DJ). This project demonstrates how to leverage the natural language understanding capabilities of large language models, enabling non-expert users to easily harness the powerful data processing capabilities of Data-Juicer.
In practice, data processing for large-model R&D and applications remains costly, inefficient, and hard to reproduce. Many teams spend more time on data analysis, cleaning, and synthesis than on model training, requirement alignment, and application development.
We hope to use agent technology to free developers from tedious script assembly, bringing data R&D closer to a "think and get" experience.
Data directly sets the upper limit of model capability. Model performance depends on several dimensions of the data: quality, diversity, harmfulness control, and task matching. Optimizing data is therefore, in essence, optimizing the model itself, and doing so efficiently requires a systematic toolset.
DataJuicer Agents is designed to support the new paradigm of data-model co-optimization as an intelligent collaboration system.
Data-Juicer (DJ) is an open-source processing system covering the full lifecycle of large model data, providing four core capabilities:
- Full-Stack Operator Library (DJ-OP): Nearly 200 high-performance, reusable multimodal operators covering text, images, and audio/video
- High-Performance Engine (DJ-Core): Built on Ray, supporting TB-level data, 10K-core distributed computing, with operator fusion and multi-granularity fault tolerance
- Collaborative Development Platform (DJ-Sandbox): Introduces A/B Test and Scaling Law concepts, using small-scale experiments to drive large-scale optimization
- Natural Language Interaction Layer (DJ-Agents): Enables developers to build data pipelines through conversational interfaces using Agent technology
DataJuicer Agents is not a simple Q&A bot, but an intelligent collaborator for data processing. Specifically, it can:
- Intelligent Query: Automatically match the most suitable operators to a natural language description, pinpointing them among nearly 200 operators
- Automated Pipeline: Generate Data-Juicer YAML configurations from a description of the data processing needs, then execute them
- Custom Extension: Help users develop custom operators and seamlessly integrate them into local environments
Our goal: Let developers focus on "what to do" rather than "how to do it".
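To make the Automated Pipeline capability concrete, here is a minimal sketch of the kind of YAML recipe an agent might generate. The helper `render_recipe` and the file paths are hypothetical and exist only for this illustration. The keys `dataset_path`, `export_path`, and `process`, and the operators `language_id_score_filter` and `text_length_filter`, follow Data-Juicer's recipe conventions as we understand them, so check names and parameters against your installed version.

```python
from typing import Any


def render_recipe(dataset_path: str, export_path: str,
                  ops: list[tuple[str, dict[str, Any]]]) -> str:
    """Render a minimal Data-Juicer-style recipe as YAML text (illustrative helper)."""
    lines = [
        f"dataset_path: {dataset_path!r}",
        f"export_path: {export_path!r}",
        "process:",
    ]
    for name, params in ops:
        lines.append(f"  - {name}:")
        for key, value in params.items():
            # repr() quotes strings and leaves numbers bare, which is enough
            # for the simple scalar parameters used in this sketch.
            lines.append(f"      {key}: {value!r}")
    return "\n".join(lines) + "\n"


# Request: "keep English text between 10 and 10000 characters".
recipe = render_recipe(
    "./data/demo.jsonl",               # hypothetical input path
    "./outputs/demo-processed.jsonl",  # hypothetical output path
    [
        ("language_id_score_filter", {"lang": "en", "min_score": 0.8}),
        ("text_length_filter", {"min_len": 10, "max_len": 10000}),
    ],
)
print(recipe)
```

In a real session the agent would write this text to a file and pass it to `dj-process`. The sketch stops at producing the text.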
DataJuicer Agents adopts a multi-agent routing architecture, which is key to system scalability. When a user inputs a natural language request, the Router Agent first performs task triage to determine whether it's a standard data processing task or a custom requirement that needs new capabilities.
User Query
  ↓
Router Agent (Filtering & Decision) ← query_dj_operators (operator retrieval)
  │
  ├─ High-match operator found
  │   ↓
  │   DJ Agent (Standard Data Processing Task)
  │   ├── Preview data samples (confirm field names and data formats)
  │   ├── get_ops_signature (retrieve full parameter signatures)
  │   ├── Generate YAML configuration
  │   └── execute_safe_command (run dj-process, dj-analyze)
  │
  └─ No high-match operator found
      ↓
      Dev Agent (Custom Operator Development & Integration)
      ├── get_basic_files (retrieve base classes and registration mechanism)
      ├── get_operator_example (retrieve similar operator examples)
      ├── Generate compliant operator code
      └── Local integration (register to user-specified path)
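The triage step in this flow can be sketched in a few lines. This is not the project's actual router: the `OperatorMatch` type, the `route` function, and the 0.75 threshold are assumptions made for illustration, since the source does not say how a "high-match" operator is scored.

```python
from dataclasses import dataclass


@dataclass
class OperatorMatch:
    name: str
    score: float  # similarity in [0, 1]; higher means a better match


# Hypothetical cutoff; the source does not define what counts as "high-match".
HIGH_MATCH_THRESHOLD = 0.75


def route(matches: list[OperatorMatch],
          threshold: float = HIGH_MATCH_THRESHOLD) -> str:
    """Pick the agent that should handle a request, given retrieval results."""
    best = max((m.score for m in matches), default=0.0)
    # A strong match means an existing operator covers the task (DJ Agent);
    # otherwise the request goes to custom operator development (Dev Agent).
    return "dj_agent" if best >= threshold else "dev_agent"


print(route([OperatorMatch("text_length_filter", 0.91)]))  # dj_agent
print(route([OperatorMatch("text_length_filter", 0.32)]))  # dev_agent
print(route([]))                                           # dev_agent
```

An empty retrieval result falls through to the Dev Agent, which mirrors the "No high-match operator found" branch.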
Agent integration with DataJuicer has two modes to adapt to different usage scenarios:
- Tool Binding Mode: Agent calls DataJuicer command-line tools (such as `dj-analyze`, `dj-process`), compatible with existing user habits, low migration cost
- MCP Binding Mode: Agent directly calls DataJuicer's MCP (Model Context Protocol) interface, no need to generate intermediate YAML files, directly run operators or data recipes, better performance
The Agent selects between these two modes automatically, based on task complexity and performance requirements, to balance flexibility and efficiency.
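For Tool Binding Mode, one plausible safeguard is to validate a command against an allowlist before running it. The sketch below is an assumption about how such a check could look, not the project's `execute_safe_command` implementation. The function name `build_safe_command` and the recipe path are hypothetical.

```python
import shlex

# Only the two Data-Juicer CLI entry points named in this document are allowed.
ALLOWED_COMMANDS = {"dj-process", "dj-analyze"}


def build_safe_command(command_line: str) -> list[str]:
    """Split a command line into argv and reject programs outside the allowlist."""
    argv = shlex.split(command_line)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {command_line!r}")
    return argv


argv = build_safe_command("dj-process --config ./recipe.yaml")
print(argv)  # ['dj-process', '--config', './recipe.yaml']
```

The returned list can be handed to `subprocess.run(argv, check=True)`. Running the argument list without a shell means characters such as `;` are passed as plain arguments and are not interpreted as command separators.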
The Data-Juicer agent ecosystem is rapidly expanding. Here are the new agents currently in development or planned:
The Q&A agent provides users with detailed answers about Data-Juicer operators, concepts, and best practices.
The Q&A agent can currently be viewed and tried out here.
We are building a more advanced human-machine collaborative data optimization workflow that introduces human feedback:
- Users can view statistics, attribution analysis, and visualization results
- Dynamically edit recipes, approve or reject suggestions
- Underpinned by `dj.analyzer` (data analysis), `dj.attributor` (effect attribution), and `dj.sandbox` (experiment management)
- Supports closed-loop optimization based on validation tasks
This interactive recipe can currently be viewed and tried out here.
- Data Processing Agent Benchmarking: Quantify the performance of different Agents in terms of accuracy, efficiency, and robustness
- Data "Health Check Report" & Data Intelligent Recommendation: Automatically diagnose data problems and recommend optimization solutions
- Router Agent Enhancement: More seamless, e.g., when operators are lacking → Code Development Agent → Data Processing Agent
- MCP Further Optimization: Embedded LLM, users can directly use MCP connected to their local environment (e.g., IDE) to get an experience similar to current data processing agents
- Knowledge Base and RAG-oriented Data Agents
- Better Automatic Processing Solution Generation: Less token usage, more efficient, higher quality processing results
- Data Workflow Template Reuse and Automatic Tuning: Based on DataJuicer community data recipes
- ......
Q: How do I get a DashScope API key? A: Visit the official DashScope website, register an account, and apply for an API key.
Q: Why does operator retrieval fail? A: Check your network connection and API key configuration, or try switching to vector retrieval mode.
Q: How do I debug custom operators? A: Make sure the Data-Juicer path is configured correctly, and check the example code provided by the code development agent.
Q: What should I do if the MCP service connection fails? A: Check that the MCP server is running and that the URL in the configuration file is correct.
Q: What causes the error `requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:3000/trpc/pushMessage`? A: Agents handle data through file references (paths) rather than direct uploads. Check whether any non-text files were submitted.
- For large-scale data processing, it is recommended to use DataJuicer's distributed mode
- Set batch size appropriately to balance memory usage and processing speed
- For more advanced data processing features (synthesis, Data-Model Co-Development), please refer to DataJuicer documentation
- DataJuicer has been used by a large number of Tongyi and Alibaba Cloud internal and external users, and has facilitated many research works. All code is continuously maintained and enhanced.
You are welcome to visit us on GitHub, star or fork the repositories, submit issues, and join the community!
- Project Repositories:
Contributing: Issues and pull requests that improve AgentScope, DataJuicer Agents, and DataJuicer are welcome. If you run into problems or have feature suggestions, please feel free to contact us.