DeepAnalyze is the first agentic LLM for autonomous data science.

ruc-datalab/DeepAnalyze

DeepAnalyze

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

Authors: Shaolei Zhang, Ju Fan*, Meihao Fan, Guoliang Li, Xiaoyong Du

Renmin University of China, Tsinghua University

DeepAnalyze is the first agentic LLM for autonomous data science. It can autonomously complete a wide range of data-centric tasks without human intervention, supporting:

  • 🛠 Entire data science pipeline: Automatically perform any data science tasks such as data preparation, analysis, modeling, visualization, and report generation.
  • 🔍 Open-ended data research: Conduct deep research on diverse data sources, including structured data (Databases, CSV, Excel), semi-structured data (JSON, XML, YAML), and unstructured data (TXT, Markdown), and finally produce analyst-grade research reports.
  • 📊 Fully open-source: The model, code, training data, and demo of DeepAnalyze are all open-sourced, allowing you to deploy or extend your own data analysis assistant.

🔥 News

  • [2025.10.28]: We welcome all contributions, including improvements to DeepAnalyze and shared use cases (see CONTRIBUTION.md). Authors of merged PRs will be listed as contributors.
  • [2025.10.27]: DeepAnalyze has attracted widespread attention, gaining 1K+ GitHub stars and 200K+ Twitter views within a week.
  • [2025.10.21]: DeepAnalyze's paper, code, model, and training data are released!

🖥 Demo

Upload your data, and DeepAnalyze can perform data-oriented deep research 🔍 and any data-centric task 🛠.

deepanalyze-8b.mp4

Tip

Clone this repository to deploy DeepAnalyze locally as your own data analyst and complete any data science task without predefined workflows or closed-source APIs.

🔥 The demo UI is an initial version. You are welcome to develop it further, and we will list you as a contributor.

  • Clone this repo and download DeepAnalyze-8B.
  • Run these scripts to launch the API and interface, and then interact through the browser (http://localhost:4000):
    cd demo/chat
    npm install
    cd ..
    bash start.sh
    
    # stop the api and interface
    bash stop.sh
  • If you want to deploy under a specific IP, please replace localhost with your IP address in ./demo/backend.py and ./demo/chat/lib/config.ts
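The host replacement in the last step can also be scripted. The following is a minimal sketch, assuming the literal string `localhost` appears in both files named above; `replace_host` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path

# Files named in the deployment step above, relative to the repository root.
CONFIG_FILES = ["demo/backend.py", "demo/chat/lib/config.ts"]


def replace_host(repo_root, new_host, old_host="localhost", files=CONFIG_FILES):
    """Replace every occurrence of old_host with new_host; return per-file counts."""
    counts = {}
    for rel in files:
        path = Path(repo_root) / rel
        text = path.read_text(encoding="utf-8")
        counts[rel] = text.count(old_host)
        if counts[rel]:
            # Rewrite the file only when there is something to change.
            path.write_text(text.replace(old_host, new_host), encoding="utf-8")
    return counts


# Example (192.0.2.10 is a documentation-only placeholder address):
# replace_host(".", "192.0.2.10")
```

Review the resulting diff before restarting the demo, since a blanket replacement also touches any `localhost` you may want to keep.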

🚀 Quick Start

Requirements

  • Install packages: torch==2.6.0, transformers==4.53.2, vllm==0.8.5
    conda create -n deepanalyze python=3.12 -y
    conda activate deepanalyze
    pip install -r requirements.txt
    
    # For training
    (cd ./deepanalyze/ms-swift/ && pip install -e .)
    (cd ./deepanalyze/SkyRL/ && pip install -e .)
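After installation, it can help to confirm that the environment matches the pinned versions listed above. The sketch below uses only the standard library; `find_mismatches` is a hypothetical helper, and ignoring local build suffixes such as `+cu124` is an assumption about how you want to compare versions.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Requirements list above.
PINNED = {"torch": "2.6.0", "transformers": "4.53.2", "vllm": "0.8.5"}


def find_mismatches(pinned=PINNED, get_version=version):
    """Return {package: (expected, installed)} for packages that deviate from the pins."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        # Ignore local build suffixes such as "+cu124" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Example: print(find_mismatches()) shows an empty dict when everything matches.
```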

Command Interaction

  • Deploy DeepAnalyze-8B via vllm: vllm serve DeepAnalyze-8B

  • Run the following script for any data science task:

    • You can specify any data science task, including specific data tasks and open-ended data research.
    • You can specify any number of data sources, and DeepAnalyze will automatically explore them.
    • You can specify any type of data source, e.g., structured data (Databases, CSV, Excel), semi-structured data (JSON, XML, YAML), and unstructured data (TXT, Markdown).
    from deepanalyze import DeepAnalyzeVLLM
    
    prompt = """# Instruction
    Generate a data science report.
    
    # Data
    File 1: {"name": "bool.xlsx", "size": "4.8KB"}
    File 2: {"name": "person.csv", "size": "10.6KB"}
    File 3: {"name": "disabled.xlsx", "size": "5.6KB"}
    File 4: {"name": "enlist.csv", "size": "6.7KB"}
    File 5: {"name": "filed_for_bankrupcy.csv", "size": "1.0KB"}
    File 6: {"name": "longest_absense_from_school.xlsx", "size": "16.0KB"}
    File 7: {"name": "male.xlsx", "size": "8.8KB"}
    File 8: {"name": "no_payment_due.xlsx", "size": "15.6KB"}
    File 9: {"name": "unemployed.xlsx", "size": "5.6KB"}
    File 10: {"name": "enrolled.csv", "size": "20.4KB"}"""
    
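    # Directory that contains the data files listed in the prompt (replace with your own path)
    workspace = "/home/u2023000922/zhangshaolei/deepanalyze_public/DeepAnalyze/example/analysis_on_student_loan/"
    
    # Path to the DeepAnalyze-8B checkpoint (replace with your own path)
    deepanalyze = DeepAnalyzeVLLM(
        "/fs/fast/u2023000922/zhangshaolei/checkpoints/deepanalyze-8b/"
    )
    # Run the task and print the model's output
    answer = deepanalyze.generate(prompt, workspace=workspace)
    print(answer["reasoning"])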

    You should get a deep research report, which can be rendered as a PDF:

    # Comprehensive Analysis of Student Enrollment Patterns and Institutional Transfers
    
    ## Introduction and Research Context
    
    The analysis of student enrollment patterns represents a critical area of educational research with significant implications for institutional planning, resource allocation, and student support services. This comprehensive study examines a comprehensive dataset encompassing 1,194 enrollment records across six educational institutions, merged with supplementary demographic, financial, and employment status data. The research employs advanced analytical techniques including network analysis, predictive modeling, and temporal pattern recognition to uncover both macro-level institutional trends and micro-level student mobility patterns. The dataset's longitudinal nature, spanning fifteen months of enrollment records, provides unique insights into the complex dynamics of student pathways through higher education systems.
    
    Our methodological approach combines quantitative analysis of enrollment durations, transfer probabilities, and financial indicators with qualitative ...
    
    The research contributes to the growing body of literature on student mobility by providing empirical evidence of institutional transfer networks and their relationship to student outcomes...
    .....
    

    For more examples and task completion details, please refer to DeepAnalyze's homepage.
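The `# Data` section of the prompt in the example above can be generated from the workspace folder instead of being typed by hand. This is a sketch only: `build_prompt` is a hypothetical helper, the alphabetical file order is a choice made here, and the size format (bytes divided by 1024, one decimal place) is an assumption inferred from the example values.

```python
import json
from pathlib import Path


def build_prompt(instruction, workspace):
    """Build a prompt in the '# Instruction / # Data' layout of the example above."""
    lines = ["# Instruction", instruction, "", "# Data"]
    # List regular files only, in alphabetical order.
    files = sorted(p for p in Path(workspace).iterdir() if p.is_file())
    for i, path in enumerate(files, start=1):
        # Assumption: sizes are reported in KB (bytes / 1024) with one decimal place.
        size_kb = path.stat().st_size / 1024
        entry = {"name": path.name, "size": f"{size_kb:.1f}KB"}
        lines.append(f"File {i}: {json.dumps(entry, ensure_ascii=False)}")
    return "\n".join(lines)


# Example: prompt = build_prompt("Generate a data science report.", workspace)
```

The returned string can be passed as `prompt` to `deepanalyze.generate(prompt, workspace=workspace)` as in the example.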

API

  • You can build an OpenAI-style API using this script (remember to change MODEL_PATH = "DeepAnalyze-8B" in demo/backend.py to your vLLM model name):

    python demo/backend.py
    
  • API usage (streaming response):

    curl -X POST http://localhost:8200/chat/completions \
         -H "Content-Type: application/json" \
         -d '{
               "messages": [
                 {
                   "role": "user",
                   "content": "Generate a data science report."
                 }
               ],
               "workspace": "example/student_loan/"
             }'
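The same request can be sent from Python with the standard library. The sketch below mirrors the URL, header, and JSON body of the curl command above; `build_request` and `stream_lines` are hypothetical helper names, and because the exact format of the streamed chunks is not documented here, the lines are yielded as received rather than parsed.

```python
import json
import urllib.request

API_URL = "http://localhost:8200/chat/completions"  # endpoint from the curl example


def build_request(content, workspace, url=API_URL):
    """Create the same POST request as the curl command above."""
    payload = {
        "messages": [{"role": "user", "content": content}],
        "workspace": workspace,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_lines(request):
    """Yield non-empty response lines as the server sends them (no parsing)."""
    with urllib.request.urlopen(request) as response:
        for raw in response:
            line = raw.decode("utf-8").rstrip("\r\n")
            if line:
                yield line


# Example (requires demo/backend.py to be running):
# for line in stream_lines(build_request("Generate a data science report.", "example/student_loan/")):
#     print(line)
```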
    

🎈 Develop Your Own DeepAnalyze

1. Download Model and Training Data

  • Download DeepSeek-R1-0528-Qwen3-8B, or fine-tune directly from DeepAnalyze-8B.

    • If you use DeepSeek-R1-0528-Qwen3-8B as the base model, first add the special tokens:

      MODEL_PATH=path_to_DeepSeek-R1-0528-Qwen3-8B
      SAVE_PATH=path_to_save_DeepSeek-R1-0528-Qwen3-8B-addvocab
      
      python deepanalyze/add_vocab.py \
        --model_path "$MODEL_PATH" \
        --save_path "$SAVE_PATH" \
        --add_tags
  • Download training data DataScience-Instruct-500K.

    • unzip DataScience-Instruct-500K/RL/data.zip

2. Curriculum-based Agentic Training

3. Evaluation

  • We have unified the evaluation of most existing data science benchmarks using vLLM (more are being added continuously). Follow the instructions in ./playground to quickly evaluate DeepAnalyze or your own agent.

👏 Contribution

We welcome all forms of contribution, and authors of merged PRs will be listed as contributors.

Contribution on Code and Model

  • We welcome all forms of contribution to DeepAnalyze's code, model, and UI, such as Docker packaging, DeepAnalyze model conversion and quantization, and DeepAnalyze workflows based on closed-source LLMs.
  • You can submit a pull request directly.

Contribution on Case Study

  • We also especially encourage you to share your use cases and feedback when using DeepAnalyze; these are extremely valuable for helping us improve DeepAnalyze.
  • You can place your use cases in a new folder under ./example/. We recommend following the folder structure of ./example/analysis_on_student_loan/, which includes three parts:
    • data/: stores the uploaded files
    • prompt.txt: input instructions
    • README.md: documentation. We suggest including the input, DeepAnalyze's output, outputs from other closed-source LLMs (optional), and your evaluation/comments of the case.
  • DeepAnalyze only has 8B parameters, so we also welcome examples where DeepAnalyze performs slightly worse than the closed-source LLMs; this will help us improve DeepAnalyze.
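A new case folder with the three-part layout described above can be scaffolded with a few lines of Python. This is a minimal sketch: `scaffold_case` is a hypothetical helper, and the README section titles are placeholders derived from the suggested contents, not a required template.

```python
from pathlib import Path


def scaffold_case(examples_dir, case_name, prompt):
    """Create <examples_dir>/<case_name>/ with data/, prompt.txt and README.md."""
    case_dir = Path(examples_dir) / case_name
    # Fails with FileExistsError if this case's data/ folder already exists.
    (case_dir / "data").mkdir(parents=True, exist_ok=False)
    (case_dir / "prompt.txt").write_text(prompt + "\n", encoding="utf-8")
    # Placeholder sections (assumed titles) for the suggested README contents.
    readme = "\n".join([
        f"# {case_name}",
        "",
        "## Input",
        "",
        "## DeepAnalyze's output",
        "",
        "## Outputs from other LLMs (optional)",
        "",
        "## Evaluation and comments",
        "",
    ])
    (case_dir / "README.md").write_text(readme, encoding="utf-8")
    return case_dir


# Example: scaffold_case("example", "analysis_on_my_data", "Generate a data science report.")
```

Copy your uploaded files into the new `data/` folder, then fill in the README sections before opening a pull request.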

🤝 Acknowledgement

🖋 Citation

If this repository is useful for you, please cite as:

@misc{deepanalyze,
      title={DeepAnalyze: Agentic Large Language Models for Autonomous Data Science}, 
      author={Shaolei Zhang and Ju Fan and Meihao Fan and Guoliang Li and Xiaoyong Du},
      year={2025},
      eprint={2510.16872},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.16872}, 
}

If you have any questions, please feel free to submit an issue or contact zhangshaolei98@ruc.edu.cn.

🌟 Misc

You are welcome to join the DeepAnalyze WeChat group to chat and share ideas with others!

If you like DeepAnalyze, give it a GitHub Star ⭐.