# LeCoNav: Legacy Code Navigator


LeCoNav (Legacy Code Navigator) is an AI-powered platform designed to help developers understand, document, and interact with large, complex, and legacy codebases. By leveraging a Retrieval-Augmented Generation (RAG) pipeline, LeCoNav allows you to "chat" with your source code, ask complex questions, and receive context-aware answers, drastically reducing the time spent on code archeology.


## 🎯 The Core Challenge: Semantic Code Understanding

LLMs are powerful, but they are not inherently experts in software architecture. When analyzing source code, simply treating it as a continuous stream of text leads to catastrophic context fragmentation. A naive approach, like splitting a file every N characters, will inevitably break the semantic integrity of the code.

| ❌ Naive Chunking (`RecursiveCharacterTextSplitter`) | ✅ Intelligent Chunking (Our Approach) |
| :--- | :--- |
| ❌ Breaks functions and classes mid-definition. | ✅ Preserves the integrity of logical code units (functions, classes, methods). |
| ❌ Separates comments and docstrings from their code. | ✅ Associates documentation and comments with their corresponding code block. |
| ❌ Creates chunks with little to no semantic meaning. | ✅ Creates chunks that represent a single, complete semantic concept. |
| ❌ Lacks crucial context (e.g., imports, parent class). | ✅ Enriches chunks with vital metadata: language, unit type, name, and line numbers. |
| **Result: Low-quality retrieval, incorrect answers.** | **Result: High-quality, precise context retrieval for accurate LLM responses.** |

Our core mission is to replace this naive method with a **syntax-aware parsing strategy**, ensuring each piece of vectorized code represents a complete, logical unit.
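To make the idea concrete, here is a minimal sketch of syntax-aware chunking using only Python's standard-library `ast` module. The real pipeline is planned around LangChain's language-aware splitters (see the roadmap); the function and sample names here are illustrative, not the project's actual API:

```python
import ast
import textwrap

def chunk_python_source(source: str, language: str = "python") -> list[dict]:
    """Split Python source into one chunk per top-level function or class,
    enriched with the metadata the retrieval step relies on."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "content": ast.get_source_segment(source, node),
                "language": language,
                "unit_type": type(node).__name__,  # e.g. "FunctionDef", "ClassDef"
                "unit_name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
    return chunks

sample = textwrap.dedent('''
    def greet(name):
        """Return a greeting."""
        return f"Hello, {name}!"

    class Greeter:
        def hello(self):
            return greet("world")
''')

for chunk in chunk_python_source(sample):
    print(chunk["unit_type"], chunk["unit_name"], chunk["start_line"], chunk["end_line"])
```

Note that each chunk keeps its docstring attached and carries enough metadata to cite an exact file location back to the user.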

## 🏛️ System Architecture

LeCoNav is built on a modern, scalable microservices architecture designed for asynchronous processing of large codebases.

```mermaid
graph TD
    subgraph User Interaction
        A[Developer] -->|1. Upload .zip| B(FastAPI Web API)
    end

    subgraph Backend Infrastructure
        B -->|2. Enqueue Task| C{Redis Broker}
        C -->|3. Fetch Task| D[Celery Worker]
    end

    subgraph AI & Data Processing
        D -->|4. Split & Parse Code| E{Intelligent Chunker}
        E -->|5. Generate Embeddings| F(Ollama LLM)
        E -->|6. Store Metadata| G[(MongoDB)]
        F -->|7. Store Chunks & Vectors| H[(Weaviate Vector DB)]
    end

    style A fill:#cde4ff
    style B fill:#90caf9
    style D fill:#fff59d
```

## ✨ Key Features

* **Asynchronous Processing:** Upload entire project zip files and let the Celery workers handle the heavy lifting in the background without blocking the API.
* **Intelligent, Syntax-Aware Chunking:** Goes beyond simple text splitting to parse code into meaningful semantic units like functions and classes.
* **Extensible Language Support:** Designed from the ground up to support multiple programming languages with a clear strategy for adding more.
* **Rich Metadata:** Each code chunk is enriched with valuable metadata (language, unit type, name, location) for precise context retrieval.
* **Graceful Fallback:** Non-parseable files (like `.md` or `.properties`) are still indexed using a standard text splitter, ensuring no information is lost.
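The graceful-fallback idea can be sketched as a simple dispatch on file extension: files with a syntax-aware parser get semantic chunking, everything else falls back to plain text splitting. The extension map and strategy labels below are illustrative assumptions, not the project's actual configuration:

```python
from pathlib import Path

# Extensions the syntax-aware parser is expected to handle (illustrative set,
# matching the Phase 1 roadmap: Python and Java first).
PARSEABLE = {".py": "python", ".java": "java"}

def choose_strategy(filename: str) -> str:
    """Pick a chunking strategy for a file: syntax-aware where a parser
    exists, plain text splitting as the graceful fallback."""
    ext = Path(filename).suffix.lower()
    if ext in PARSEABLE:
        return f"syntax-aware:{PARSEABLE[ext]}"
    return "text-fallback"

print(choose_strategy("src/Main.java"))     # syntax-aware chunking
print(choose_strategy("application.yml"))   # indexed via the text fallback
```

Because unsupported files still flow through the fallback branch, configuration and documentation files remain searchable alongside the parsed code.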

## 🚀 Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### Prerequisites

* Docker & Docker Compose

### Installation & Launch

1. **Generate the Project Structure:** If you are starting from scratch, use the setup script to generate all necessary files and directories.

   ```bash
   chmod +x setup_project.sh
   ./setup_project.sh
   ```

2. **Build and Run the Services:** Use the management script to build the Docker images and launch all services (FastAPI, Celery, Redis).

   ```bash
   chmod +x manage.sh
   ./manage.sh rebuild
   ```

   To start the services if they are already built, use `./manage.sh start`.

3. **Verify the Services:**

   * The API is available at `http://localhost:8000`.
   * Redis is exposed at `localhost:6379`.

## Workflow & Usage

### 1. Upload a Project

Package your source code into a `.zip` file and upload it to the processing endpoint.

```bash
curl -X POST -F "file=@/path/to/your/project.zip" http://localhost:8000/api/v1/upload-and-process/
```

The API will immediately respond with a `task_id`:

```json
{
  "task_id": "a1b2c3d4-e5f6-7890-g1h2-i3j4k5l6m7n8",
  "message": "File upload successful. Processing has started."
}
```

### 2. Check Processing Status

Use the `task_id` to poll the status endpoint and retrieve the result once processing is complete.

```bash
curl http://localhost:8000/api/v1/tasks/status/a1b2c3d4-e5f6-7890-g1h2-i3j4k5l6m7n8
```
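A client can wrap this polling in a small loop. In the sketch below, `fetch_status` stands in for an HTTP GET against the status endpoint so the logic runs offline; the `state` values assume Celery's default terminal states, which is an assumption about the API's response shape:

```python
import time

def wait_for_task(task_id, fetch_status, poll_interval=2.0, timeout=60.0):
    """Poll until the task reaches a terminal state or the timeout expires.

    `fetch_status` is any callable mapping a task_id to a status dict, e.g.
    one wrapping a GET to /api/v1/tasks/status/{task_id}.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(task_id)
        # SUCCESS and FAILURE are Celery's default terminal states.
        if status.get("state") in {"SUCCESS", "FAILURE"}:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")

# Offline usage with a stubbed fetcher:
responses = iter([{"state": "PENDING"}, {"state": "SUCCESS"}])
print(wait_for_task("demo-task", lambda _tid: next(responses), poll_interval=0.0))
```

Injecting the fetcher keeps the loop independent of any HTTP library and easy to unit-test.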

## 🗺️ Roadmap & TODO

This project is under active development. Our roadmap is focused on building a robust, intelligent, and user-friendly code analysis platform.

### Phase 1: Intelligent Parsing Core
-   [ ] **Implement Syntax-Aware Chunking:**
    -   [ ] Replace `RecursiveCharacterTextSplitter` in the Celery worker with a new strategy based on `langchain.text_splitter.Language`.
    -   [ ] Add initial support for **Python** and **Java**.
    -   [ ] Implement the fallback mechanism for unsupported file types (`.properties`, `.xml`, `.md`, etc.).
    -   [ ] Implement the critical metadata enrichment for each chunk (`language`, `unit_type`, `unit_name`, `start_line`, `end_line`).
-   [ ] **Support Additional File Types:**
    -   [ ] Add parsers for shell scripts (`.sh`, `.bat`).
    -   [ ] Add specific handling for configuration files, especially for **Spring Boot** (`.properties`, `.yml`) and **Gradle** (`.gradle`, `.kts`).
-   [ ] **Implement Advanced Context Strategy:**
    -   [ ] Develop a mechanism to prepend relevant context (e.g., file-level imports, class docstrings) to function-level chunks.

### Phase 2: RAG Pipeline & UI
-   [ ] **Integrate Vector & Document Databases:**
    -   [ ] Connect the Celery worker to **Weaviate** to store code chunks and their embeddings.
    -   [ ] Connect to **MongoDB** to store project metadata and file information.
-   [ ] **Build the RAG Chain:**
    -   [ ] Implement the full retrieval and generation logic using LangChain, Ollama, and Weaviate.
-   [ ] **Develop a User Interface:**
    -   [ ] Create a web UI for managing projects and uploaded files.
    -   [ ] Implement a "Code Documentation Generator" feature.
    -   [ ] Build the core "Chat with your Code" interface to interact with the RAG pipeline.

## 🤝 Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated: please fork the repo and open a pull request.


## 📜 License

Distributed under the MIT License. See `LICENSE` for more information.
