# LeCoNav
LeCoNav (Legacy Code Navigator) is an AI-powered platform designed to help developers understand, document, and interact with large, complex, and legacy codebases. By leveraging a Retrieval-Augmented Generation (RAG) pipeline, LeCoNav allows you to "chat" with your source code, ask complex questions, and receive context-aware answers, drastically reducing the time spent on code archaeology.
## 🎯 The Core Challenge: Semantic Code Understanding
LLMs are powerful, but they are not inherently experts in software architecture. When analyzing source code, simply treating it as a continuous stream of text leads to catastrophic context fragmentation. A naive approach, like splitting a file every N characters, will inevitably break the semantic integrity of the code.
| Naive Chunking (`RecursiveCharacterTextSplitter`) | ✅ Intelligent Chunking (Our Approach) |
| :--- | :--- |
| ❌ Breaks functions and classes mid-definition. | ✅ Preserves the integrity of logical code units (functions, classes, methods). |
| ❌ Separates comments and docstrings from their code. | ✅ Associates documentation and comments with their corresponding code block. |
| ❌ Creates chunks with little to no semantic meaning. | ✅ Creates chunks that represent a single, complete semantic concept. |
| ❌ Lacks crucial context (e.g., imports, parent class). | ✅ Enriches chunks with vital metadata: language, unit type, name, and line numbers. |
| **Result: Low-quality retrieval, incorrect answers.** | **Result: High-quality, precise context retrieval for accurate LLM responses.** |
Our core mission is to replace this naive method with a **syntax-aware parsing strategy**, ensuring each piece of vectorized code represents a complete, logical unit.
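The difference is easy to see in code. The pipeline itself targets LangChain's language-aware splitters (see the roadmap below), but as a minimal, stdlib-only illustration, a syntax-aware chunker for Python can be sketched with the `ast` module; function and field names here are illustrative, not the project's API:

```python
import ast

def chunk_python_source(source: str, path: str = "<memory>") -> list[dict]:
    """Split Python source into one chunk per top-level function or class,
    attaching the metadata the intelligent chunker needs for retrieval."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            unit_type = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append({
                "language": "python",
                "unit_type": unit_type,
                "unit_name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # The chunk body is the complete logical unit, never a mid-function cut.
                "content": "\n".join(lines[node.lineno - 1 : node.end_lineno]),
                "source_file": path,
            })
    return chunks
```

Each resulting chunk is a complete semantic unit with its location metadata, which is exactly what naive character splitting cannot guarantee.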
## 🏗️ Architecture
LeCoNav is built on a modern, scalable microservices architecture designed for asynchronous processing of large codebases.
```mermaid
graph TD
    subgraph User Interaction
        A[Developer] -->|1. Upload .zip| B(FastAPI Web API)
    end
    subgraph Backend Infrastructure
        B -->|2. Enqueue Task| C{Redis Broker}
        C -->|3. Fetch Task| D[Celery Worker]
    end
    subgraph AI & Data Processing
        D -->|4. Split & Parse Code| E{Intelligent Chunker}
        E -->|5. Generate Embeddings| F(Ollama LLM)
        E -->|6. Store Metadata| G[(MongoDB)]
        F -->|7. Store Chunks & Vectors| H[(Weaviate Vector DB)]
    end
    style A fill:#cde4ff
    style B fill:#90caf9
    style D fill:#fff59d
```
## ✨ Key Features
* **Asynchronous Processing:** Upload entire project zip files and let the Celery workers handle the heavy lifting in the background without blocking the API.
* **Intelligent, Syntax-Aware Chunking:** Goes beyond simple text splitting to parse code into meaningful semantic units like functions and classes.
* **Extensible Language Support:** Designed from the ground up to support multiple programming languages with a clear strategy for adding more.
* **Rich Metadata:** Each code chunk is enriched with valuable metadata (language, unit type, name, location) for precise context retrieval.
* **Graceful Fallback:** Non-parseable files (like `.md` or `.properties`) are still indexed using a standard text splitter, ensuring no information is lost.
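The fallback path for non-parseable files can be as simple as a fixed-window character splitter with overlap. This is an illustrative stdlib sketch of that idea, not the project's actual splitter (the roadmap uses LangChain's splitters for this):

```python
def fallback_split(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Plain character splitter with overlap for files that have no parser
    (e.g. .md, .properties): every character is still indexed, just without
    syntax-aware boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    # The overlap keeps a sentence that straddles a boundary retrievable
    # from at least one chunk.
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

A file shorter than `chunk_size` comes back as a single chunk, so nothing is ever dropped.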
## 🚀 Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### Prerequisites
- Docker & Docker Compose

### Installation
1. **Generate the Project Structure:** If you are starting from scratch, use the setup script to generate all necessary files and directories.
   ```bash
   chmod +x setup_project.sh
   ./setup_project.sh
   ```
2. **Build and Run the Services:** Use the management script to build the Docker images and launch all services (FastAPI, Celery, Redis).
   ```bash
   chmod +x manage.sh
   ./manage.sh rebuild
   ```
   To simply start the services if they are already built, use `./manage.sh start`.
3. **Verify the Services:**
   - The API is available at `http://localhost:8000`.
   - Redis is exposed at `localhost:6379`.
## Workflow & Usage
### 1. Upload a Project
Package your source code into a `.zip` file and upload it to the processing endpoint.
```bash
curl -X POST -F "file=@/path/to/your/project.zip" http://localhost:8000/api/v1/upload-and-process/
```

The API will immediately respond with a `task_id`:

```json
{
  "task_id": "a1b2c3d4-e5f6-7890-g1h2-i3j4k5l6m7n8",
  "message": "File upload successful. Processing has started."
}
```

### 2. Check Processing Status
Use the `task_id` to poll the status endpoint and retrieve the result once processing is complete.

```bash
curl http://localhost:8000/api/v1/tasks/status/a1b2c3d4-e5f6-7890-g1h2-i3j4k5l6m7n8
```
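Instead of polling by hand, the two calls above can be wrapped in a small stdlib client. The `status` field name and the Celery-style state strings are assumptions about the response schema, not a documented contract:

```python
import json
import time
import urllib.request

API_BASE = "http://localhost:8000/api/v1"  # from the curl examples above

def status_url(task_id: str) -> str:
    return f"{API_BASE}/tasks/status/{task_id}"

def wait_for_task(task_id: str, poll_seconds: float = 2.0, timeout: float = 600.0) -> dict:
    """Poll the status endpoint until the task reaches a terminal state
    or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(status_url(task_id)) as resp:
            payload = json.loads(resp.read())
        # Celery-style states: PENDING / STARTED / SUCCESS / FAILURE (assumed)
        if payload.get("status") in ("SUCCESS", "FAILURE"):
            return payload
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```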
## 🗺️ Roadmap & TODO
This project is under active development. Our roadmap is focused on building a robust, intelligent, and user-friendly code analysis platform.
### Phase 1: Intelligent Parsing Core
- [ ] **Implement Syntax-Aware Chunking:**
- [ ] Replace `RecursiveCharacterTextSplitter` in the Celery worker with a new strategy based on `langchain.text_splitter.Language`.
- [ ] Add initial support for **Python** and **Java**.
- [ ] Implement the fallback mechanism for unsupported file types (`.properties`, `.xml`, `.md`, etc.).
- [ ] Implement the critical metadata enrichment for each chunk (`language`, `unit_type`, `unit_name`, `start_line`, `end_line`).
- [ ] **Support Additional File Types:**
- [ ] Add parsers for shell scripts (`.sh`, `.bat`).
- [ ] Add specific handling for configuration files, especially for **Spring Boot** (`.properties`, `.yml`) and **Gradle** (`.gradle`, `.kts`).
- [ ] **Implement Advanced Context Strategy:**
- [ ] Develop a mechanism to prepend relevant context (e.g., file-level imports, class docstrings) to function-level chunks.
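The context-prepending item in Phase 1 can be sketched with the stdlib `ast` module: collect the file-level imports and glue them onto a function-level chunk before embedding. The chunk shape here is hypothetical, mirroring the metadata fields listed above:

```python
import ast

def prepend_file_context(source: str, chunk: dict) -> str:
    """Prepend file-level imports to a function-level chunk so the embedding
    captures the names the function depends on. Assumes chunk has a
    'content' key holding the unit's source text."""
    tree = ast.parse(source)
    lines = source.splitlines()
    imports = [
        "\n".join(lines[node.lineno - 1 : node.end_lineno])
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join(imports)
    return f"{header}\n\n{chunk['content']}" if header else chunk["content"]
```

The same idea extends to class docstrings and parent-class names; the trade-off is a larger chunk against a more self-explanatory one.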
### Phase 2: RAG Pipeline & UI
- [ ] **Integrate Vector & Document Databases:**
- [ ] Connect the Celery worker to **Weaviate** to store code chunks and their embeddings.
- [ ] Connect to **MongoDB** to store project metadata and file information.
- [ ] **Build the RAG Chain:**
- [ ] Implement the full retrieval and generation logic using LangChain, Ollama, and Weaviate.
- [ ] **Develop a User Interface:**
- [ ] Create a web UI for managing projects and uploaded files.
- [ ] Implement a "Code Documentation Generator" feature.
- [ ] Build the core "Chat with your Code" interface to interact with the RAG pipeline.
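Independent of Weaviate and Ollama, the retrieve-then-generate core of the RAG chain reduces to a similarity search plus prompt assembly. A toy, stdlib-only sketch under those assumptions (real embeddings and the vector search itself would come from the model and the database):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], indexed_chunks: list[tuple], k: int = 2) -> list[str]:
    """Return the k chunk texts whose embeddings are closest to the query.
    indexed_chunks: (embedding, chunk_text) pairs, as the vector DB would return."""
    ranked = sorted(indexed_chunks, key=lambda pair: cosine(query_vec, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this code context:\n{context}\n\nQuestion: {question}"
```

The LangChain implementation replaces `retrieve` with a Weaviate retriever and `build_prompt` with a prompt template, but the data flow is the same.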
## 🤝 Contributing
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please fork the repo and create a pull request.
## 📜 License
Distributed under the MIT License. See `LICENSE` for more information.