
Local RAG Runtime

Private Document Intelligence + Grounded LLM Inference (Fully Local)

A production-grade Retrieval-Augmented Generation (RAG) runtime for private, offline querying of local documents using open‑source LLMs.

The system ingests PDFs, text files, and research documents, builds vector embeddings, and returns grounded answers through both a CLI and a web UI.


🚀 Key Features

• 100% Local — No cloud dependencies
• Private document ingestion
• Vector similarity retrieval
• Citation-aware responses
• Grounded LLM inference
• Web-based chat interface
• Windows / Linux compatible


🧠 Architecture

User Query
    ↓
Retriever (Chroma Vector DB)
    ↓
Relevant Context Chunks
    ↓
Prompt Grounding Layer
    ↓
LLM Inference (Ollama / Mistral)
    ↓
Answer + Sources
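The stages above can be sketched as a short function pipeline. This is a minimal illustration only — the helper names (`retrieve`, `ground`) and the keyword-overlap scoring are hypothetical stand-ins, not the actual rag_engine.py implementation:

```python
# Illustrative sketch of the retrieve -> ground flow above.
# Real retrieval uses vector similarity, not keyword overlap.

def retrieve(query, store, k=4):
    """Return the k chunks whose text best overlaps the query words."""
    words = query.lower().split()
    scored = sorted(store, key=lambda c: -sum(w in c["text"].lower() for w in words))
    return scored[:k]

def ground(query, chunks):
    """Build a prompt that pins the model to the retrieved context."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (
        "Answer using ONLY the context below. Cite sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

store = [
    {"source": "gpu.pdf", "text": "UVM paging migrates pages between CPU and GPU on fault."},
    {"source": "notes.txt", "text": "The snapshot pipeline serializes device state."},
]
prompt = ground("How does UVM paging work?", retrieve("How does UVM paging work?", store, k=1))
```

The grounded prompt is then sent to the local LLM, which answers from the supplied context rather than from its parametric memory.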


📂 Repository Structure

local-rag-runtime/
│
├── ingest.py          # Document ingestion pipeline
├── chat.py            # CLI chat interface
├── rag_engine.py      # Retrieval + grounding logic
├── webui.py           # Gradio browser UI
│
├── vector_db/         # Embedding storage
├── data/              # Source documents
│
├── requirements.txt
└── README.md


⚙️ Installation

1️⃣ Clone Repo

git clone https://github.com/manishklach/local-rag-runtime.git
cd local-rag-runtime


2️⃣ Create Virtual Environment

Windows:

python -m venv venv
venv\Scripts\activate

Linux / macOS:

python3 -m venv venv
source venv/bin/activate


3️⃣ Install Dependencies

pip install -r requirements.txt

Manual install (if needed):

pip install langchain
pip install langchain-chroma
pip install langchain-huggingface
pip install sentence-transformers
pip install gradio
pip install chromadb
pip install requests


🤖 Install Local LLM (Ollama)

Download: https://ollama.com

Pull model:

ollama pull mistral

Start runtime:

ollama serve


📥 Document Ingestion

Place files inside:

data/

Run ingestion:

python ingest.py

This will:

• Split documents into chunks
• Generate embeddings
• Store vectors in Chroma DB
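The chunking step works roughly like this — a simplified fixed-size splitter with overlap, shown for illustration (the actual ingest.py may use LangChain's text splitters instead):

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    The overlap keeps sentences that straddle a chunk boundary
    retrievable from at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by the stride
    return chunks

chunks = split_into_chunks("a" * 1200, chunk_size=500, overlap=50)
# 1200 chars with stride 450 -> chunks starting at offsets 0, 450, 900
```

Each chunk is then embedded and stored in Chroma alongside its source filename, so answers can cite where the text came from.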


💬 CLI Chat

python chat.py

Example queries:

Explain the snapshot pipeline.
How does UVM paging work?
Describe GPU suspend lifecycle.
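Under the hood, chat.py talks to the local Ollama server over HTTP. The request shape against Ollama's `/api/generate` endpoint is roughly as follows (a sketch; the real script's code may differ):

```python
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def build_request(prompt: str, model: str = "mistral") -> dict:
    """Build the JSON payload for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_request("Explain the snapshot pipeline.")

# With `ollama serve` running, the call itself would be:
#   import requests
#   resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
#   answer = resp.json()["response"]
```

Setting `"stream": False` returns the whole completion in one response object, which keeps the CLI loop simple.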


🌐 Web UI Chat

Launch browser interface:

python webui.py

Open:

http://127.0.0.1:7860

Features:

• Chat interface
• Grounded responses
• Source attribution
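Source attribution can be as simple as appending the retrieved chunks' metadata to the model's answer. A hypothetical formatter (not the actual webui.py code):

```python
def attach_sources(answer: str, chunks: list[dict]) -> str:
    """Append a deduplicated, order-preserving source list to an answer."""
    seen = []
    for c in chunks:
        if c["source"] not in seen:
            seen.append(c["source"])
    lines = [answer, "", "Sources:"]
    lines += [f"  [{i + 1}] {s}" for i, s in enumerate(seen)]
    return "\n".join(lines)

out = attach_sources(
    "UVM pages migrate between CPU and GPU on fault.",
    [{"source": "gpu.pdf"}, {"source": "uvm.pdf"}, {"source": "gpu.pdf"}],
)
```

Deduplication matters because several retrieved chunks often come from the same document.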


📊 Retrieval Quality

Current retrieval stack includes:

• Sentence‑Transformer embeddings
• Top‑K similarity search
• Context concatenation
• Prompt grounding
• Source tracking
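Top‑K similarity search reduces to ranking stored vectors by cosine similarity against the query embedding. A pure‑Python sketch of the idea (Chroma does this with an optimized index rather than a linear scan):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k=2):
    """Return indices of the k stored vectors most similar to the query."""
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(vectors)]
    scores.sort(reverse=True)  # highest similarity first
    return [i for _, i in scores[:k]]

hits = top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.9, 0.9]], k=2)
```

The texts of the winning chunks are then concatenated into the grounding prompt, with their source metadata tracked for citation.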


🔒 Privacy Model

All processing occurs locally:

• Documents never leave the machine
• No external LLM APIs
• Air‑gapped capable


🏷️ Releases

Version   Features
v0.1      CLI RAG pipeline
v0.2      Web UI + citation retrieval
v0.3      Streaming + chat memory (planned)
v1.0      Enterprise runtime

🛠️ Roadmap

Planned upgrades:

• Inline citation highlighting
• Chunk scoring visualization
• Streaming token responses
• Multi-model switching
• Desktop packaging
• Kubernetes deployment


🔬 Example Use Cases

• Patent querying
• Research summarization
• Architecture review
• Codebase knowledge search
• Offline enterprise AI


👤 Author

Manish Keshav Lachwani
AI Infrastructure • GPU Runtime Systems • Memory Orchestration • RAG Architectures

GitHub: https://github.com/manishklach


⭐ Acknowledgements

Built on:

• LangChain
• ChromaDB
• Sentence Transformers
• Ollama
• Mistral LLM
• Gradio


🚀 Quick Start

ollama serve
ollama pull mistral
python ingest.py
python webui.py

Open browser → Ask questions → Get grounded answers.


Private AI. Local Intelligence. Zero Cloud.
