A production-grade Retrieval-Augmented Generation (RAG) runtime for private, offline querying of local documents with open-source LLMs.
The system ingests PDFs, text files, and research documents, builds vector embeddings, and returns grounded answers through both a CLI and a web UI.
• 100% Local — No cloud dependencies
• Private document ingestion
• Vector similarity retrieval
• Citation-aware responses
• Grounded LLM inference
• Web-based chat interface
• Windows / Linux / macOS compatible
User Query
↓
Retriever (Chroma Vector DB)
↓
Relevant Context Chunks
↓
Prompt Grounding Layer
↓
LLM Inference (Ollama / Mistral)
↓
Answer + Sources
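The grounding step above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code; the function name, chunk format, and prompt wording are hypothetical:

```python
def build_grounded_prompt(question, chunks):
    """Assemble a grounded prompt: retrieved chunks become numbered context
    blocks, and the model is instructed to answer only from that context."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk['text']} (source: {chunk['source']})"
        for i, chunk in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Example: two chunks returned by the retriever (contents invented here)
chunks = [
    {"text": "UVM pages migrate on fault.", "source": "uvm_notes.pdf"},
    {"text": "Snapshots capture GPU state.", "source": "snapshot_design.pdf"},
]
prompt = build_grounded_prompt("How does UVM paging work?", chunks)
print(prompt)
```

Because the answer is constrained to the numbered context, the model can cite `[1]`, `[2]`, … and the runtime can map those back to source files.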
local-rag-runtime/
│
├── ingest.py # Document ingestion pipeline
├── chat.py # CLI chat interface
├── rag_engine.py # Retrieval + grounding logic
├── webui.py # Gradio browser UI
│
├── vector_db/ # Embedding storage
├── data/ # Source documents
│
├── requirements.txt
└── README.md
git clone https://github.com/manishklach/local-rag-runtime.git
cd local-rag-runtime
Windows:
python -m venv venv
venv\Scripts\activate
Linux / Mac:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Manual install (if needed):
pip install langchain langchain-chroma langchain-huggingface sentence-transformers gradio chromadb requests
Download: https://ollama.com
Pull model:
ollama pull mistral
Start runtime:
ollama serve
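Once `ollama serve` is running, it exposes a local HTTP API (default `http://localhost:11434`). A minimal, stdlib-only sketch of a non-streaming generation request; the helper names are ours, not part of this project:

```python
import json
import urllib.request

def build_request(model, prompt):
    """Build a non-streaming payload for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt, host="http://localhost:11434"):
    """POST the prompt to the local Ollama server and return the text."""
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response JSON carries the generated text under "response".
        return json.loads(resp.read())["response"]

# generate("mistral", "Say hello")  # requires `ollama serve` to be running
```

No API key, no network egress: the request never leaves localhost.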
Place files inside:
data/
Run ingestion:
python ingest.py
This will:
• Split documents into chunks
• Generate embeddings
• Store vectors in Chroma DB
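The splitting step can be illustrated with a simple fixed-size chunker with overlap (the real pipeline most likely uses a LangChain text splitter; this standalone sketch just shows the idea):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks with overlap, so that
    content cut at a boundary still appears whole in the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
    return chunks

doc = "A" * 1200
print(len(chunk_text(doc)))  # chunks start at offsets 0, 450, 900 → 3 chunks
```

Overlap trades a little storage for retrieval quality: a sentence straddling a chunk boundary still lands intact in at least one chunk.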
python chat.py
Example queries:
Explain the snapshot pipeline.
How does UVM paging work?
Describe GPU suspend lifecycle.
Launch browser interface:
python webui.py
Open the local URL printed in the console (Gradio defaults to http://127.0.0.1:7860).
Features:
• Chat interface
• Grounded responses
• Source attribution
Current retrieval stack includes:
• Sentence‑Transformer embeddings
• Top‑K similarity search
• Context concatenation
• Prompt grounding
• Source tracking
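Top-K similarity search boils down to ranking document vectors by cosine similarity to the query embedding. A self-contained sketch (Chroma does this internally; names here are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], docs, k=2))  # → [0, 2]
```

The indices returned here are what the engine maps back to chunk text and source filenames for the grounding step.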
All processing occurs locally:
• Documents never leave the machine
• No external LLM APIs
• Air‑gapped capable
| Version | Features |
|---|---|
| v0.1 | CLI RAG pipeline |
| v0.2 | Web UI + citation retrieval |
| v0.3 | Streaming + chat memory (planned) |
| v1.0 | Enterprise runtime (planned) |
Planned upgrades:
• Inline citation highlighting
• Chunk scoring visualization
• Streaming token responses
• Multi-model switching
• Desktop packaging
• Kubernetes deployment
Use cases:
• Patent querying
• Research summarization
• Architecture review
• Codebase knowledge search
• Offline enterprise AI
Manish Keshav Lachwani
AI Infrastructure • GPU Runtime Systems • Memory Orchestration • RAG Architectures
GitHub: https://github.com/manishklach
Built on:
• LangChain
• ChromaDB
• Sentence Transformers
• Ollama
• Mistral LLM
• Gradio
Quick start:
ollama serve
ollama pull mistral
python ingest.py
python webui.py
Open browser → Ask questions → Get grounded answers.
Private AI. Local Intelligence. Zero Cloud.