This project implements a Retrieval-Augmented Generation (RAG) pipeline using FastAPI for the backend and React for the frontend. The backend handles document processing and question answering, and integrates with various language models and vector stores; the frontend provides a user interface for interacting with these services. The RAG approach grounds the model's answers in retrieved documents so that responses stay accurate and contextually relevant to each user query.
The pipeline's knowledge base consists of several types of documents:
- Legal documents (e.g., regulations, laws)
- Business documents (e.g., company-specific terms/jargon)
The documents are processed and indexed using FAISS, a popular vector store, to enable efficient retrieval based on semantic similarity.
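FAISS handles this similarity search efficiently at scale; the underlying idea can be sketched in pure Python. This is a minimal illustration of retrieval by cosine similarity over toy embeddings, not the project's actual indexing code — in the real pipeline the vectors come from the embedding model and FAISS performs the nearest-neighbour search:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D embeddings; real ones are high-dimensional model outputs.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(retrieve([1.0, 0.05], docs, k=2))  # [0, 1]
```

FAISS replaces the linear scan above with optimized index structures, which is what makes retrieval fast over large document collections.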
We chose Qwen3-8B as our main generation LLM (the largest model we could load locally). The model was fine-tuned on a custom dataset to better handle domain-specific queries. The fine-tuning process involved:
- Generating synthetic data using Gemini 2.5 Flash
This combines supervised fine-tuning (SFT) with knowledge distillation (KD): the model improves on domain-specific tasks by learning both from high-quality data and from a teacher model.
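The distillation part of this objective can be illustrated with the temperature-scaled KL divergence commonly used for KD. This is a scalar sketch of the standard formulation, not the project's training code (which runs through the fine-tuning framework):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the usual distillation objective."""
    p = softmax(teacher_logits, temperature)   # teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (temperature ** 2) * kl

# Identical logits give zero distillation loss.
print(kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
```

Softening both distributions with a temperature exposes the teacher's relative preferences among wrong answers, which is the extra signal KD adds on top of plain SFT.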
By using Pydantic, we ensure that the LLM's output is structured and adheres to a predefined schema. This greatly simplifies parsing and utilising the generated content.
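The pattern looks roughly like the following. The field names here are illustrative, not the project's actual schema; the point is that validating the model's raw JSON against a Pydantic class either yields typed data or fails loudly:

```python
import json
from pydantic import BaseModel

class Answer(BaseModel):
    """Schema the LLM's JSON output must conform to (illustrative fields)."""
    verdict: str
    confidence: float
    reasoning: str

# Raw text as it might come back from the model.
raw = '{"verdict": "compliant", "confidence": 0.87, "reasoning": "Matches clause 4."}'

answer = Answer(**json.loads(raw))  # raises a validation error if the shape is wrong
print(answer.verdict, answer.confidence)  # compliant 0.87
```

Downstream code can then rely on `answer.confidence` being a float rather than re-parsing free-form text.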
To ensure that user queries are well-formed and relevant, we implemented a query rewriting step using an LLM served through Ollama. This step reformulates the user's question to improve clarity and context before it is passed to the retrieval and generation components, reducing ambiguity and improving the quality of the retrieved documents.
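A minimal sketch of the rewriting step's prompt construction, with the instruction wording and helper name being assumptions rather than the project's actual code (the real call goes to the model served by Ollama):

```python
def build_rewrite_prompt(question: str, context_hint: str = "") -> str:
    """Compose the instruction sent to the rewriting model (illustrative wording)."""
    prompt = (
        "Rewrite the user's question so it is clear, specific, and self-contained. "
        "Return only the rewritten question.\n"
    )
    if context_hint:
        prompt += f"Domain context: {context_hint}\n"
    prompt += f"Question: {question}"
    return prompt

# In the pipeline, this prompt would be sent to the model via the Ollama client,
# e.g. ollama.chat(model=..., messages=[{"role": "user", "content": prompt}]).
print(build_rewrite_prompt("is it ok?", context_hint="GDPR compliance"))
```

Keeping the prompt construction separate from the model call makes it easy to unit-test the rewriting instructions without a running model.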
To mitigate the risk of hallucinations in the generated responses, we incorporated a hallucination check step. This step evaluates the generated answer against the retrieved documents to ensure factual accuracy. If the confidence score of the answer is below a certain threshold, the system flags it for review or requests additional information.
If the hallucination confidence is below the threshold, the generation step is retried up to 3 times.
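The retry-and-flag control flow can be sketched as below. The function names, threshold value, and result shape are illustrative stand-ins for the pipeline's real generator and hallucination checker:

```python
def answer_with_hallucination_check(generate, score_confidence,
                                    threshold=0.7, max_retries=3):
    """Regenerate until the grounding confidence clears the threshold,
    or flag the last attempt for review (illustrative control flow)."""
    answer = None
    for _ in range(max_retries):
        answer = generate()
        confidence = score_confidence(answer)
        if confidence >= threshold:
            return {"answer": answer, "confidence": confidence, "flagged": False}
    return {"answer": answer, "confidence": confidence, "flagged": True}

# Stub components standing in for the real LLM and checker:
# the third attempt finally clears the threshold.
scores = iter([0.4, 0.5, 0.9])
result = answer_with_hallucination_check(lambda: "draft answer",
                                         lambda a: next(scores))
print(result["flagged"], result["confidence"])  # False 0.9
```

If all retries stay below the threshold, the last answer is returned with `flagged` set, matching the review-or-request-more-information behaviour described above.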
To maintain transparency and accountability, we log all interactions with the RAG pipeline. This includes:
- User queries (timestamp, feature, feature description, answer)
Since the log cannot be tampered with by users, it provides an audit trail for all interactions, which is crucial for compliance and review purposes.
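An append-only JSON-lines log is one way to implement this. The fields below mirror the ones listed above, but the JSONL format and helper are a sketch, not necessarily the project's exact logging code:

```python
import json
import os
import tempfile
import time

def log_interaction(log_path, feature, description, answer):
    """Append one interaction to a JSON-lines audit log (illustrative format)."""
    entry = {
        "timestamp": time.time(),
        "feature": feature,
        "feature_description": description,
        "answer": answer,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

log_path = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
log_interaction(log_path, "data_export", "Exports user data", "Requires consent")

with open(log_path, encoding="utf-8") as f:
    entries = [json.loads(line) for line in f]
print(entries[0]["feature"])  # data_export
```

Because the file lives server-side and is only ever appended to, users have no path to alter past entries, which is what makes it usable as an audit trail.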
The RAG pipeline supports both single-question answering and batch processing of multiple queries. This flexibility allows users to efficiently handle large volumes of questions, making it suitable for various applications.
Users can also verify a single feature without having to upload a CSV file.
By allowing users to provide additional context or memory, the RAG pipeline can generate more informed and relevant answers.
The pipeline can be improved by users without having to retrain or modify the model.
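Injecting this user-provided context at prompt-assembly time is what makes the improvement possible without retraining. A sketch of the idea, with the prompt layout and helper name as assumptions:

```python
def build_generation_prompt(question, retrieved_docs, user_memory=None):
    """Assemble the generation prompt from retrieved chunks plus optional
    user-supplied context (illustrative layout)."""
    parts = []
    if user_memory:
        parts.append("Additional context from the user:\n" + user_memory)
    parts.append("Retrieved documents:\n" + "\n---\n".join(retrieved_docs))
    parts.append("Question: " + question)
    return "\n\n".join(parts)

prompt = build_generation_prompt(
    "Does this feature need age verification?",
    ["Regulation X requires age checks for minors."],
    user_memory="Our product targets users aged 13+.",
)
print(prompt)
```

The model weights never change; only the context it conditions on does, so users can correct or enrich answers immediately.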
- React
- Tailwind CSS
- FastAPI
- LangGraph
- Ollama
- Pydantic
- FAISS
- Qwen
- NOMIC
- PyTorch
- Unsloth (Transformers, BitsAndBytes, etc.)
- Qwen
- Visual Studio Code
- Git
- Linux
- Windows
- WSL2
- Gemini 2.5 Flash - used to generate synthetic data for finetuning
Setup instructions for the backend server using FastAPI and Uvicorn.
- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

  On Windows use:

  ```bash
  python -m venv .venv
  .venv\Scripts\activate
  ```

- Install the required packages in the virtual environment:

  ```bash
  pip install -r requirements.txt
  ```

- Run the FastAPI server:

  ```bash
  uvicorn api.main:app --host 0.0.0.0 --port 8000
  ```
Setup instructions for the frontend using React.
- Navigate to the frontend directory and install dependencies:

  ```bash
  cd frontend
  npm install
  ```

- Start the development server:

  ```bash
  npm start
  ```
Instructions for fine-tuning the model using the provided dataset.
- Navigate to the fine-tuning directory:

  ```bash
  cd fine_tuning
  ```

- Axolotl requires Linux, or WSL2 on Windows. Ensure you have the necessary environment set up.
- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install the required packages in the virtual environment:

  ```bash
  pip install -r requirements.txt
  ```

- Run the notebook:

  ```bash
  jupyter notebook
  ```
Instructions for running the trained GGUF model using Ollama.
- Load the model into Ollama:

  ```bash
  cd fine_tuning
  ollama create <model-name> -f Modelfile
  ```

- Run the model:

  ```bash
  ollama serve
  ```

- Due to upload limits, the model weights are split into 5 parts. Download all parts from the releases section, place them in the `fine_tuning/weights` directory, and combine the parts using 7-Zip.
YouTube Video: https://www.youtube.com/watch?v=Pf6fJ8ReJFo



