This project implements a Retrieval-Augmented Generation (RAG) pipeline using FastAPI for the backend and React for the frontend. The backend handles document processing and question answering, and integrates with various language models and vector stores; the frontend provides a user interface for interacting with these services. The RAG approach grounds the model's answers in retrieved documents so that responses stay accurate and contextually relevant to each user query.
The pipeline's knowledge base consists of several types of documents:
- Legal documents (e.g., regulations, laws)
- Business documents (e.g., company-specific terms/jargon)
The documents are processed and indexed using FAISS, a popular vector store, to enable efficient retrieval based on semantic similarity.
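FAISS handles this similarity search efficiently at scale; the underlying idea can be sketched in pure Python. This is a minimal illustration of retrieval by cosine similarity over toy embeddings, not the project's actual indexing code — in the real pipeline the vectors come from the embedding model and FAISS performs the nearest-neighbour search:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D embeddings; real ones are high-dimensional model outputs.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(retrieve([1.0, 0.05], docs, k=2))  # [0, 1]
```

FAISS replaces the linear scan above with optimized index structures, which is what makes retrieval fast over large document collections.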
We chose Qwen3-8B as our main generation LLM (the largest model we could load locally). The model was fine-tuned on a custom dataset to better handle domain-specific queries. The fine-tuning process involved:
- Generating synthetic data using Gemini 2.5 Flash
This combines supervised fine-tuning (SFT) with knowledge distillation (KD): the model improves on domain-specific tasks by learning both from high-quality data and from a teacher model.
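The distillation part of this objective can be illustrated with the temperature-scaled KL divergence commonly used for KD. This is a scalar sketch of the standard formulation, not the project's training code (which runs through the fine-tuning framework):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the usual distillation objective."""
    p = softmax(teacher_logits, temperature)   # teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (temperature ** 2) * kl

# Identical logits give zero distillation loss.
print(kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
```

Softening both distributions with a temperature exposes the teacher's relative preferences among wrong answers, which is the extra signal KD adds on top of plain SFT.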
By using Pydantic, we ensure that the LLM's output is structured and adheres to a predefined schema. This greatly simplifies parsing and utilising the generated content.
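The pattern looks roughly like the following. The field names here are illustrative, not the project's actual schema; the point is that validating the model's raw JSON against a Pydantic class either yields typed data or fails loudly:

```python
import json
from pydantic import BaseModel

class Answer(BaseModel):
    """Schema the LLM's JSON output must conform to (illustrative fields)."""
    verdict: str
    confidence: float
    reasoning: str

# Raw text as it might come back from the model.
raw = '{"verdict": "compliant", "confidence": 0.87, "reasoning": "Matches clause 4."}'

answer = Answer(**json.loads(raw))  # raises a validation error if the shape is wrong
print(answer.verdict, answer.confidence)  # compliant 0.87
```

Downstream code can then rely on `answer.confidence` being a float rather than re-parsing free-form text.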
To ensure that user queries are well-formed and relevant, we implemented a query rewriting step using an LLM served through Ollama. This step reformulates the user's question to improve clarity and context before it is passed to the retrieval and generation components, reducing ambiguity and improving the quality of the retrieved documents.
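A minimal sketch of the rewriting step's prompt construction, with the instruction wording and helper name being assumptions rather than the project's actual code (the real call goes to the model served by Ollama):

```python
def build_rewrite_prompt(question: str, context_hint: str = "") -> str:
    """Compose the instruction sent to the rewriting model (illustrative wording)."""
    prompt = (
        "Rewrite the user's question so it is clear, specific, and self-contained. "
        "Return only the rewritten question.\n"
    )
    if context_hint:
        prompt += f"Domain context: {context_hint}\n"
    prompt += f"Question: {question}"
    return prompt

# In the pipeline, this prompt would be sent to the model via the Ollama client,
# e.g. ollama.chat(model=..., messages=[{"role": "user", "content": prompt}]).
print(build_rewrite_prompt("is it ok?", context_hint="GDPR compliance"))
```

Keeping the prompt construction separate from the model call makes it easy to unit-test the rewriting instructions without a running model.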
To mitigate the risk of hallucinations in the generated responses, we incorporated a hallucination check step. This step evaluates the generated answer against the retrieved documents to ensure factual accuracy. If the confidence score of the answer is below a certain threshold, the system flags it for review or requests additional information.
If the hallucination confidence is below the threshold, the generation step is retried up to 3 times.
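The retry-and-flag control flow can be sketched as below. The function names, threshold value, and result shape are illustrative stand-ins for the pipeline's real generator and hallucination checker:

```python
def answer_with_hallucination_check(generate, score_confidence,
                                    threshold=0.7, max_retries=3):
    """Regenerate until the grounding confidence clears the threshold,
    or flag the last attempt for review (illustrative control flow)."""
    answer = None
    for _ in range(max_retries):
        answer = generate()
        confidence = score_confidence(answer)
        if confidence >= threshold:
            return {"answer": answer, "confidence": confidence, "flagged": False}
    return {"answer": answer, "confidence": confidence, "flagged": True}

# Stub components standing in for the real LLM and checker:
# the third attempt finally clears the threshold.
scores = iter([0.4, 0.5, 0.9])
result = answer_with_hallucination_check(lambda: "draft answer",
                                         lambda a: next(scores))
print(result["flagged"], result["confidence"])  # False 0.9
```

If all retries stay below the threshold, the last answer is returned with `flagged` set, matching the review-or-request-more-information behaviour described above.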
To maintain transparency and accountability, we log all interactions with the RAG pipeline. This includes:
- User queries (timestamp, feature, feature description, answer)
Since the log cannot be tampered with by users, it provides an audit trail for all interactions, which is crucial for compliance and review purposes.
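An append-only JSON-lines log is one way to implement this. The fields below mirror the ones listed above, but the JSONL format and helper are a sketch, not necessarily the project's exact logging code:

```python
import json
import os
import tempfile
import time

def log_interaction(log_path, feature, description, answer):
    """Append one interaction to a JSON-lines audit log (illustrative format)."""
    entry = {
        "timestamp": time.time(),
        "feature": feature,
        "feature_description": description,
        "answer": answer,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

log_path = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
log_interaction(log_path, "data_export", "Exports user data", "Requires consent")

with open(log_path, encoding="utf-8") as f:
    entries = [json.loads(line) for line in f]
print(entries[0]["feature"])  # data_export
```

Because the file lives server-side and is only ever appended to, users have no path to alter past entries, which is what makes it usable as an audit trail.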
The RAG pipeline supports both single-question answering and batch processing of multiple queries. This flexibility allows users to efficiently handle large volumes of questions, making it suitable for various applications.
Users can also verify a single feature without having to upload a CSV file.
By allowing users to provide additional context or memory, the RAG pipeline can generate more informed and relevant answers.
The pipeline can be improved by users without having to retrain or modify the model.
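Injecting this user-provided context at prompt-assembly time is what makes the improvement possible without retraining. A sketch of the idea, with the prompt layout and helper name as assumptions:

```python
def build_generation_prompt(question, retrieved_docs, user_memory=None):
    """Assemble the generation prompt from retrieved chunks plus optional
    user-supplied context (illustrative layout)."""
    parts = []
    if user_memory:
        parts.append("Additional context from the user:\n" + user_memory)
    parts.append("Retrieved documents:\n" + "\n---\n".join(retrieved_docs))
    parts.append("Question: " + question)
    return "\n\n".join(parts)

prompt = build_generation_prompt(
    "Does this feature need age verification?",
    ["Regulation X requires age checks for minors."],
    user_memory="Our product targets users aged 13+.",
)
print(prompt)
```

The model weights never change; only the context it conditions on does, so users can correct or enrich answers immediately.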
- React
- Tailwind CSS
- FastAPI
- LangGraph
- Ollama
- Pydantic
- FAISS
- Qwen
- NOMIC
- PyTorch
- Unsloth (Transformers, BitsAndBytes, etc.)
- Qwen
- Visual Studio Code
- Git
- Linux
- Windows
- WSL2
- Gemini 2.5 Flash - used to generate synthetic data for finetuning
Setup instructions for the backend server using FastAPI and Uvicorn.
- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

  On Windows use:

  ```bash
  python -m venv .venv
  .venv\Scripts\activate
  ```

- Install the required packages in the virtual environment:

  ```bash
  pip install -r requirements.txt
  ```

- Run the FastAPI server:

  ```bash
  uvicorn api.main:app --host 0.0.0.0 --port 8000
  ```
Setup instructions for the frontend using React.
- Navigate to the frontend directory and install dependencies:

  ```bash
  cd frontend
  npm install
  ```

- Start the development server:

  ```bash
  npm start
  ```
Instructions for fine-tuning the model using the provided dataset.
- Navigate to the fine-tuning directory:

  ```bash
  cd fine_tuning
  ```

- Axolotl requires Linux, or WSL2 on Windows. Ensure you have the necessary environment set up.
- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install the required packages in the virtual environment:

  ```bash
  pip install -r requirements.txt
  ```

- Run the notebook:

  ```bash
  jupyter notebook
  ```
Instructions for running the trained GGUF model using Ollama.
- Load the model into Ollama:

  ```bash
  cd fine_tuning
  ollama create <model-name> -f Modelfile
  ```

- Run the model:

  ```bash
  ollama serve
  ```

- Due to upload limits, the model weights are split into 5 parts. Download all parts from the releases section, place them in the `fine_tuning/weights` directory, and combine the parts using 7-Zip.
YouTube Video: https://www.youtube.com/watch?v=Pf6fJ8ReJFo



