This project implements a document retrieval system that processes PDF files, converts them to images, generates embeddings using the ColQwen2 model, and provides a search API with a web interface. The system allows users to search for relevant PDF pages based on text queries, with results displayed in a user-friendly HTML table.
project/
├── app/
│ ├── static/
│ │ └── index.html # Web interface for search
│ ├── main.py # FastAPI application
│ ├── models.py # Pydantic models for API
│ └── services.py # Document retrieval logic
├── examples/
│ ├── colpali_inference.py # Example inference script for ColPali model
│ └── inference.py # General inference example script
├── utils/
│ ├── create_embedding.py # Script to generate embeddings
│ └── preprocess_pdf_documents.py # Script to convert PDFs to images
├── assets/
│ ├── pdf/
│ ├── preprocessed_documents/
│ └── doc_embeddings/
└── requirements.txt # Project dependencies
- PDF Preprocessing: Converts PDFs to JPEG images for embedding.
- Embedding Generation: Uses the ColQwen2 model to generate embeddings for PDF pages.
- Search API: FastAPI endpoint (/search) to query embeddings and return top matching PDF pages.
- Web Interface: HTML page with a search bar and results table, styled with Tailwind CSS.
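ColQwen2 belongs to the ColPali family of models, which represent each page as a set of vectors and compare it with the query by late interaction (often called MaxSim). The sketch below illustrates that scoring idea in plain Python on toy data. It is not taken from app/services.py: the function names `maxsim_score` and `top_k_pages` are hypothetical, and a real implementation would more likely operate on torch tensors.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query vector take its best match
    among the page vectors, then sum those maxima over the query."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def top_k_pages(
    query_vecs: Sequence[Vector],
    pages: Dict[str, Sequence[Vector]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank pages by MaxSim score and keep the k highest (hypothetical helper)."""
    scored = [(name, maxsim_score(query_vecs, vecs)) for name, vecs in pages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional embeddings; real embeddings have many more dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "doc1_page0.jpg": [[1.0, 0.0], [0.0, 1.0]],
    "doc1_page1.jpg": [[0.5, 0.5]],
    "doc2_page0.jpg": [[0.0, 0.5]],
}
ranking = top_k_pages(query, pages, k=2)
print(ranking)
```

With the toy data, the page whose vectors match both query vectors ranks first, and only the two best pages are returned.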
- Python: 3.8 or higher
- CUDA (optional): For GPU acceleration with ColQwen2 (falls back to CPU if unavailable)
- Clone the Repository (if applicable):
    git clone https://github.com/bhqanhuit/DocumentRetrieval.git
    cd DocumentRetrieval
- Install Dependencies:
    pip install -r requirements.txt
  The requirements.txt includes:
    fastapi==0.115.2
    uvicorn==0.32.0
    torch
    colpali_engine
    transformers
    pdf2image
    pillow
    tqdm
    python-multipart
- Prepare Input Directory:
  - Place PDF files in assets/pdf.
Convert PDF files to JPEG images:
    python utils/preprocess_pdf_documents.py
- Input: PDFs in assets/pdf.
- Output: JPEG images in assets/preprocessed_documents (e.g., doc1_page0.jpg).
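The conversion step can be sketched roughly as follows. This is an illustration, not the contents of the project's preprocessing script: the helper names `page_image_name` and `convert_pdfs` and the `dpi` default are assumptions, and the only library call relied on is pdf2image's `convert_from_path`, which needs poppler installed on the system.

```python
from pathlib import Path
from typing import List


def page_image_name(pdf_path: Path, page_index: int) -> str:
    """Build a name such as 'doc1_page0.jpg' from 'doc1.pdf' and page index 0."""
    return f"{pdf_path.stem}_page{page_index}.jpg"


def convert_pdfs(
    input_dir: str = "assets/pdf",
    output_dir: str = "assets/preprocessed_documents",
    dpi: int = 200,  # assumed rendering resolution
) -> List[Path]:
    """Render every page of every PDF in input_dir to a JPEG in output_dir."""
    # Imported here so the naming helper stays usable without pdf2image/poppler.
    from pdf2image import convert_from_path

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written: List[Path] = []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        pages = convert_from_path(str(pdf_path), dpi=dpi)  # one PIL image per page
        for index, page in enumerate(pages):
            target = out / page_image_name(pdf_path, index)
            page.convert("RGB").save(target, "JPEG")
            written.append(target)
    return written
```

Calling `convert_pdfs()` with the defaults reads from assets/pdf and writes zero-indexed page images, matching the doc1_page0.jpg naming shown above.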
Generate embeddings for the images using the ColQwen2 model:
    python utils/create_embedding.py
- Input: JPEG images in assets/preprocessed_documents.
- Output:
  - Embeddings: assets/doc_embeddings/image_embeddings.pt
  - Metadata: assets/doc_embeddings/image_metadata.json
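A minimal sketch of what embedding generation might look like with colpali_engine is given below. Several details are assumptions rather than facts about utils/create_embedding.py: the checkpoint name `vidore/colqwen2-v1.0`, the batch size, the helper names `batched` and `embed_images`, and the metadata layout (a list of objects with a `file_path` key).

```python
import json
from pathlib import Path
from typing import Iterator, Sequence


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_images(
    image_dir: str = "assets/preprocessed_documents",
    output_dir: str = "assets/doc_embeddings",
    model_name: str = "vidore/colqwen2-v1.0",  # assumed checkpoint
    batch_size: int = 4,                        # assumed batch size
) -> int:
    """Embed every JPEG in image_dir and save embeddings plus metadata."""
    # Heavy dependencies are imported lazily, only when embedding is requested.
    import torch
    from PIL import Image
    from colpali_engine.models import ColQwen2, ColQwen2Processor

    # Use the GPU when available, otherwise fall back to the CPU.
    use_cuda = torch.cuda.is_available()
    device = "cuda:0" if use_cuda else "cpu"
    dtype = torch.bfloat16 if use_cuda else torch.float32

    model = ColQwen2.from_pretrained(model_name, torch_dtype=dtype, device_map=device).eval()
    processor = ColQwen2Processor.from_pretrained(model_name)

    paths = sorted(Path(image_dir).glob("*.jpg"))
    embeddings, metadata = [], []
    for batch_paths in batched(paths, batch_size):
        images = [Image.open(p).convert("RGB") for p in batch_paths]
        inputs = processor.process_images(images).to(model.device)
        with torch.no_grad():
            output = model(**inputs)  # shape: (batch, tokens, dim)
        embeddings.extend(torch.unbind(output.to("cpu")))
        metadata.extend({"file_path": str(p)} for p in batch_paths)  # assumed schema

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    torch.save(embeddings, out / "image_embeddings.pt")
    (out / "image_metadata.json").write_text(json.dumps(metadata, indent=2))
    return len(embeddings)
```

Keeping the embeddings and the metadata in the same order lets a search step map the index of a high-scoring embedding back to its page image.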
Start the FastAPI server:
    uvicorn app.main:app --host 0.0.0.0 --port 8000
- Access the web interface at http://localhost:8000.
- Web Interface:
  - Open http://localhost:8000 in a browser.
  - Enter a query (e.g., "how attention works") in the search bar.
  - Click "Search" to view the top 5 matching PDF pages in a table.
- API Endpoint:
  - Endpoint: POST /search
  - Request: JSON body with a query field.
  - Example:
      curl -X POST "http://localhost:8000/search" -H "Content-Type: application/json" -d '{"query": "what is colpali"}'
  - Response: JSON array of objects with file_path and page_number.
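The same request can be issued from Python using only the standard library. The sketch below assumes the server is running locally on port 8000 as described above. The helper names `build_search_request` and `search` and the timeout value are illustrative, not part of the project.

```python
import json
import urllib.request
from typing import Dict, List


def build_search_request(
    query: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a POST /search request with a JSON body containing the query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(
    query: str, base_url: str = "http://localhost:8000", timeout: float = 30.0
) -> List[Dict]:
    """Send the query and return the decoded JSON array of results."""
    request = build_search_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, `search("how attention works")` should return a list whose items carry the file_path and page_number fields described above.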
- Place PDFs in assets/pdf.
- Run python utils/preprocess_pdf_documents.py to convert the PDFs to JPEG images.
- Run python utils/create_embedding.py to generate embeddings.
- Start the server with uvicorn app.main:app --host 0.0.0.0 --port 8000.
- Open http://localhost:8000 and search for "what is attention?".
This project is licensed under the MIT License. See the LICENSE file for details.
- Built with FastAPI, ColQwen2, and pdf2image.