Ask questions. Get grounded answers from any document you bring.
Papertrail is a lightweight document question-answering agent built with Streamlit. It indexes a document, retrieves relevant evidence, and produces grounded answers with clear attribution. No hidden training data. Every answer is derived from the document you load.
- PDF Upload: robust text extraction using PyMuPDF
- URL Input: scrape and index webpage content
- Paste Text: index arbitrary text instantly
- Hybrid retrieval (semantic embeddings + TF-IDF)
- Cross-encoder reranking for precision
- Evidence-based answer generation
- Optional local LLM responses via Ollama
- Extractive fallback when generation is unavailable
- Section and page-level attribution
- Supporting passages viewer
```
Document
   ↓
Text extraction
   ↓
Chunking
   ↓
Hybrid retrieval (embeddings + TF-IDF)
   ↓
Cross-encoder reranking
   ↓
Evidence extraction (MMR)
   ↓
Answer generation (optional)
```
Pipeline summary:
Chunking → Hybrid retrieval → Neural rerank → Evidence grounding → Optional generation
Papertrail extracts document text using PyMuPDF, which preserves reading order and spacing more reliably than many PDF parsers.
Each paragraph is mapped to:
- its section heading (if detected)
- its page number
These mappings enable precise attribution in answers.
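A minimal sketch of how such a mapping could work. The heading heuristic (short lines without terminal punctuation) and the function names are illustrative assumptions, not Papertrail's exact rules:

```python
# Sketch of paragraph-to-attribution mapping. The heading heuristic
# below is an illustrative assumption, not Papertrail's exact rule.
def looks_like_heading(line: str) -> bool:
    """Treat short lines without terminal punctuation as headings."""
    line = line.strip()
    return bool(line) and len(line) < 60 and not line.endswith((".", ":", ";"))

def map_paragraphs(pages):
    """pages: list of (page_number, [paragraph, ...]) tuples."""
    mapped, current_heading = [], None
    for page_no, paragraphs in pages:
        for para in paragraphs:
            if looks_like_heading(para):
                current_heading = para.strip()
            else:
                mapped.append({"text": para,
                               "section": current_heading,
                               "page": page_no})
    return mapped

records = map_paragraphs([
    (1, ["Introduction", "Papertrail answers questions."]),
    (2, ["It cites pages."]),
])
```

The carried-forward `current_heading` is what lets a paragraph deep inside a section still report which heading it falls under.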
The document is split into overlapping text chunks so that contextual relationships are preserved across chunk boundaries. Chunking enables efficient indexing and retrieval across large documents.
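A sliding-window chunker illustrates the idea; the window and overlap sizes here are illustrative, not Papertrail's defaults:

```python
def chunk_text(text: str, size: int = 400, overlap: int = 80):
    """Split text into overlapping character windows.

    Sizes are illustrative; the overlap keeps context that straddles
    a chunk boundary visible in both neighboring chunks.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, step = [], size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text, size=400, overlap=80)
# The tail of each chunk repeats as the head of the next.
```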
Each chunk is indexed in two ways:
1. Dense: sentence-transformers embeddings, which capture semantic similarity.
Example:
advantages ≈ pros
drawbacks ≈ disadvantages
2. Sparse: TF-IDF with bigrams, which captures exact phrases and technical terminology.
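In scikit-learn terms, bigrams come from `ngram_range=(1, 2)`; the chunks and query below are toy examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Unigram + bigram TF-IDF index over chunks. ngram_range=(1, 2) is the
# standard scikit-learn setting for including bigrams; the corpus here
# is a toy example, not Papertrail's actual configuration.
chunks = [
    "Hybrid retrieval combines dense and sparse scores.",
    "Cross-encoder reranking improves precision.",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(chunks)

query_vec = vectorizer.transform(["cross-encoder reranking"])
scores = cosine_similarity(query_vec, matrix).ravel()
```

Because the query shares both tokens and the bigram "cross encoder" with the second chunk, that chunk scores highest.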
The two scores are fused with a weighted combination:

```
score = α * dense_similarity + (1 - α) * tfidf_similarity
```

Hybrid retrieval improves recall across both semantic and lexical queries.
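The fusion step itself is a one-liner over precomputed similarity arrays; the weight α = 0.6 below is illustrative, not Papertrail's setting:

```python
import numpy as np

def hybrid_scores(dense_sim, tfidf_sim, alpha=0.6):
    """Weighted fusion: score = alpha * dense + (1 - alpha) * tfidf.

    alpha=0.6 is an illustrative weight, not Papertrail's default.
    """
    dense_sim = np.asarray(dense_sim, dtype=float)
    tfidf_sim = np.asarray(tfidf_sim, dtype=float)
    return alpha * dense_sim + (1 - alpha) * tfidf_sim

# Toy similarities for three chunks: chunk 0 wins semantically,
# chunk 1 wins lexically, chunk 2 is middling on both.
scores = hybrid_scores([0.9, 0.2, 0.5], [0.1, 0.8, 0.5])
ranking = np.argsort(scores)[::-1]
```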
The top candidate chunks are reranked using a cross-encoder (MS MARCO MiniLM). Unlike embedding similarity, cross-encoders evaluate the question and chunk jointly, significantly improving ranking precision.
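The reranking loop can be sketched as follows. In Papertrail the scorer would be a sentence-transformers `CrossEncoder` over MS MARCO MiniLM; to keep this sketch self-contained, `overlap_score` is a lightweight stand-in, not the real model:

```python
def rerank(question, chunks, score_fn, top_k=3):
    """Score (question, chunk) pairs jointly and keep the best top_k.

    score_fn stands in for a cross-encoder: unlike embedding similarity,
    it sees the question and the chunk together.
    """
    scored = [(score_fn(question, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# Placeholder scorer (an assumption, not the real model): token overlap.
def overlap_score(q, c):
    q_tokens, c_tokens = set(q.lower().split()), set(c.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

best = rerank("what is hybrid retrieval",
              ["hybrid retrieval fuses scores", "pdf parsing with pymupdf"],
              overlap_score, top_k=1)
```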
Maximal Marginal Relevance (MMR) selects a diverse set of high-relevance sentences from the top chunks. This produces an evidence pack that:
- reduces noise
- prevents redundancy
- ensures answers remain grounded
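A compact MMR implementation over unit-normalized sentence vectors shows the relevance/diversity trade-off; the λ value and toy vectors are illustrative assumptions:

```python
import numpy as np

def mmr(query_vec, sent_vecs, k=3, lam=0.4):
    """Maximal Marginal Relevance selection.

    lam trades relevance against redundancy with already-selected
    sentences; 0.4 is an illustrative value, not Papertrail's setting.
    """
    sent_vecs = np.asarray(sent_vecs, dtype=float)
    relevance = sent_vecs @ np.asarray(query_vec, dtype=float)
    selected, candidates = [], list(range(len(sent_vecs)))
    while candidates and len(selected) < k:
        if not selected:
            best = max(candidates, key=lambda i: relevance[i])
        else:
            def mmr_score(i):
                redundancy = max(sent_vecs[i] @ sent_vecs[j] for j in selected)
                return lam * relevance[i] - (1 - lam) * redundancy
            best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy unit vectors: sentence 1 nearly duplicates sentence 0, so MMR
# skips it in favor of the more diverse sentence 2.
query = [1.0, 0.0]
sents = [[1.0, 0.0], [0.98, 0.199], [0.2, 0.98]]
order = mmr(query, sents, k=2)
```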
Papertrail supports three answer modes.
The extractive mode produces a grounded answer directly from the document. This mode guarantees:
- no hallucination
- deterministic behavior
- complete grounding in the source text
The local LLM mode uses Ollama to synthesize an answer from retrieved evidence.
Advantages:
- higher fluency
- better explanations
- no external API required
Install Ollama:
https://ollama.ai
Then pull a model:
```
ollama pull llama3
```

A third mode uses serverless inference for generation when available.
If generation fails or times out, Papertrail automatically falls back to grounded extractive answers.
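The fallback shape can be sketched in a few lines; `generate` stands in for the LLM call (Ollama or hosted), and the structure here is an illustrative sketch of the behavior, not Papertrail's exact code:

```python
def answer(question, evidence, generate=None):
    """Try generative answering; fall back to extractive evidence on failure.

    `generate` stands in for an LLM call (Ollama or hosted inference);
    this sketch illustrates the fallback behavior, not the real code.
    """
    if generate is not None:
        try:
            return {"mode": "generative",
                    "text": generate(question, evidence)}
        except Exception:
            pass  # timeout, connection error, model missing, etc.
    # Extractive fallback: return top evidence sentences verbatim.
    return {"mode": "extractive", "text": " ".join(evidence[:3])}

def broken_llm(q, e):
    raise TimeoutError("model unavailable")

result = answer("What is Papertrail?",
                ["Papertrail is a document QA agent."],
                generate=broken_llm)
```

Because the fallback returns source sentences verbatim, a generation failure degrades fluency but never grounding.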
Each answer can include:
- detected section headings
- PDF page numbers
- supporting passages
This makes it easy to verify exactly where an answer came from.
```
pip install -r requirements.txt
streamlit run app.py
```

- Push this repository to GitHub
- Go to share.streamlit.io
- Connect your repo
- Set `app.py` as the entry point
- Deploy
| Component | Tool |
|---|---|
| UI | Streamlit |
| Embeddings | sentence-transformers |
| Sparse retrieval | scikit-learn TF-IDF |
| Neural reranking | CrossEncoder (MS MARCO MiniLM) |
| PDF parsing | PyMuPDF |
| Web scraping | requests + BeautifulSoup |
| Local LLM | Ollama |
Pure TF-IDF fails on synonyms.
Example:
pros and cons
advantages and disadvantages
Semantic retrieval fixes this, but pure embeddings miss exact keywords. Hybrid retrieval combines both.
Possible future improvements:
- Multi-document search
- Vector database backend (FAISS, Qdrant)
- Persistent embedding cache
- Structured citations
- Conversation memory
- Additional document formats (`.docx`, `.csv`)
- Improved OCR for scanned PDFs
The model should answer from your document, not from its training data. Every answer is grounded in retrieved evidence.