feat(ingestion): add document registry and local ingestion backbone by kpeez · Pull Request #33 · kpeez/paperchat

kpeez · 2026-03-09T21:19:58Z

This pull request adds document ingestion and embedding infrastructure to the backend and completes the package rename. The main changes are a database migration for tracking document ingestion jobs, a new Node.js embedding runtime, and a refactor that unifies import paths and configuration management. The changes are grouped by theme below:

Database and Ingestion Infrastructure:

Added a new Alembic migration (0002_ingestion_registry.py) that expands the documents and document_chunks tables with new columns (e.g., parser_id, chunker_id, embedding_model_id, chunk_count, error fields) and introduces a new ingestion_jobs table to track document ingestion attempts and statuses. This migration also adds relevant constraints and indexes for data integrity and performance.
Updated imports throughout the backend to use the new paperchat.db.schema path instead of the previous paperchat_backend.db.schema, ensuring consistency with the refactored package structure.
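
The registry shape described above can be sketched with Python's built-in sqlite3 module in place of Alembic and Postgres. This is an illustration only, not the actual migration: the column names parser_id, chunker_id, embedding_model_id, and chunk_count come from the description, while the title column, the status values, the error_message column, and the index name are assumptions.

```python
import sqlite3

# Assumed schema: status values, error_message, title, and the index name
# are hypothetical; the real migration (0002_ingestion_registry.py) may differ.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    parser_id TEXT,
    chunker_id TEXT,
    embedding_model_id TEXT,
    chunk_count INTEGER NOT NULL DEFAULT 0 CHECK (chunk_count >= 0),
    error_message TEXT
);
CREATE TABLE ingestion_jobs (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    status TEXT NOT NULL
        CHECK (status IN ('queued', 'running', 'succeeded', 'failed')),
    error_message TEXT
);
CREATE INDEX ix_ingestion_jobs_document_id ON ingestion_jobs (document_id);
"""


def create_registry(conn):
    """Create the sketch tables and turn on foreign-key enforcement."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


def record_attempt(conn, document_id, status, error_message=None):
    """Insert one row per ingestion attempt and return its id."""
    cur = conn.execute(
        "INSERT INTO ingestion_jobs (document_id, status, error_message) "
        "VALUES (?, ?, ?)",
        (document_id, status, error_message),
    )
    return cur.lastrowid
```

Keeping one row per attempt, rather than overwriting a single status column on the document, is what lets a retry preserve the history of earlier failures.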

Embedding Runtime and Configuration:

Introduced a new Node.js embedding runtime (embedder.mjs) that uses the node-llama-cpp library to generate text embeddings, along with its own package.json for dependency management. This enables efficient, language-agnostic embedding generation.
Expanded configuration in config.py to include embedding model and cache directory settings, with new helper functions for retrieving embedding model names and cache locations, supporting environment variable overrides.
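
A minimal sketch of configuration helpers with environment variable overrides, in the spirit of the config.py change described above. The function names, environment variable names, and default values here are all hypothetical; the actual names in config.py are not given in this description.

```python
import os
from pathlib import Path

# Hypothetical names and defaults, chosen for illustration only.
EMBEDDING_MODEL_ENV = "PAPERCHAT_EMBEDDING_MODEL"
EMBEDDING_CACHE_ENV = "PAPERCHAT_EMBEDDING_CACHE_DIR"
DEFAULT_EMBEDDING_MODEL = "example-embedding-model"
DEFAULT_CACHE_DIR = Path(".paperchat") / "embedding-cache"


def get_embedding_model_name() -> str:
    """Return the model name, preferring a non-empty environment override."""
    value = os.environ.get(EMBEDDING_MODEL_ENV, "").strip()
    return value or DEFAULT_EMBEDDING_MODEL


def get_embedding_cache_dir() -> Path:
    """Return the cache directory, preferring a non-empty environment override."""
    value = os.environ.get(EMBEDDING_CACHE_ENV, "").strip()
    return Path(value).expanduser() if value else DEFAULT_CACHE_DIR
```

Reading the environment inside each helper, rather than once at import time, means tests and scripts can change an override without reloading the module.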

API and Service Layer:

Added a new FastAPI router for document management (api/documents.py), providing endpoints for importing, listing, retrieving, retrying, and deleting documents, and exposing ingestion job status to clients.
Updated the API and benchmarks code to use the refactored import paths (paperchat.models, paperchat.services, etc.), removing legacy references to paperchat_backend.
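
The five document operations listed above (import, list, retrieve, retry, delete) can be sketched without any web framework. The class below is a hypothetical in-memory stand-in, not the code in api/documents.py; the status names and the rule that only failed documents may be retried are assumptions.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Document:
    id: int
    path: str
    # One status entry per ingestion attempt (assumed status names).
    jobs: list = field(default_factory=list)

    @property
    def status(self):
        """Status of the most recent ingestion attempt."""
        return self.jobs[-1]


class DocumentRegistry:
    """Hypothetical in-memory stand-in for the document endpoints."""

    def __init__(self):
        self._docs = {}
        self._ids = count(1)

    def import_document(self, path):
        doc = Document(id=next(self._ids), path=path, jobs=["queued"])
        self._docs[doc.id] = doc
        return doc

    def list_documents(self):
        return list(self._docs.values())

    def get_document(self, doc_id):
        # A KeyError here would typically map to a 404 in an HTTP layer.
        return self._docs[doc_id]

    def mark(self, doc_id, status):
        """Update the status of the current attempt."""
        self._docs[doc_id].jobs[-1] = status

    def retry_document(self, doc_id):
        doc = self._docs[doc_id]
        if doc.status != "failed":
            raise ValueError("only failed documents can be retried")
        doc.jobs.append("queued")  # a retry starts a new attempt
        return doc

    def delete_document(self, doc_id):
        del self._docs[doc_id]
```

For example, importing a document, marking it failed, and retrying it leaves two entries in its job list, which is the kind of per-attempt status a client could read back.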

Developer Experience and CI:

Improved the pre-commit configuration for Python code quality checks, updating the ruff hooks to use the uv runner directly and refining file matching patterns for better accuracy.
Updated the CI workflow to use a specific version of pnpm (10.30.3) for more reproducible builds. (.github/workflows/ci.yml)

Database Session Management:

Added a get_session_factory() function to db/engine.py to provide a reusable SQLAlchemy session factory, simplifying session management across the backend.

Together, these changes give the backend a tracked document ingestion pipeline and a local embedding runtime, and they complete the move from paperchat_backend to the paperchat package layout.

kpeez added 4 commits March 9, 2026 14:08

fix(backend): finish the paperchat package rename and verification loop

aa870fa

feat(ingestion): add the document registry and local ingestion backbone

9ab2927

test(backend): add Postgres-backed lifecycle and failure-path coverage

779d002

misc updates

2895cba

kpeez merged commit 16bd218 into main Mar 9, 2026
2 checks passed

kpeez deleted the doc-ingest branch March 9, 2026 21:20
