feat(ingestion): add document registry and local ingestion backbone#33

Merged
kpeez merged 4 commits into main from doc-ingest on Mar 9, 2026


@kpeez kpeez commented Mar 9, 2026

This pull request adds the backend's document ingestion backbone and modernizes the codebase around it. The highlights are a database migration for tracking document ingestion jobs, a new Node.js embedding runtime, and a refactor that unifies import paths and configuration management. The most important changes, grouped by theme:

Database and Ingestion Infrastructure:

  • Added a new Alembic migration (0002_ingestion_registry.py) that expands the documents and document_chunks tables with new columns (e.g., parser_id, chunker_id, embedding_model_id, chunk_count, error fields) and introduces a new ingestion_jobs table to track document ingestion attempts and statuses. This migration also adds relevant constraints and indexes for data integrity and performance.
  • Updated imports throughout the backend to use the new paperchat.db.schema path instead of the previous paperchat_backend.db.schema, for consistency with the refactored package structure.
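
To illustrate the idea of an ingestion_jobs table that records each ingestion attempt, here is a minimal sketch using Python's built-in sqlite3 module rather than Alembic. It is not the actual 0002_ingestion_registry.py migration: the column names other than those listed above, the status values, the index name, and the helper functions (open_registry, record_attempt, latest_status) are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: only parser_id, chunker_id, embedding_model_id and
# chunk_count come from the PR description; everything else is assumed.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    parser_id TEXT,
    chunker_id TEXT,
    embedding_model_id TEXT,
    chunk_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE ingestion_jobs (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    status TEXT NOT NULL
        CHECK (status IN ('queued', 'running', 'succeeded', 'failed')),
    error_message TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX ix_ingestion_jobs_document_id ON ingestion_jobs (document_id);
"""


def open_registry(path=":memory:"):
    """Open a database and create the sketch schema."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def record_attempt(conn, document_id, status, error_message=None):
    """Insert one ingestion attempt and return its job id."""
    cur = conn.execute(
        "INSERT INTO ingestion_jobs (document_id, status, error_message) "
        "VALUES (?, ?, ?)",
        (document_id, status, error_message),
    )
    conn.commit()
    return cur.lastrowid


def latest_status(conn, document_id):
    """Return the status of the most recent attempt, or None if there is none."""
    row = conn.execute(
        "SELECT status FROM ingestion_jobs "
        "WHERE document_id = ? ORDER BY id DESC LIMIT 1",
        (document_id,),
    ).fetchone()
    return row[0] if row else None
```

Keeping one row per attempt, instead of a single status column on documents, means a retry leaves the earlier failure and its error message in place for inspection. The CHECK constraint and the index stand in for the "constraints and indexes" the migration is described as adding.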

Embedding Runtime and Configuration:

  • Introduced a new Node.js embedding runtime (embedder.mjs) that uses the node-llama-cpp library to generate text embeddings, along with its own package.json for dependency management. This enables efficient, language-agnostic embedding generation.
  • Expanded configuration in config.py to include embedding model and cache directory settings, with new helper functions for retrieving embedding model names and cache locations, both of which support environment variable overrides.
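
The configuration helpers with environment variable overrides might look roughly like the sketch below. The environment variable names, default values, and function names here are hypothetical and are not taken from config.py; the sketch only shows the pattern of falling back to a default when no override is set.

```python
import os
from pathlib import Path

# All names and defaults below are assumptions for illustration.
EMBEDDING_MODEL_ENV = "PAPERCHAT_EMBEDDING_MODEL"
CACHE_DIR_ENV = "PAPERCHAT_CACHE_DIR"
DEFAULT_EMBEDDING_MODEL = "example-embedding-model"
DEFAULT_CACHE_DIR = "~/.cache/paperchat"


def get_embedding_model_name() -> str:
    """Return the embedding model name, preferring the environment override."""
    override = os.environ.get(EMBEDDING_MODEL_ENV, "").strip()
    return override or DEFAULT_EMBEDDING_MODEL


def get_embedding_cache_dir() -> Path:
    """Return the directory for cached embedding models."""
    override = os.environ.get(CACHE_DIR_ENV, "").strip()
    base = Path(override or DEFAULT_CACHE_DIR).expanduser()
    return base / "embeddings"
```

Reading the environment inside each function, rather than once at import time, lets tests and scripts change the override without reloading the module. An empty or whitespace-only value is treated as unset.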

API and Service Layer:

  • Added a new FastAPI router for document management (api/documents.py), providing endpoints for importing, listing, retrieving, retrying, and deleting documents, and exposing ingestion job status to clients.
  • Updated the API and benchmark code to use the refactored import paths (paperchat.models, paperchat.services, etc.), removing legacy references to paperchat_backend.

Developer Experience and CI:

  • Improved the pre-commit configuration for Python code quality checks, updating the ruff hooks to use the uv runner directly and refining file matching patterns for better accuracy.
  • Updated the CI workflow to use a specific version of pnpm (10.30.3) for more reproducible builds. (.github/workflows/ci.yml)

Database Session Management:

  • Added a get_session_factory() function to db/engine.py to provide a reusable SQLAlchemy session factory, simplifying session management across the backend.

Together, these changes make the backend easier to maintain and extend, particularly around the document ingestion and embedding workflows.

@kpeez kpeez merged commit 16bd218 into main Mar 9, 2026
2 checks passed
@kpeez kpeez deleted the doc-ingest branch March 9, 2026 21:20