SciPaper Hub provides two major components:
- Similarity Service – A FastAPI application that exposes `POST /search` and returns the top-k similar arXiv papers for a supplied URL by querying a Vertex AI Vector Search index.
- Data Pipelines – A set of pipelines to harvest, normalize, and index arXiv metadata into Vertex AI Vector Search.
- Python 3.10+
- Access to Google Cloud APIs (Vertex AI, Cloud Storage)
- The following environment variables must be configured:
  - `PROJECT_ID`: Your Google Cloud project ID.
  - `REGION`: The default region for GCP resources (e.g., `us-central1`).
  - `DATA_BUCKET`: The name of the Google Cloud Storage bucket for pipeline artifacts.
  - `INDEX_ENDPOINT_ID`: The ID of the Vertex AI Vector Search Index Endpoint.
  - `DEPLOYED_INDEX_ID`: The ID of the deployed index on the endpoint.
- The following are optional, but recommended for a full deployment:
  - `VERTEX_LOCATION`: The location of your Vertex AI resources, if different from `REGION`.
  - `B_DEPLOYED_INDEX_ID`: The ID of a second deployed index for A/B testing.
  - `GIT_SHA`: The Git commit SHA, for provenance tracking.
  - `IMAGE_DIGEST`: The container image digest, for provenance tracking.
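For local runs, these can be exported in the shell before starting the service or pipelines. The values below are placeholders and must be replaced with your own project settings:

```bash
# Required
export PROJECT_ID="my-gcp-project"          # placeholder
export REGION="us-central1"
export DATA_BUCKET="my-pipeline-artifacts"  # placeholder
export INDEX_ENDPOINT_ID="1234567890123456" # placeholder
export DEPLOYED_INDEX_ID="papers_index_v1"  # placeholder

# Optional
export VERTEX_LOCATION="$REGION"
export B_DEPLOYED_INDEX_ID="papers_index_v2"            # placeholder
export GIT_SHA="$(git rev-parse HEAD)"
export IMAGE_DIGEST="sha256:<digest-of-deployed-image>" # placeholder
```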
Install dependencies:
```bash
pip install -r requirements.txt
```

Start the service:

```bash
uvicorn service.search_api:app --host 0.0.0.0 --port 8080
```

Send a request:
```bash
curl -X POST http://localhost:8080/search \
  -H "Content-Type: application/json" \
  -d '{"url": "https://arxiv.org/abs/1706.03762", "k": 5}'
```

Quick health checks when deployed (replace `$SERVICE_URL` with your endpoint):
```bash
# The health check endpoint is available at /health and /healthz
curl -s -i "$SERVICE_URL/healthz"
curl -s "$SERVICE_URL/openapi.json" | jq '.paths["/search"]'

# Use a recent, valid paper for testing
curl -s -X POST "$SERVICE_URL/search" \
  -H 'Content-Type: application/json' \
  -d '{"url":"https://arxiv.org/abs/2401.08406","k":5}' | jq .
```
- Harvest – Fetches arXiv Atom entries for target categories and saves the raw feeds to Cloud Storage. The time window can be configured.

  ```bash
  python -m pipelines.harvest --mode incremental --categories "cs.AI cs.LG" --start_offset_days 7
  ```

  The `start_offset_days` parameter controls how many days in the past the harvest begins from. For example, `1` means yesterday.
- Normalize – Deduplicates and transforms the raw XML into a Parquet file with rich metadata.

  ```bash
  python -m pipelines.normalize <snapshot-id>
  ```

  The `<snapshot-id>` is the timestamp identifier generated by the harvest step (e.g., `20240101T000000Z`).
- Indexer – Embeds abstracts and upserts them into the Vertex AI Vector Search index.

  ```bash
  python -m pipelines.indexer <snapshot-id>
  ```

  Note on dimensions: we use `text-embedding-005` as the embedding model, and the Vertex AI Vector Search index must be created and configured with a dimension of 768. If the model is ever changed, the index must be recreated with a matching dimension, or the app will fail the startup dimension check.
Each step accepts optional flags to override defaults; see the module docstrings for details.
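Putting the steps together, an incremental refresh looks like the following; the snapshot ID shown is illustrative and should be replaced with the identifier generated by your harvest run:

```bash
# 1. Harvest the last 7 days of cs.AI and cs.LG entries into Cloud Storage
python -m pipelines.harvest --mode incremental --categories "cs.AI cs.LG" --start_offset_days 7

# 2. Deduplicate and normalize the harvested snapshot into Parquet
python -m pipelines.normalize 20240101T000000Z

# 3. Embed abstracts and upsert them into the Vector Search index
python -m pipelines.indexer 20240101T000000Z
```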
- The arXiv client enforces a 3-second delay between requests to stay within the published API guidelines.
- Embeddings are cached in memory per process to avoid redundant Vertex AI calls.
- Vector metadata stores serialized JSON for structured fields (authors, categories) to support detailed responses in the `/search` API.
The search service supports A/B testing of different deployed vector search indexes. This is controlled by the B_DEPLOYED_INDEX_ID environment variable. If this variable is set, a small percentage of traffic (currently 10%, based on the client's IP address hash) will be routed to the index specified by B_DEPLOYED_INDEX_ID.
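One way to turn the experiment on for a Cloud Run deployment is to set the variable on the service; the index ID below is a placeholder:

```bash
# Point B_DEPLOYED_INDEX_ID at the candidate index to start the 90/10 split
gcloud run services update paperrec-search \
  --region us-central1 \
  --update-env-vars B_DEPLOYED_INDEX_ID=papers_index_v2
```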
In addition to `user_group` ("A" or "B") and `model_version`, the `RECO_RESPONSE` structured logs also include the following provenance fields for tracking and debugging:

- `data_snapshot_id`: The identifier for the ingested data batch.
- `pipeline_git_sha`: The Git SHA of the deployed code.
- `container_image_digest`: The digest of the deployed container image.
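If the service runs on Cloud Run, these entries can be pulled with a simple text filter; this is a sketch, and the filter may need adjusting to match your exact log payload:

```bash
# Fetch the five most recent RECO_RESPONSE log entries from the Cloud Run service
gcloud logging read 'resource.type="cloud_run_revision" AND "RECO_RESPONSE"' \
  --limit=5 --format=json
```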
The GitHub Actions workflows in .github/workflows build the repository and publish the container image to Artifact Registry at us-central1-docker.pkg.dev/paperrec-ai/containers/paperrec-search.
- Service Deployment: The `deploy-service.yml` workflow deploys the image to the `paperrec-search` Cloud Run service.
- Pipeline Job: The `deploy-job.yml` workflow creates or updates a Cloud Run job named `paperrec-search-harvest`. This job is configured to run the `harvest` pipeline for a 90-day window.
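After the workflow has created the job, it can also be executed manually for ad-hoc backfills; the region below is assumed to match the service:

```bash
# Trigger a one-off execution of the harvest job
gcloud run jobs execute paperrec-search-harvest --region us-central1
```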
For a production environment, it is recommended to trigger the Cloud Run job on a schedule using Cloud Scheduler.
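A minimal sketch of such a schedule, assuming a weekly run, a scheduler job named paperrec-harvest-weekly, and a dedicated invoker service account (these names are illustrative, not part of this repository):

```bash
# Run the harvest job every Monday at 06:00 via the Cloud Run Admin API
gcloud scheduler jobs create http paperrec-harvest-weekly \
  --location us-central1 \
  --schedule "0 6 * * 1" \
  --http-method POST \
  --uri "https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/${PROJECT_ID}/jobs/paperrec-search-harvest:run" \
  --oauth-service-account-email "scheduler-invoker@${PROJECT_ID}.iam.gserviceaccount.com"
```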
Refer to the official documentation for Cloud Build, Cloud Run, and Workload Identity Federation for detailed setup steps.
This project is licensed under the MIT License. See LICENSE.