SciPaper Hub

PaperRec: arXiv-powered paper recommender. Harvests cs.AI/cs.LG/cs.CL/cs.RO/cs.CV, embeds abstracts (Gemini), indexes in Vertex AI Vector Search, and serves a /search API on Cloud Run.

SciPaper Hub provides two major components:

  • Similarity Service – A FastAPI application that exposes POST /search and returns the top-k arXiv papers most similar to a supplied arXiv paper URL, found by querying a Vertex AI Vector Search index.
  • Data Pipelines – A set of pipelines to harvest, normalize, and index arXiv metadata into Vertex AI Vector Search.

Getting Started

Requirements

  • Python 3.10+
  • Access to Google Cloud APIs (Vertex AI, Cloud Storage)
  • The following environment variables must be configured:
    • PROJECT_ID: Your Google Cloud project ID.
    • REGION: The default region for GCP resources (e.g., us-central1).
    • DATA_BUCKET: The name of the Google Cloud Storage bucket for pipeline artifacts.
    • INDEX_ENDPOINT_ID: The ID of the Vertex AI Vector Search Index Endpoint.
    • DEPLOYED_INDEX_ID: The ID of the deployed index on the endpoint.
  • The following are optional, but recommended for a full deployment:
    • VERTEX_LOCATION: The location of your Vertex AI resources, if different from REGION.
    • B_DEPLOYED_INDEX_ID: The ID of a second deployed index for A/B testing.
    • GIT_SHA: The Git commit SHA, for provenance tracking.
    • IMAGE_DIGEST: The container image digest, for provenance tracking.
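
A minimal sketch of how these variables might be loaded and validated at startup (illustrative only; the repository's actual configuration code may differ):

import os

REQUIRED_VARS = ["PROJECT_ID", "REGION", "DATA_BUCKET", "INDEX_ENDPOINT_ID", "DEPLOYED_INDEX_ID"]
OPTIONAL_VARS = ["VERTEX_LOCATION", "B_DEPLOYED_INDEX_ID", "GIT_SHA", "IMAGE_DIGEST"]

def load_config() -> dict:
    """Fail fast on missing required variables; apply defaults where one is described above."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    config = {name: os.environ[name] for name in REQUIRED_VARS}
    config.update({name: os.environ.get(name) for name in OPTIONAL_VARS})
    # VERTEX_LOCATION falls back to REGION when not set, as noted above.
    config["VERTEX_LOCATION"] = config["VERTEX_LOCATION"] or config["REGION"]
    return config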

Install dependencies:

pip install -r requirements.txt

Running the similarity API

uvicorn service.search_api:app --host 0.0.0.0 --port 8080

Send a request:

curl -X POST http://localhost:8080/search \
  -H "Content-Type: application/json" \
  -d '{"url": "https://arxiv.org/abs/1706.03762", "k": 5}'

Quick health checks when deployed (replace $SERVICE_URL with your endpoint):

# The health check endpoint is available at both /health and /healthz
curl -s -i "$SERVICE_URL/healthz"

curl -s "$SERVICE_URL/openapi.json" | jq '.paths["/search"]'

# Use a recent, valid paper for testing
curl -s -X POST "$SERVICE_URL/search" \
  -H 'Content-Type: application/json' \
  -d '{"url":"https://arxiv.org/abs/2401.08406","k":5}' | jq .

Data Pipelines

  1. Harvest – Fetches arXiv Atom entries for target categories and saves the raw feeds to Cloud Storage. The time window can be configured.

    python -m pipelines.harvest --mode incremental --categories "cs.AI cs.LG" --start_offset_days 7

    The start_offset_days parameter controls how many days in the past the harvest window begins; for example, a value of 1 starts from yesterday (see the sketch after this list).

  2. Normalize – Deduplicates and transforms the raw XML into a Parquet file with rich metadata.

    python -m pipelines.normalize <snapshot-id>

    The <snapshot-id> is the timestamp identifier generated by the harvest step (e.g., 20240101T000000Z).

  3. Indexer – Embeds abstracts and upserts them into the Vertex AI Vector Search index.

    python -m pipelines.indexer <snapshot-id>

    Note on Dimensions: The embedding model is text-embedding-005, so the Vertex AI Vector Search index must be created with a dimension of 768. If the model is ever changed, the index must be recreated with a matching dimension; otherwise the app fails its startup dimension check.
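
    A quick way to sanity-check the dimension before building or rebuilding an index is to embed one sample text. This is a sketch, assuming the vertexai package from google-cloud-aiplatform and the environment variables above:

    import os
    import vertexai
    from vertexai.language_models import TextEmbeddingModel

    # Initialize against the project/region the pipelines use.
    vertexai.init(project=os.environ["PROJECT_ID"],
                  location=os.environ.get("VERTEX_LOCATION", os.environ["REGION"]))

    model = TextEmbeddingModel.from_pretrained("text-embedding-005")
    [embedding] = model.get_embeddings(["Sample abstract used only to check the embedding dimension."])
    assert len(embedding.values) == 768, f"Index dimension mismatch: got {len(embedding.values)}"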

Each step accepts optional flags to override defaults; see the module docstrings for details.
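
For instance, the harvest window implied by --start_offset_days comes down to simple date arithmetic. The sketch below shows the calculation only; the pipeline's own date handling may differ in detail:

from datetime import datetime, timedelta, timezone

def harvest_start_date(start_offset_days: int) -> str:
    """Day the harvest window begins: today in UTC minus the offset, so 1 means yesterday."""
    return (datetime.now(timezone.utc).date() - timedelta(days=start_offset_days)).isoformat()

print(harvest_start_date(1))  # yesterday
print(harvest_start_date(7))  # one week ago, as in the example above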

Development Notes

  • The arXiv client enforces a 3-second delay between requests to stay within the published API guidelines.
  • Embeddings are cached in memory per process to avoid redundant Vertex AI calls; this and the request delay above are sketched below.
  • Vector metadata stores serialized JSON for structured fields (authors, categories) to support detailed responses in the /search API.
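
A rough sketch of the request delay and the per-process embedding cache (the identifiers here are illustrative, not the repository's actual names):

import time
from functools import lru_cache

ARXIV_DELAY_SECONDS = 3  # minimum spacing between consecutive arXiv API requests

def fetch_all(fetch, urls):
    """Fetch each URL in turn, sleeping between requests to respect the arXiv guideline."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(ARXIV_DELAY_SECONDS)
        results.append(fetch(url))
    return results

@lru_cache(maxsize=None)
def cached_embedding(abstract: str) -> tuple:
    """Embed a given abstract at most once per process; repeats reuse the cached vector."""
    return tuple(_embed_with_vertex(abstract))

def _embed_with_vertex(abstract: str) -> list:
    # Placeholder for the real Vertex AI embedding call.
    return [0.0] * 768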

A/B Testing

The search service supports A/B testing of different deployed vector search indexes. This is controlled by the B_DEPLOYED_INDEX_ID environment variable. If this variable is set, a small percentage of traffic (currently 10%, based on the client's IP address hash) will be routed to the index specified by B_DEPLOYED_INDEX_ID.
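
A deterministic split along these lines (the exact hashing scheme in the service may differ) keeps a given client pinned to the same variant across requests:

import hashlib

B_TRAFFIC_PERCENT = 10  # share of traffic sent to the B index when B_DEPLOYED_INDEX_ID is set

def assign_user_group(client_ip: str) -> str:
    """Map a client IP to group "A" or "B" using a stable hash, so the split is repeatable."""
    bucket = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16) % 100
    return "B" if bucket < B_TRAFFIC_PERCENT else "A"

print(assign_user_group("203.0.113.7"))  # the same IP always lands in the same group

Using hashlib rather than Python's built-in hash keeps the assignment stable across processes and restarts.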

In addition to user_group ("A" or "B") and model_version, the RECO_RESPONSE structured logs also include the following provenance fields for tracking and debugging:

  • data_snapshot_id: The identifier for the ingested data batch.
  • pipeline_git_sha: The Git SHA of the deployed code.
  • container_image_digest: The digest of the deployed container image.
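
Put together, a RECO_RESPONSE entry might look roughly like the following when emitted with Python's standard logging; the field values here are placeholders, and the actual schema is defined by the service:

import json
import logging

logging.basicConfig(level=logging.INFO)

reco_response = {
    "message": "RECO_RESPONSE",
    "user_group": "A",
    "model_version": "<model-version>",
    "data_snapshot_id": "20240101T000000Z",
    "pipeline_git_sha": "<GIT_SHA>",
    "container_image_digest": "<IMAGE_DIGEST>",
}
logging.info(json.dumps(reco_response))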

Deploying on GCP

The GitHub Actions workflows in .github/workflows build the repository and publish the container image to Artifact Registry at us-central1-docker.pkg.dev/paperrec-ai/containers/paperrec-search.

  • Service Deployment: The deploy-service.yml workflow deploys the image to the paperrec-search Cloud Run service.
  • Pipeline Job: The deploy-job.yml workflow creates or updates a Cloud Run job named paperrec-search-harvest. This job is configured to run the harvest pipeline for a 90-day window.

For a production environment, it is recommended to trigger the Cloud Run job on a schedule using Cloud Scheduler.

Refer to the official documentation for Cloud Build, Cloud Run, and Workload Identity Federation for detailed setup steps.

License

This project is licensed under the MIT License. See LICENSE.
