Skip to content

openplanetdata/openplanetdata-r2index

 
 

Repository files navigation

R2 Index

A queryable metadata index for files stored in Cloudflare R2. Built as a Cloudflare Worker with D1 database, it enables searching, filtering, and organizing files by category, entity, tags, and custom metadata — without scanning the object store.

Architecture

Client (Airflow, etc.)
    │
    ├─► Worker API ─► D1 (metadata CRUD/search)
    │
    └─► R2 (direct upload/download via S3-compatible API)

The Worker handles metadata only. File content is uploaded/downloaded directly to R2 using the S3-compatible API.

Who uses it?

Ipregistry Noticeable OpenPlanetData

Setup

1. Install dependencies

npm install

2. Create D1 database

wrangler d1 create r2index

Update wrangler.toml with the returned database_id.

3. Apply schema

npm run db:init

4. Set API token

wrangler secret put R2INDEX_API_TOKEN

5. Deploy

npm run deploy

Configuration

wrangler.toml

name = "r2index"
main = "src/index.ts"
compatibility_date = "2026-01-31"

routes = [
  { pattern = "r2index.acme.com/*", zone_name = "acme.com" }
]

[[d1_databases]]
binding = "D1"
database_name = "r2index"
database_id = "<your-database-id>"

[vars]
CACHE_MAX_AGE = "60"
# DOWNLOADS_RETENTION_DAYS = "365"

Environment Variables

Variable Description Default
CACHE_MAX_AGE Cache-Control max-age in seconds 60
DOWNLOADS_RETENTION_DAYS Days to keep download records before cleanup 365
R2INDEX_API_TOKEN Bearer token for API authentication (set via wrangler secret put) Required

Data Model

Core Fields

Field Description Example
category Product or service grouping acme
entity Specific dataset identifier acme-abuser, acme-geolocation
extension File format csv, csv.zip, mmdb
media_type MIME type text/csv, application/zip
name Human-readable name Abuser, Geolocation

Remote Location (Unique Constraint)

The tuple (bucket, remote_path, remote_filename, remote_version) uniquely identifies a file in R2:

Field Description Example
bucket S3/R2 bucket name my-bucket
remote_path Directory path in R2 acme/abuser
remote_filename File name in R2 abuser.csv
remote_version Version identifier 2026-02-03, v1

Optional Metadata

Field Description
checksum_md5 MD5 hash
checksum_sha1 SHA1 hash
checksum_sha256 SHA256 hash
checksum_sha512 SHA512 hash
deprecated Boolean flag
deprecation_reason Reason for deprecation
extra Arbitrary JSON (e.g., header_line, line_count)
metadata_path Path to associated metadata file
size File size in bytes
tags Array of tags for filtering

API Reference

All endpoints require authentication via Authorization: Bearer <token> header.

Health Check

GET /health

Returns { "status": "ok" }. No authentication required.

Create/Update File (Upsert)

POST /files

Creates or updates a file based on the unique constraint (bucket, remote_path, remote_filename, remote_version).

Request Body:

{
  "bucket": "my-bucket",
  "category": "acme",
  "entity": "acme-abuser",
  "extension": "csv",
  "media_type": "text/csv",
  "name": "Abuser",
  "remote_path": "acme/abuser",
  "remote_filename": "abuser.csv",
  "remote_version": "2026-02-03",
  "size": 5023465,
  "checksum_md5": "21a165f3ddef92b90dccb0c1bb4e249f",
  "checksum_sha1": "b588c39c691a2bc2cdd81e9f826ae9b5eb163e39",
  "checksum_sha256": "8dac526e40c250f3ad117d05452e04814e2c979754a2e4810d8f85413d188ba6",
  "checksum_sha512": "0f4bdedf66e5ec214aa1302d624913c2137c9cbfe1f81c0a63138c9ddd69d0c0",
  "extra": {
    "header_line": "# ip_start,ip_end",
    "line_count": 169964
  },
  "tags": ["ip", "security"]
}
Field Type Required Description
bucket string Yes S3/R2 bucket name
category string Yes Product or service grouping (e.g., acme)
checksum_md5 string No MD5 hash
checksum_sha1 string No SHA1 hash
checksum_sha256 string No SHA256 hash
checksum_sha512 string No SHA512 hash
entity string Yes Dataset identifier (e.g., acme-abuser)
extension string Yes File format (e.g., csv, mmdb)
extra object No Arbitrary JSON (merged into nested index output)
media_type string Yes MIME type (e.g., text/csv)
metadata_path string No Path to associated metadata file
name string No Human-readable name (e.g., Abuser)
remote_filename string Yes Filename in R2
remote_path string Yes Directory path in R2
remote_version string Yes Version identifier (e.g., 2026-02-03)
size integer No File size in bytes
tags string[] No Tags for filtering

Response: 201 Created (new) or 200 OK (updated) with file record (includes auto-generated id).

Get File

GET /files/:id

Response: 200 OK with file record or 404 Not Found.

Get File by Remote Tuple

GET /files/by-tuple

Retrieves a file by its unique remote tuple.

Query Parameters:

Parameter Type Required Description
bucket string Yes S3/R2 bucket name
remote_path string Yes Directory path in R2
remote_filename string Yes Filename in R2
remote_version string Yes Version identifier

Example Request:

curl "https://r2index.acme.com/files/by-tuple?bucket=my-bucket&remote_path=acme/abuser&remote_filename=abuser.csv&remote_version=2026-02-03"

Response: 200 OK with file record or 404 Not Found.

Update File

PUT /files/:id

Request Body: Any subset of file fields to update.

Response: 200 OK with updated file record.

Delete File by ID

DELETE /files/:id

Removes file metadata from the index. Does not delete the actual file in R2.

Response: 200 OK with { "success": true } or 404 Not Found.

Delete File by Remote Tuple

DELETE /files

Removes file metadata from the index. Does not delete the actual file in R2.

Request Body:

{
  "bucket": "my-bucket",
  "remote_path": "acme/abuser",
  "remote_filename": "abuser.csv",
  "remote_version": "2026-02-03"
}

Response: 200 OK with { "success": true } or 404 Not Found.

Search Files

GET /files

Query Parameters:

Parameter Type Description
bucket string Filter by bucket (exact match)
category string Filter by category (exact match)
deprecated boolean Filter by deprecated status (true or false)
entity string Filter by entity (exact match)
extension string Filter by extension (exact match)
limit integer Max results (default: 100, max: 1000)
media_type string Filter by media type (exact match)
offset integer Pagination offset (default: 0)
tags string Filter by tags (comma-separated, must have ALL)
group_by string Group results by field: bucket, category, entity, extension, media_type, deprecated

Example Requests:

# Get all files
curl "https://r2index.acme.com/files"

# Filter by category
curl "https://r2index.acme.com/files?category=acme"

# Filter by category and entity
curl "https://r2index.acme.com/files?category=acme&entity=acme-abuser"

# Filter by extension
curl "https://r2index.acme.com/files?extension=csv"

# Filter by tags (must have ALL specified tags)
curl "https://r2index.acme.com/files?tags=ip,security"

# Filter non-deprecated files only
curl "https://r2index.acme.com/files?deprecated=false"

# Combine filters with pagination
curl "https://r2index.acme.com/files?category=acme&extension=csv&limit=50&offset=0"

# Group by extension for a given category
curl "https://r2index.acme.com/files?category=acme&group_by=extension"

Grouped Response (when using group_by):

{
  "groups": [
    { "value": "csv", "count": 14 },
    { "value": "csv.zip", "count": 14 },
    { "value": "mmdb", "count": 14 }
  ],
  "total": 42
}

Response:

{
  "files": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "bucket": "my-bucket",
      "name": "Abuser",
      "category": "acme",
      "entity": "acme-abuser",
      "extension": "csv",
      "media_type": "text/csv",
      "remote_path": "acme/abuser",
      "remote_filename": "abuser.csv",
      "remote_version": "2026-02-03",
      "metadata_path": null,
      "size": 5023465,
      "checksum_md5": "21a165f3ddef92b90dccb0c1bb4e249f",
      "checksum_sha1": "b588c39c691a2bc2cdd81e9f826ae9b5eb163e39",
      "checksum_sha256": "8dac526e40c250f3ad117d05452e04814e2c979754a2e4810d8f85413d188ba6",
      "checksum_sha512": "0f4bdedf66e5ec214aa1302d624913c2137c9cbfe1f81c0a63138c9ddd69d0c0",
      "extra": {
        "header_line": "# ip_start,ip_end",
        "line_count": 169964
      },
      "deprecated": false,
      "deprecation_reason": "",
      "created": 1706918150000,
      "updated": 1706918150000,
      "tags": ["ip", "security"]
    }
  ],
  "total": 1
}

Get Nested Index

GET /files/index

Returns files grouped by entity then by extension in a nested structure. Useful for generating compatibility indexes.

Query Parameters: Same filters as Search Files (except limit, offset, group_by).

Example Request:

curl "https://r2index.acme.com/files/index?category=acme"

Response:

{
  "acme-abuser": {
    "csv": {
      "checksums": {
        "md5": "21a165f3ddef92b90dccb0c1bb4e249f",
        "sha1": "b588c39c691a2bc2cdd81e9f826ae9b5eb163e39",
        "sha256": "8dac526e40c250f3ad117d05452e04814e2c979754a2e4810d8f85413d188ba6",
        "sha512": "0f4bdedf66e5ec214aa1302d624913c2137c9cbfe1f81c0a63138c9ddd69d0c0c8ffc5f296a3c6d9c4256b86ca2a18b08c34e7f6897d152c16dde6526a07461f"
      },
      "file_size": "5023465",
      "last_updated": "2026-02-03T18:55:50.000Z",
      "name": "Abuser",
      "header_line": "# ip_start,ip_end",
      "line_count": 169964
    },
    "mmdb": {
      "checksums": {
        "md5": "2cf1f2d9ae301714b7ed7979553c76be"
      },
      "file_size": "10573656",
      "last_updated": "2026-02-03T18:55:50.000Z"
    }
  },
  "acme-as": {
    "csv": {
      "checksums": {
        "md5": "8356651e9e7bfdbf94fa41c9911d9cdf"
      },
      "file_size": "7355656",
      "last_updated": "2026-02-03T18:54:02.000Z",
      "name": "AS"
    }
  }
}

Extra fields from the extra JSON column are merged into each entry.

Record Download

POST /downloads

Records a file download event for analytics tracking.

Request Body:

{
  "bucket": "my-bucket",
  "remote_path": "acme/abuser",
  "remote_filename": "abuser.csv",
  "remote_version": "2026-02-03",
  "ip_address": "192.168.1.1",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
}
Field Type Required Description
bucket string Yes S3/R2 bucket name
ip_address string Yes Client IP address (IPv4 or IPv6)
remote_filename string Yes File name in R2
remote_path string Yes Directory path in R2
remote_version string Yes Version identifier
user_agent string No Client user agent string

Example Request:

curl -X POST "https://r2index.acme.com/downloads" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "my-bucket",
    "remote_path": "acme/abuser",
    "remote_filename": "abuser.csv",
    "remote_version": "2026-02-03",
    "ip_address": "192.168.1.1",
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
  }'

Response: 201 Created with download record including pre-computed time buckets.

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "bucket": "my-bucket",
  "remote_path": "acme/abuser",
  "remote_filename": "abuser.csv",
  "remote_version": "2026-02-03",
  "ip_address": "192.168.1.1",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
  "downloaded_at": 1706918150000,
  "hour_bucket": 1706914800000,
  "day_bucket": 1706832000000,
  "month_bucket": 202402
}

Analytics: Time Series

GET /analytics/timeseries

Returns download counts over time, grouped by hour, day, or month.

Query Parameters:

Parameter Type Required Description
bucket string No Filter by bucket
end integer Yes End timestamp (ms)
limit integer No Max files per bucket (default: 100, max: 1000)
remote_filename string No Filter by remote filename
remote_path string No Filter by remote path
remote_version string No Filter by remote version
scale string No Time bucket: hour, day, month (default: day)
start integer Yes Start timestamp (ms)

Example Request:

curl "https://r2index.acme.com/analytics/timeseries?start=1704067200000&end=1706745600000&scale=day"

Response:

{
  "scale": "day",
  "buckets": [
    {
      "timestamp": 1704067200000,
      "files": [
        {
          "id": "550e8400-e29b-41d4-a716-446655440000",
          "bucket": "my-bucket",
          "remote_path": "acme/abuser",
          "remote_filename": "abuser.csv",
          "remote_version": "2026-02-03",
          "downloads": 100,
          "unique_downloads": 30
        },
        {
          "id": "550e8400-e29b-41d4-a716-446655440001",
          "bucket": "my-bucket",
          "remote_path": "acme/geolocation",
          "remote_filename": "geolocation.mmdb",
          "remote_version": "2026-02-03",
          "downloads": 50,
          "unique_downloads": 15
        }
      ],
      "total_downloads": 150,
      "total_unique_downloads": 45
    }
  ],
  "period": { "start": 1704067200000, "end": 1706745600000 }
}
  • timestamp: Start of the time bucket (e.g., for scale=day, midnight UTC of that day; for scale=month, YYYYMM integer like 202401)
  • id: File ID from the index (null if file not in index)

Analytics: Summary

GET /analytics/summary

Returns aggregate statistics for a time period.

Query Parameters: Same as Time Series (start, end, file filters).

Example Request:

curl "https://r2index.acme.com/analytics/summary?start=1704067200000&end=1706745600000"

Response:

{
  "total_downloads": 1234,
  "unique_downloads": 567,
  "top_user_agents": [
    { "user_agent": "Chrome/120", "downloads": 500 },
    { "user_agent": "Safari/17", "downloads": 300 }
  ],
  "period": { "start": 1704067200000, "end": 1706745600000 }
}

Analytics: By IP

GET /analytics/by-ip

Returns downloads for a specific IP address.

Query Parameters:

Parameter Type Required Description
end integer Yes End timestamp (ms)
ip string Yes IP address to search
limit integer No Max results (default: 100, max: 1000)
offset integer No Pagination offset
start integer Yes Start timestamp (ms)

Example Request:

curl "https://r2index.acme.com/analytics/by-ip?ip=192.168.1.1&start=1704067200000&end=1706745600000"

Response:

{
  "downloads": [
    {
      "bucket": "my-bucket",
      "remote_path": "acme/abuser",
      "remote_filename": "abuser.csv",
      "remote_version": "2026-02-03",
      "downloaded_at": 1704067200000,
      "user_agent": "Chrome/120"
    }
  ],
  "total": 45
}

Analytics: User Agents

GET /analytics/user-agents

Returns download statistics grouped by user agent.

Query Parameters: Same as Time Series, plus limit (default: 20, max: 100).

Example Request:

curl "https://r2index.acme.com/analytics/user-agents?start=1704067200000&end=1706745600000&limit=10"

Response:

{
  "user_agents": [
    { "user_agent": "Chrome/120", "downloads": 500, "unique_ips": 234 },
    { "user_agent": "Safari/17", "downloads": 300, "unique_ips": 156 }
  ],
  "period": { "start": 1704067200000, "end": 1706745600000 }
}

Maintenance: Cleanup Downloads

POST /maintenance/cleanup-downloads

Deletes download records older than DOWNLOADS_RETENTION_DAYS (default: 365 days). Call this endpoint periodically (e.g., daily via cron or Cloudflare Cron Triggers) to keep the database size manageable.

Example Request:

curl -X POST "https://r2index.acme.com/maintenance/cleanup-downloads" \
  -H "Authorization: Bearer <token>"

Response:

{
  "deleted_count": 1234,
  "retention_days": 365
}

Cloudflare Cron Trigger Example:

Add to wrangler.toml:

[triggers]
crons = ["0 2 * * *"]  # Run daily at 2 AM UTC

Then handle the scheduled event in your worker:

export default {
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
    const deleted = await cleanupOldDownloads(env.DB, parseInt(env.DOWNLOADS_RETENTION_DAYS || '365', 10));
    console.log(`Cleanup: deleted ${deleted} old download records`);
  },
  // ... fetch handler
};

Database Schema

files

Column Type Description
bucket TEXT S3/R2 bucket name
category TEXT File category
checksum_md5 TEXT MD5 checksum
checksum_sha1 TEXT SHA1 checksum
checksum_sha256 TEXT SHA256 checksum
checksum_sha512 TEXT SHA512 checksum
created INTEGER Creation timestamp (ms)
deprecated INTEGER Deprecation flag (returns as boolean)
deprecation_reason TEXT Reason for deprecation
entity TEXT Entity type
extension TEXT File extension
extra TEXT JSON metadata
id TEXT Primary key (auto-generated UUID)
media_type TEXT MIME type
metadata_path TEXT Path to metadata file
name TEXT Human-readable name
remote_filename TEXT Filename in R2
remote_path TEXT Path in R2 bucket
remote_version TEXT Version identifier
size INTEGER File size in bytes
updated INTEGER Last update timestamp (ms)

Unique Constraint: (bucket, remote_path, remote_filename, remote_version)

file_tags

Column Type Description
file_id TEXT Foreign key to files.id
tag TEXT Tag value

Primary Key: (file_id, tag)

file_downloads

Column Type Description
bucket TEXT S3/R2 bucket name
day_bucket INTEGER Pre-computed day bucket for fast aggregation
downloaded_at INTEGER Download timestamp (ms)
hour_bucket INTEGER Pre-computed hour bucket for fast aggregation
id TEXT Primary key (auto-generated UUID)
ip_address TEXT Client IP address
month_bucket INTEGER Pre-computed month bucket (YYYYMM format)
remote_filename TEXT Filename in R2
remote_path TEXT Path in R2 bucket
remote_version TEXT Version identifier
user_agent TEXT Client user agent

Indexes: day_bucket, hour_bucket, month_bucket, (bucket, remote_path, remote_filename, remote_version, day_bucket), (ip_address, day_bucket)

Development

# Run locally
npm run dev

# Run unit tests
npm test

# Run unit tests in watch mode
npm run test:watch

# Type check
npx tsc --noEmit

# Deploy
npm run deploy

# Run e2e tests (API only)
python e2e_test.py <api_url> <api_token>

# Run e2e tests with R2 upload/download (includes 5GB large file test)
python e2e_test.py <api_url> <api_token> <r2_access_key_id> <r2_secret_access_key> <r2_account_id>

E2E Tests with Bao

# API-only e2e tests
python e2e_test.py \
  $(bao kv get -field=api-url -namespace=elaunira/production kv/cloudflare/r2index) \
  $(bao kv get -field=api-token -namespace=elaunira/production kv/cloudflare/r2index)

# Full e2e tests including R2 upload/download and 5GB large file test
python e2e_test.py \
  $(bao kv get -field=api-url -namespace=elaunira/production kv/cloudflare/r2index) \
  $(bao kv get -field=api-token -namespace=elaunira/production kv/cloudflare/r2index) \
  $(bao kv get -field=access-key-id -namespace=elaunira/production kv/cloudflare/r2/e2e-tests) \
  $(bao kv get -field=secret-access-key -namespace=elaunira/production kv/cloudflare/r2/e2e-tests) \
  $(bao kv get -field=account-id -namespace=elaunira/production kv/cloudflare/r2/e2e-tests)

License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 53.7%
  • TypeScript 46.0%
  • Shell 0.3%