R2 Index

A queryable metadata index for files stored in Cloudflare R2. Built as a Cloudflare Worker with D1 database, it enables searching, filtering, and organizing files by category, entity, tags, and custom metadata — without scanning the object store.

Architecture

Client (Airflow, etc.)
    │
    ├─► Worker API ─► D1 (metadata CRUD/search)
    │
    └─► R2 (direct upload/download via S3-compatible API)

The Worker handles metadata only. File content is uploaded/downloaded directly to R2 using the S3-compatible API.

Who uses it?


Ipregistry	Noticeable	OpenPlanetData

Setup

1. Install dependencies

npm install

2. Create D1 database

wrangler d1 create r2index

Update wrangler.toml with the returned database_id.

3. Apply schema

npm run db:init

4. Set API token

wrangler secret put R2INDEX_API_TOKEN

5. Deploy

npm run deploy

Configuration

wrangler.toml

name = "r2index"
main = "src/index.ts"
compatibility_date = "2026-01-31"

routes = [
  { pattern = "r2index.acme.com/*", zone_name = "acme.com" }
]

[[d1_databases]]
binding = "D1"
database_name = "r2index"
database_id = "<your-database-id>"

[vars]
CACHE_MAX_AGE = "60"
# DOWNLOADS_RETENTION_DAYS = "365"

Environment Variables

Variable	Description	Default
`CACHE_MAX_AGE`	Cache-Control max-age in seconds	`60`
`DOWNLOADS_RETENTION_DAYS`	Days to keep download records before cleanup	`365`
`R2INDEX_API_TOKEN`	Bearer token for API authentication (set via `wrangler secret put`)	Required

Data Model

Core Fields

Field	Description	Example
`category`	Product or service grouping	`acme`
`entity`	Specific dataset identifier	`acme-abuser`, `acme-geolocation`
`extension`	File format	`csv`, `csv.zip`, `mmdb`
`media_type`	MIME type	`text/csv`, `application/zip`
`name`	Human-readable name	`Abuser`, `Geolocation`

Remote Location (Unique Constraint)

The tuple (bucket, remote_path, remote_filename, remote_version) uniquely identifies a file in R2:

Field	Description	Example
`bucket`	S3/R2 bucket name	`my-bucket`
`remote_path`	Directory path in R2	`acme/abuser`
`remote_filename`	File name in R2	`abuser.csv`
`remote_version`	Version identifier	`2026-02-03`, `v1`

Optional Metadata

Field	Description
`checksum_md5`	MD5 hash
`checksum_sha1`	SHA1 hash
`checksum_sha256`	SHA256 hash
`checksum_sha512`	SHA512 hash
`deprecated`	Boolean flag
`deprecation_reason`	Reason for deprecation
`extra`	Arbitrary JSON (e.g., `header_line`, `line_count`)
`metadata_path`	Path to associated metadata file
`size`	File size in bytes
`tags`	Array of tags for filtering

API Reference

All endpoints require authentication via Authorization: Bearer <token> header.

Health Check

GET /health

Returns { "status": "ok" }. No authentication required.

Create/Update File (Upsert)

POST /files

Creates or updates a file based on the unique constraint (bucket, remote_path, remote_filename, remote_version).

Request Body:

{
  "bucket": "my-bucket",
  "category": "acme",
  "entity": "acme-abuser",
  "extension": "csv",
  "media_type": "text/csv",
  "name": "Abuser",
  "remote_path": "acme/abuser",
  "remote_filename": "abuser.csv",
  "remote_version": "2026-02-03",
  "size": 5023465,
  "checksum_md5": "21a165f3ddef92b90dccb0c1bb4e249f",
  "checksum_sha1": "b588c39c691a2bc2cdd81e9f826ae9b5eb163e39",
  "checksum_sha256": "8dac526e40c250f3ad117d05452e04814e2c979754a2e4810d8f85413d188ba6",
  "checksum_sha512": "0f4bdedf66e5ec214aa1302d624913c2137c9cbfe1f81c0a63138c9ddd69d0c0",
  "extra": {
    "header_line": "# ip_start,ip_end",
    "line_count": 169964
  },
  "tags": ["ip", "security"]
}

Field	Type	Required	Description
`bucket`	string	Yes	S3/R2 bucket name
`category`	string	Yes	Product or service grouping (e.g., `acme`)
`checksum_md5`	string	No	MD5 hash
`checksum_sha1`	string	No	SHA1 hash
`checksum_sha256`	string	No	SHA256 hash
`checksum_sha512`	string	No	SHA512 hash
`entity`	string	Yes	Dataset identifier (e.g., `acme-abuser`)
`extension`	string	Yes	File format (e.g., `csv`, `mmdb`)
`extra`	object	No	Arbitrary JSON (merged into nested index output)
`media_type`	string	Yes	MIME type (e.g., `text/csv`)
`metadata_path`	string	No	Path to associated metadata file
`name`	string	No	Human-readable name (e.g., `Abuser`)
`remote_filename`	string	Yes	Filename in R2
`remote_path`	string	Yes	Directory path in R2
`remote_version`	string	Yes	Version identifier (e.g., `2026-02-03`)
`size`	integer	No	File size in bytes
`tags`	string[]	No	Tags for filtering

Response: 201 Created (new) or 200 OK (updated) with file record (includes auto-generated id).

Get File

GET /files/:id

Response: 200 OK with file record or 404 Not Found.

Get File by Remote Tuple

GET /files/by-tuple

Retrieves a file by its unique remote tuple.

Query Parameters:

Parameter	Type	Required	Description
`bucket`	string	Yes	S3/R2 bucket name
`remote_path`	string	Yes	Directory path in R2
`remote_filename`	string	Yes	Filename in R2
`remote_version`	string	Yes	Version identifier

Example Request:

curl "https://r2index.acme.com/files/by-tuple?bucket=my-bucket&remote_path=acme/abuser&remote_filename=abuser.csv&remote_version=2026-02-03"

Response: 200 OK with file record or 404 Not Found.

Update File

PUT /files/:id

Request Body: Any subset of file fields to update.

Response: 200 OK with updated file record.

Delete File by ID

DELETE /files/:id

Removes file metadata from the index. Does not delete the actual file in R2.

Response: 200 OK with { "success": true } or 404 Not Found.

Delete File by Remote Tuple

DELETE /files

Removes file metadata from the index. Does not delete the actual file in R2.

Request Body:

{
  "bucket": "my-bucket",
  "remote_path": "acme/abuser",
  "remote_filename": "abuser.csv",
  "remote_version": "2026-02-03"
}

Response: 200 OK with { "success": true } or 404 Not Found.

Search Files

GET /files

Query Parameters:

Parameter	Type	Description
`bucket`	string	Filter by bucket (exact match)
`category`	string	Filter by category (exact match)
`deprecated`	boolean	Filter by deprecated status (`true` or `false`)
`entity`	string	Filter by entity (exact match)
`extension`	string	Filter by extension (exact match)
`limit`	integer	Max results (default: 100, max: 1000)
`media_type`	string	Filter by media type (exact match)
`offset`	integer	Pagination offset (default: 0)
`tags`	string	Filter by tags (comma-separated, must have ALL)
`group_by`	string	Group results by field: `bucket`, `category`, `entity`, `extension`, `media_type`, `deprecated`

Example Requests:

# Get all files
curl "https://r2index.acme.com/files"

# Filter by category
curl "https://r2index.acme.com/files?category=acme"

# Filter by category and entity
curl "https://r2index.acme.com/files?category=acme&entity=acme-abuser"

# Filter by extension
curl "https://r2index.acme.com/files?extension=csv"

# Filter by tags (must have ALL specified tags)
curl "https://r2index.acme.com/files?tags=ip,security"

# Filter non-deprecated files only
curl "https://r2index.acme.com/files?deprecated=false"

# Combine filters with pagination
curl "https://r2index.acme.com/files?category=acme&extension=csv&limit=50&offset=0"

# Group by extension for a given category
curl "https://r2index.acme.com/files?category=acme&group_by=extension"

Grouped Response (when using group_by):

{
  "groups": [
    { "value": "csv", "count": 14 },
    { "value": "csv.zip", "count": 14 },
    { "value": "mmdb", "count": 14 }
  ],
  "total": 42
}

Response:

{
  "files": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "bucket": "my-bucket",
      "name": "Abuser",
      "category": "acme",
      "entity": "acme-abuser",
      "extension": "csv",
      "media_type": "text/csv",
      "remote_path": "acme/abuser",
      "remote_filename": "abuser.csv",
      "remote_version": "2026-02-03",
      "metadata_path": null,
      "size": 5023465,
      "checksum_md5": "21a165f3ddef92b90dccb0c1bb4e249f",
      "checksum_sha1": "b588c39c691a2bc2cdd81e9f826ae9b5eb163e39",
      "checksum_sha256": "8dac526e40c250f3ad117d05452e04814e2c979754a2e4810d8f85413d188ba6",
      "checksum_sha512": "0f4bdedf66e5ec214aa1302d624913c2137c9cbfe1f81c0a63138c9ddd69d0c0",
      "extra": {
        "header_line": "# ip_start,ip_end",
        "line_count": 169964
      },
      "deprecated": false,
      "deprecation_reason": "",
      "created": 1706918150000,
      "updated": 1706918150000,
      "tags": ["ip", "security"]
    }
  ],
  "total": 1
}

Get Nested Index

GET /files/index

Returns files grouped by entity then by extension in a nested structure. Useful for generating compatibility indexes.

Query Parameters: Same filters as Search Files (except limit, offset, group_by).

Example Request:

curl "https://r2index.acme.com/files/index?category=acme"

Response:

{
  "acme-abuser": {
    "csv": {
      "checksums": {
        "md5": "21a165f3ddef92b90dccb0c1bb4e249f",
        "sha1": "b588c39c691a2bc2cdd81e9f826ae9b5eb163e39",
        "sha256": "8dac526e40c250f3ad117d05452e04814e2c979754a2e4810d8f85413d188ba6",
        "sha512": "0f4bdedf66e5ec214aa1302d624913c2137c9cbfe1f81c0a63138c9ddd69d0c0c8ffc5f296a3c6d9c4256b86ca2a18b08c34e7f6897d152c16dde6526a07461f"
      },
      "file_size": "5023465",
      "last_updated": "2026-02-03T18:55:50.000Z",
      "name": "Abuser",
      "header_line": "# ip_start,ip_end",
      "line_count": 169964
    },
    "mmdb": {
      "checksums": {
        "md5": "2cf1f2d9ae301714b7ed7979553c76be"
      },
      "file_size": "10573656",
      "last_updated": "2026-02-03T18:55:50.000Z"
    }
  },
  "acme-as": {
    "csv": {
      "checksums": {
        "md5": "8356651e9e7bfdbf94fa41c9911d9cdf"
      },
      "file_size": "7355656",
      "last_updated": "2026-02-03T18:54:02.000Z",
      "name": "AS"
    }
  }
}

Extra fields from the extra JSON column are merged into each entry.

Record Download

POST /downloads

Records a file download event for analytics tracking.

Request Body:

{
  "bucket": "my-bucket",
  "remote_path": "acme/abuser",
  "remote_filename": "abuser.csv",
  "remote_version": "2026-02-03",
  "ip_address": "192.168.1.1",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
}

Field	Type	Required	Description
`bucket`	string	Yes	S3/R2 bucket name
`ip_address`	string	Yes	Client IP address (IPv4 or IPv6)
`remote_filename`	string	Yes	File name in R2
`remote_path`	string	Yes	Directory path in R2
`remote_version`	string	Yes	Version identifier
`user_agent`	string	No	Client user agent string

Example Request:

curl -X POST "https://r2index.acme.com/downloads" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "my-bucket",
    "remote_path": "acme/abuser",
    "remote_filename": "abuser.csv",
    "remote_version": "2026-02-03",
    "ip_address": "192.168.1.1",
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
  }'

Response: 201 Created with download record including pre-computed time buckets.

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "bucket": "my-bucket",
  "remote_path": "acme/abuser",
  "remote_filename": "abuser.csv",
  "remote_version": "2026-02-03",
  "ip_address": "192.168.1.1",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
  "downloaded_at": 1706918150000,
  "hour_bucket": 1706914800000,
  "day_bucket": 1706832000000,
  "month_bucket": 202402
}

Analytics: Time Series

GET /analytics/timeseries

Returns download counts over time, grouped by hour, day, or month.

Query Parameters:

Parameter	Type	Required	Description
`bucket`	string	No	Filter by bucket
`end`	integer	Yes	End timestamp (ms)
`limit`	integer	No	Max files per bucket (default: 100, max: 1000)
`remote_filename`	string	No	Filter by remote filename
`remote_path`	string	No	Filter by remote path
`remote_version`	string	No	Filter by remote version
`scale`	string	No	Time bucket: `hour`, `day`, `month` (default: `day`)
`start`	integer	Yes	Start timestamp (ms)

Example Request:

curl "https://r2index.acme.com/analytics/timeseries?start=1704067200000&end=1706745600000&scale=day"

Response:

{
  "scale": "day",
  "buckets": [
    {
      "timestamp": 1704067200000,
      "files": [
        {
          "id": "550e8400-e29b-41d4-a716-446655440000",
          "bucket": "my-bucket",
          "remote_path": "acme/abuser",
          "remote_filename": "abuser.csv",
          "remote_version": "2026-02-03",
          "downloads": 100,
          "unique_downloads": 30
        },
        {
          "id": "550e8400-e29b-41d4-a716-446655440001",
          "bucket": "my-bucket",
          "remote_path": "acme/geolocation",
          "remote_filename": "geolocation.mmdb",
          "remote_version": "2026-02-03",
          "downloads": 50,
          "unique_downloads": 15
        }
      ],
      "total_downloads": 150,
      "total_unique_downloads": 45
    }
  ],
  "period": { "start": 1704067200000, "end": 1706745600000 }
}

timestamp: Start of the time bucket (e.g., for scale=day, midnight UTC of that day; for scale=month, YYYYMM integer like 202401)
id: File ID from the index (null if file not in index)

Analytics: Summary

GET /analytics/summary

Returns aggregate statistics for a time period.

Query Parameters: Same as Time Series (start, end, file filters).

Example Request:

curl "https://r2index.acme.com/analytics/summary?start=1704067200000&end=1706745600000"

Response:

{
  "total_downloads": 1234,
  "unique_downloads": 567,
  "top_user_agents": [
    { "user_agent": "Chrome/120", "downloads": 500 },
    { "user_agent": "Safari/17", "downloads": 300 }
  ],
  "period": { "start": 1704067200000, "end": 1706745600000 }
}

Analytics: By IP

GET /analytics/by-ip

Returns downloads for a specific IP address.

Query Parameters:

Parameter	Type	Required	Description
`end`	integer	Yes	End timestamp (ms)
`ip`	string	Yes	IP address to search
`limit`	integer	No	Max results (default: 100, max: 1000)
`offset`	integer	No	Pagination offset
`start`	integer	Yes	Start timestamp (ms)

Example Request:

curl "https://r2index.acme.com/analytics/by-ip?ip=192.168.1.1&start=1704067200000&end=1706745600000"

Response:

{
  "downloads": [
    {
      "bucket": "my-bucket",
      "remote_path": "acme/abuser",
      "remote_filename": "abuser.csv",
      "remote_version": "2026-02-03",
      "downloaded_at": 1704067200000,
      "user_agent": "Chrome/120"
    }
  ],
  "total": 45
}

Analytics: User Agents

GET /analytics/user-agents

Returns download statistics grouped by user agent.

Query Parameters: Same as Time Series, plus limit (default: 20, max: 100).

Example Request:

curl "https://r2index.acme.com/analytics/user-agents?start=1704067200000&end=1706745600000&limit=10"

Response:

{
  "user_agents": [
    { "user_agent": "Chrome/120", "downloads": 500, "unique_ips": 234 },
    { "user_agent": "Safari/17", "downloads": 300, "unique_ips": 156 }
  ],
  "period": { "start": 1704067200000, "end": 1706745600000 }
}

Maintenance: Cleanup Downloads

POST /maintenance/cleanup-downloads

Deletes download records older than DOWNLOADS_RETENTION_DAYS (default: 365 days). Call this endpoint periodically (e.g., daily via cron or Cloudflare Cron Triggers) to keep the database size manageable.

Example Request:

curl -X POST "https://r2index.acme.com/maintenance/cleanup-downloads" \
  -H "Authorization: Bearer <token>"

Response:

{
  "deleted_count": 1234,
  "retention_days": 365
}

Cloudflare Cron Trigger Example:

Add to wrangler.toml:

[triggers]
crons = ["0 2 * * *"]  # Run daily at 2 AM UTC

Then handle the scheduled event in your worker:

export default {
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
    const deleted = await cleanupOldDownloads(env.DB, parseInt(env.DOWNLOADS_RETENTION_DAYS || '365', 10));
    console.log(`Cleanup: deleted ${deleted} old download records`);
  },
  // ... fetch handler
};

Database Schema

files

Column	Type	Description
`bucket`	TEXT	S3/R2 bucket name
`category`	TEXT	File category
`checksum_md5`	TEXT	MD5 checksum
`checksum_sha1`	TEXT	SHA1 checksum
`checksum_sha256`	TEXT	SHA256 checksum
`checksum_sha512`	TEXT	SHA512 checksum
`created`	INTEGER	Creation timestamp (ms)
`deprecated`	INTEGER	Deprecation flag (returns as boolean)
`deprecation_reason`	TEXT	Reason for deprecation
`entity`	TEXT	Entity type
`extension`	TEXT	File extension
`extra`	TEXT	JSON metadata
`id`	TEXT	Primary key (auto-generated UUID)
`media_type`	TEXT	MIME type
`metadata_path`	TEXT	Path to metadata file
`name`	TEXT	Human-readable name
`remote_filename`	TEXT	Filename in R2
`remote_path`	TEXT	Path in R2 bucket
`remote_version`	TEXT	Version identifier
`size`	INTEGER	File size in bytes
`updated`	INTEGER	Last update timestamp (ms)

Unique Constraint: (bucket, remote_path, remote_filename, remote_version)

file_tags

Column	Type	Description
`file_id`	TEXT	Foreign key to files.id
`tag`	TEXT	Tag value

Primary Key: (file_id, tag)

file_downloads

Column	Type	Description
`bucket`	TEXT	S3/R2 bucket name
`day_bucket`	INTEGER	Pre-computed day bucket for fast aggregation
`downloaded_at`	INTEGER	Download timestamp (ms)
`hour_bucket`	INTEGER	Pre-computed hour bucket for fast aggregation
`id`	TEXT	Primary key (auto-generated UUID)
`ip_address`	TEXT	Client IP address
`month_bucket`	INTEGER	Pre-computed month bucket (YYYYMM format)
`remote_filename`	TEXT	Filename in R2
`remote_path`	TEXT	Path in R2 bucket
`remote_version`	TEXT	Version identifier
`user_agent`	TEXT	Client user agent

Indexes: day_bucket, hour_bucket, month_bucket, (bucket, remote_path, remote_filename, remote_version, day_bucket), (ip_address, day_bucket)

Development

# Run locally
npm run dev

# Run unit tests
npm test

# Run unit tests in watch mode
npm run test:watch

# Type check
npx tsc --noEmit

# Deploy
npm run deploy

# Run e2e tests (API only)
python e2e_test.py <api_url> <api_token>

# Run e2e tests with R2 upload/download (includes 5GB large file test)
python e2e_test.py <api_url> <api_token> <r2_access_key_id> <r2_secret_access_key> <r2_account_id>

E2E Tests with Bao

# API-only e2e tests
python e2e_test.py \
  $(bao kv get -field=api-url -namespace=elaunira/production kv/cloudflare/r2index) \
  $(bao kv get -field=api-token -namespace=elaunira/production kv/cloudflare/r2index)

# Full e2e tests including R2 upload/download and 5GB large file test
python e2e_test.py \
  $(bao kv get -field=api-url -namespace=elaunira/production kv/cloudflare/r2index) \
  $(bao kv get -field=api-token -namespace=elaunira/production kv/cloudflare/r2index) \
  $(bao kv get -field=access-key-id -namespace=elaunira/production kv/cloudflare/r2/e2e-tests) \
  $(bao kv get -field=secret-access-key -namespace=elaunira/production kv/cloudflare/r2/e2e-tests) \
  $(bao kv get -field=account-id -namespace=elaunira/production kv/cloudflare/r2/e2e-tests)

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
.github/workflows		.github/workflows
python		python
src		src
.gitignore		.gitignore
README.md		README.md
e2e_test.py		e2e_test.py
openplanetdata-r2index.iml		openplanetdata-r2index.iml
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts
wrangler.toml		wrangler.toml

openplanetdata/openplanetdata-r2index

Folders and files

Latest commit

History

Repository files navigation

R2 Index

Architecture

Who uses it?

Setup

1. Install dependencies

2. Create D1 database

3. Apply schema

4. Set API token

5. Deploy

Configuration

wrangler.toml

Environment Variables

Data Model

Core Fields

Remote Location (Unique Constraint)

Optional Metadata

API Reference

Health Check

Create/Update File (Upsert)

Get File

Get File by Remote Tuple

Update File

Delete File by ID

Delete File by Remote Tuple

Search Files

Get Nested Index

Record Download

Analytics: Time Series

Analytics: Summary

Analytics: By IP

Analytics: User Agents

Maintenance: Cleanup Downloads

Database Schema

files

file_tags

file_downloads

Development

E2E Tests with Bao

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages