A queryable metadata index for files stored in Cloudflare R2. Built as a Cloudflare Worker with D1 database, it enables searching, filtering, and organizing files by category, entity, tags, and custom metadata — without scanning the object store.
Client (Airflow, etc.)
│
├─► Worker API ─► D1 (metadata CRUD/search)
│
└─► R2 (direct upload/download via S3-compatible API)
The Worker handles metadata only. File content is uploaded/downloaded directly to R2 using the S3-compatible API.
| Ipregistry | Noticeable | OpenPlanetData |
npm installwrangler d1 create r2indexUpdate wrangler.toml with the returned database_id.
npm run db:initwrangler secret put R2INDEX_API_TOKENnpm run deployname = "r2index"
main = "src/index.ts"
compatibility_date = "2026-01-31"
routes = [
{ pattern = "r2index.acme.com/*", zone_name = "acme.com" }
]
[[d1_databases]]
binding = "D1"
database_name = "r2index"
database_id = "<your-database-id>"
[vars]
CACHE_MAX_AGE = "60"
# DOWNLOADS_RETENTION_DAYS = "365"| Variable | Description | Default |
|---|---|---|
CACHE_MAX_AGE |
Cache-Control max-age in seconds | 60 |
DOWNLOADS_RETENTION_DAYS |
Days to keep download records before cleanup | 365 |
R2INDEX_API_TOKEN |
Bearer token for API authentication (set via wrangler secret put) |
Required |
| Field | Description | Example |
|---|---|---|
category |
Product or service grouping | acme |
entity |
Specific dataset identifier | acme-abuser, acme-geolocation |
extension |
File format | csv, csv.zip, mmdb |
media_type |
MIME type | text/csv, application/zip |
name |
Human-readable name | Abuser, Geolocation |
The tuple (bucket, remote_path, remote_filename, remote_version) uniquely identifies a file in R2:
| Field | Description | Example |
|---|---|---|
bucket |
S3/R2 bucket name | my-bucket |
remote_path |
Directory path in R2 | acme/abuser |
remote_filename |
File name in R2 | abuser.csv |
remote_version |
Version identifier | 2026-02-03, v1 |
| Field | Description |
|---|---|
checksum_md5 |
MD5 hash |
checksum_sha1 |
SHA1 hash |
checksum_sha256 |
SHA256 hash |
checksum_sha512 |
SHA512 hash |
deprecated |
Boolean flag |
deprecation_reason |
Reason for deprecation |
extra |
Arbitrary JSON (e.g., header_line, line_count) |
metadata_path |
Path to associated metadata file |
size |
File size in bytes |
tags |
Array of tags for filtering |
All endpoints require authentication via Authorization: Bearer <token> header.
GET /health
Returns { "status": "ok" }. No authentication required.
POST /files
Creates or updates a file based on the unique constraint (bucket, remote_path, remote_filename, remote_version).
Request Body:
{
"bucket": "my-bucket",
"category": "acme",
"entity": "acme-abuser",
"extension": "csv",
"media_type": "text/csv",
"name": "Abuser",
"remote_path": "acme/abuser",
"remote_filename": "abuser.csv",
"remote_version": "2026-02-03",
"size": 5023465,
"checksum_md5": "21a165f3ddef92b90dccb0c1bb4e249f",
"checksum_sha1": "b588c39c691a2bc2cdd81e9f826ae9b5eb163e39",
"checksum_sha256": "8dac526e40c250f3ad117d05452e04814e2c979754a2e4810d8f85413d188ba6",
"checksum_sha512": "0f4bdedf66e5ec214aa1302d624913c2137c9cbfe1f81c0a63138c9ddd69d0c0",
"extra": {
"header_line": "# ip_start,ip_end",
"line_count": 169964
},
"tags": ["ip", "security"]
}| Field | Type | Required | Description |
|---|---|---|---|
bucket |
string | Yes | S3/R2 bucket name |
category |
string | Yes | Product or service grouping (e.g., acme) |
checksum_md5 |
string | No | MD5 hash |
checksum_sha1 |
string | No | SHA1 hash |
checksum_sha256 |
string | No | SHA256 hash |
checksum_sha512 |
string | No | SHA512 hash |
entity |
string | Yes | Dataset identifier (e.g., acme-abuser) |
extension |
string | Yes | File format (e.g., csv, mmdb) |
extra |
object | No | Arbitrary JSON (merged into nested index output) |
media_type |
string | Yes | MIME type (e.g., text/csv) |
metadata_path |
string | No | Path to associated metadata file |
name |
string | No | Human-readable name (e.g., Abuser) |
remote_filename |
string | Yes | Filename in R2 |
remote_path |
string | Yes | Directory path in R2 |
remote_version |
string | Yes | Version identifier (e.g., 2026-02-03) |
size |
integer | No | File size in bytes |
tags |
string[] | No | Tags for filtering |
Response: 201 Created (new) or 200 OK (updated) with file record (includes auto-generated id).
GET /files/:id
Response: 200 OK with file record or 404 Not Found.
GET /files/by-tuple
Retrieves a file by its unique remote tuple.
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
bucket |
string | Yes | S3/R2 bucket name |
remote_path |
string | Yes | Directory path in R2 |
remote_filename |
string | Yes | Filename in R2 |
remote_version |
string | Yes | Version identifier |
Example Request:
curl "https://r2index.acme.com/files/by-tuple?bucket=my-bucket&remote_path=acme/abuser&remote_filename=abuser.csv&remote_version=2026-02-03"Response: 200 OK with file record or 404 Not Found.
PUT /files/:id
Request Body: Any subset of file fields to update.
Response: 200 OK with updated file record.
DELETE /files/:id
Removes file metadata from the index. Does not delete the actual file in R2.
Response: 200 OK with { "success": true } or 404 Not Found.
DELETE /files
Removes file metadata from the index. Does not delete the actual file in R2.
Request Body:
{
"bucket": "my-bucket",
"remote_path": "acme/abuser",
"remote_filename": "abuser.csv",
"remote_version": "2026-02-03"
}Response: 200 OK with { "success": true } or 404 Not Found.
GET /files
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
bucket |
string | Filter by bucket (exact match) |
category |
string | Filter by category (exact match) |
deprecated |
boolean | Filter by deprecated status (true or false) |
entity |
string | Filter by entity (exact match) |
extension |
string | Filter by extension (exact match) |
limit |
integer | Max results (default: 100, max: 1000) |
media_type |
string | Filter by media type (exact match) |
offset |
integer | Pagination offset (default: 0) |
tags |
string | Filter by tags (comma-separated, must have ALL) |
group_by |
string | Group results by field: bucket, category, entity, extension, media_type, deprecated |
Example Requests:
# Get all files
curl "https://r2index.acme.com/files"
# Filter by category
curl "https://r2index.acme.com/files?category=acme"
# Filter by category and entity
curl "https://r2index.acme.com/files?category=acme&entity=acme-abuser"
# Filter by extension
curl "https://r2index.acme.com/files?extension=csv"
# Filter by tags (must have ALL specified tags)
curl "https://r2index.acme.com/files?tags=ip,security"
# Filter non-deprecated files only
curl "https://r2index.acme.com/files?deprecated=false"
# Combine filters with pagination
curl "https://r2index.acme.com/files?category=acme&extension=csv&limit=50&offset=0"
# Group by extension for a given category
curl "https://r2index.acme.com/files?category=acme&group_by=extension"Grouped Response (when using group_by):
{
"groups": [
{ "value": "csv", "count": 14 },
{ "value": "csv.zip", "count": 14 },
{ "value": "mmdb", "count": 14 }
],
"total": 42
}Response:
{
"files": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"bucket": "my-bucket",
"name": "Abuser",
"category": "acme",
"entity": "acme-abuser",
"extension": "csv",
"media_type": "text/csv",
"remote_path": "acme/abuser",
"remote_filename": "abuser.csv",
"remote_version": "2026-02-03",
"metadata_path": null,
"size": 5023465,
"checksum_md5": "21a165f3ddef92b90dccb0c1bb4e249f",
"checksum_sha1": "b588c39c691a2bc2cdd81e9f826ae9b5eb163e39",
"checksum_sha256": "8dac526e40c250f3ad117d05452e04814e2c979754a2e4810d8f85413d188ba6",
"checksum_sha512": "0f4bdedf66e5ec214aa1302d624913c2137c9cbfe1f81c0a63138c9ddd69d0c0",
"extra": {
"header_line": "# ip_start,ip_end",
"line_count": 169964
},
"deprecated": false,
"deprecation_reason": "",
"created": 1706918150000,
"updated": 1706918150000,
"tags": ["ip", "security"]
}
],
"total": 1
}GET /files/index
Returns files grouped by entity then by extension in a nested structure. Useful for generating compatibility indexes.
Query Parameters: Same filters as Search Files (except limit, offset, group_by).
Example Request:
curl "https://r2index.acme.com/files/index?category=acme"Response:
{
"acme-abuser": {
"csv": {
"checksums": {
"md5": "21a165f3ddef92b90dccb0c1bb4e249f",
"sha1": "b588c39c691a2bc2cdd81e9f826ae9b5eb163e39",
"sha256": "8dac526e40c250f3ad117d05452e04814e2c979754a2e4810d8f85413d188ba6",
"sha512": "0f4bdedf66e5ec214aa1302d624913c2137c9cbfe1f81c0a63138c9ddd69d0c0c8ffc5f296a3c6d9c4256b86ca2a18b08c34e7f6897d152c16dde6526a07461f"
},
"file_size": "5023465",
"last_updated": "2026-02-03T18:55:50.000Z",
"name": "Abuser",
"header_line": "# ip_start,ip_end",
"line_count": 169964
},
"mmdb": {
"checksums": {
"md5": "2cf1f2d9ae301714b7ed7979553c76be"
},
"file_size": "10573656",
"last_updated": "2026-02-03T18:55:50.000Z"
}
},
"acme-as": {
"csv": {
"checksums": {
"md5": "8356651e9e7bfdbf94fa41c9911d9cdf"
},
"file_size": "7355656",
"last_updated": "2026-02-03T18:54:02.000Z",
"name": "AS"
}
}
}Extra fields from the extra JSON column are merged into each entry.
POST /downloads
Records a file download event for analytics tracking.
Request Body:
{
"bucket": "my-bucket",
"remote_path": "acme/abuser",
"remote_filename": "abuser.csv",
"remote_version": "2026-02-03",
"ip_address": "192.168.1.1",
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
}| Field | Type | Required | Description |
|---|---|---|---|
bucket |
string | Yes | S3/R2 bucket name |
ip_address |
string | Yes | Client IP address (IPv4 or IPv6) |
remote_filename |
string | Yes | File name in R2 |
remote_path |
string | Yes | Directory path in R2 |
remote_version |
string | Yes | Version identifier |
user_agent |
string | No | Client user agent string |
Example Request:
curl -X POST "https://r2index.acme.com/downloads" \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"bucket": "my-bucket",
"remote_path": "acme/abuser",
"remote_filename": "abuser.csv",
"remote_version": "2026-02-03",
"ip_address": "192.168.1.1",
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
}'Response: 201 Created with download record including pre-computed time buckets.
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"bucket": "my-bucket",
"remote_path": "acme/abuser",
"remote_filename": "abuser.csv",
"remote_version": "2026-02-03",
"ip_address": "192.168.1.1",
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
"downloaded_at": 1706918150000,
"hour_bucket": 1706914800000,
"day_bucket": 1706832000000,
"month_bucket": 202402
}GET /analytics/timeseries
Returns download counts over time, grouped by hour, day, or month.
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
bucket |
string | No | Filter by bucket |
end |
integer | Yes | End timestamp (ms) |
limit |
integer | No | Max files per bucket (default: 100, max: 1000) |
remote_filename |
string | No | Filter by remote filename |
remote_path |
string | No | Filter by remote path |
remote_version |
string | No | Filter by remote version |
scale |
string | No | Time bucket: hour, day, month (default: day) |
start |
integer | Yes | Start timestamp (ms) |
Example Request:
curl "https://r2index.acme.com/analytics/timeseries?start=1704067200000&end=1706745600000&scale=day"Response:
{
"scale": "day",
"buckets": [
{
"timestamp": 1704067200000,
"files": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"bucket": "my-bucket",
"remote_path": "acme/abuser",
"remote_filename": "abuser.csv",
"remote_version": "2026-02-03",
"downloads": 100,
"unique_downloads": 30
},
{
"id": "550e8400-e29b-41d4-a716-446655440001",
"bucket": "my-bucket",
"remote_path": "acme/geolocation",
"remote_filename": "geolocation.mmdb",
"remote_version": "2026-02-03",
"downloads": 50,
"unique_downloads": 15
}
],
"total_downloads": 150,
"total_unique_downloads": 45
}
],
"period": { "start": 1704067200000, "end": 1706745600000 }
}timestamp: Start of the time bucket (e.g., forscale=day, midnight UTC of that day; forscale=month, YYYYMM integer like202401)id: File ID from the index (null if file not in index)
GET /analytics/summary
Returns aggregate statistics for a time period.
Query Parameters: Same as Time Series (start, end, file filters).
Example Request:
curl "https://r2index.acme.com/analytics/summary?start=1704067200000&end=1706745600000"Response:
{
"total_downloads": 1234,
"unique_downloads": 567,
"top_user_agents": [
{ "user_agent": "Chrome/120", "downloads": 500 },
{ "user_agent": "Safari/17", "downloads": 300 }
],
"period": { "start": 1704067200000, "end": 1706745600000 }
}GET /analytics/by-ip
Returns downloads for a specific IP address.
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
end |
integer | Yes | End timestamp (ms) |
ip |
string | Yes | IP address to search |
limit |
integer | No | Max results (default: 100, max: 1000) |
offset |
integer | No | Pagination offset |
start |
integer | Yes | Start timestamp (ms) |
Example Request:
curl "https://r2index.acme.com/analytics/by-ip?ip=192.168.1.1&start=1704067200000&end=1706745600000"Response:
{
"downloads": [
{
"bucket": "my-bucket",
"remote_path": "acme/abuser",
"remote_filename": "abuser.csv",
"remote_version": "2026-02-03",
"downloaded_at": 1704067200000,
"user_agent": "Chrome/120"
}
],
"total": 45
}GET /analytics/user-agents
Returns download statistics grouped by user agent.
Query Parameters: Same as Time Series, plus limit (default: 20, max: 100).
Example Request:
curl "https://r2index.acme.com/analytics/user-agents?start=1704067200000&end=1706745600000&limit=10"Response:
{
"user_agents": [
{ "user_agent": "Chrome/120", "downloads": 500, "unique_ips": 234 },
{ "user_agent": "Safari/17", "downloads": 300, "unique_ips": 156 }
],
"period": { "start": 1704067200000, "end": 1706745600000 }
}POST /maintenance/cleanup-downloads
Deletes download records older than DOWNLOADS_RETENTION_DAYS (default: 365 days). Call this endpoint periodically (e.g., daily via cron or Cloudflare Cron Triggers) to keep the database size manageable.
Example Request:
curl -X POST "https://r2index.acme.com/maintenance/cleanup-downloads" \
-H "Authorization: Bearer <token>"Response:
{
"deleted_count": 1234,
"retention_days": 365
}Cloudflare Cron Trigger Example:
Add to wrangler.toml:
[triggers]
crons = ["0 2 * * *"] # Run daily at 2 AM UTCThen handle the scheduled event in your worker:
export default {
async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
const deleted = await cleanupOldDownloads(env.DB, parseInt(env.DOWNLOADS_RETENTION_DAYS || '365', 10));
console.log(`Cleanup: deleted ${deleted} old download records`);
},
// ... fetch handler
};| Column | Type | Description |
|---|---|---|
bucket |
TEXT | S3/R2 bucket name |
category |
TEXT | File category |
checksum_md5 |
TEXT | MD5 checksum |
checksum_sha1 |
TEXT | SHA1 checksum |
checksum_sha256 |
TEXT | SHA256 checksum |
checksum_sha512 |
TEXT | SHA512 checksum |
created |
INTEGER | Creation timestamp (ms) |
deprecated |
INTEGER | Deprecation flag (returns as boolean) |
deprecation_reason |
TEXT | Reason for deprecation |
entity |
TEXT | Entity type |
extension |
TEXT | File extension |
extra |
TEXT | JSON metadata |
id |
TEXT | Primary key (auto-generated UUID) |
media_type |
TEXT | MIME type |
metadata_path |
TEXT | Path to metadata file |
name |
TEXT | Human-readable name |
remote_filename |
TEXT | Filename in R2 |
remote_path |
TEXT | Path in R2 bucket |
remote_version |
TEXT | Version identifier |
size |
INTEGER | File size in bytes |
updated |
INTEGER | Last update timestamp (ms) |
Unique Constraint: (bucket, remote_path, remote_filename, remote_version)
| Column | Type | Description |
|---|---|---|
file_id |
TEXT | Foreign key to files.id |
tag |
TEXT | Tag value |
Primary Key: (file_id, tag)
| Column | Type | Description |
|---|---|---|
bucket |
TEXT | S3/R2 bucket name |
day_bucket |
INTEGER | Pre-computed day bucket for fast aggregation |
downloaded_at |
INTEGER | Download timestamp (ms) |
hour_bucket |
INTEGER | Pre-computed hour bucket for fast aggregation |
id |
TEXT | Primary key (auto-generated UUID) |
ip_address |
TEXT | Client IP address |
month_bucket |
INTEGER | Pre-computed month bucket (YYYYMM format) |
remote_filename |
TEXT | Filename in R2 |
remote_path |
TEXT | Path in R2 bucket |
remote_version |
TEXT | Version identifier |
user_agent |
TEXT | Client user agent |
Indexes: day_bucket, hour_bucket, month_bucket, (bucket, remote_path, remote_filename, remote_version, day_bucket), (ip_address, day_bucket)
# Run locally
npm run dev
# Run unit tests
npm test
# Run unit tests in watch mode
npm run test:watch
# Type check
npx tsc --noEmit
# Deploy
npm run deploy
# Run e2e tests (API only)
python e2e_test.py <api_url> <api_token>
# Run e2e tests with R2 upload/download (includes 5GB large file test)
python e2e_test.py <api_url> <api_token> <r2_access_key_id> <r2_secret_access_key> <r2_account_id># API-only e2e tests
python e2e_test.py \
$(bao kv get -field=api-url -namespace=elaunira/production kv/cloudflare/r2index) \
$(bao kv get -field=api-token -namespace=elaunira/production kv/cloudflare/r2index)
# Full e2e tests including R2 upload/download and 5GB large file test
python e2e_test.py \
$(bao kv get -field=api-url -namespace=elaunira/production kv/cloudflare/r2index) \
$(bao kv get -field=api-token -namespace=elaunira/production kv/cloudflare/r2index) \
$(bao kv get -field=access-key-id -namespace=elaunira/production kv/cloudflare/r2/e2e-tests) \
$(bao kv get -field=secret-access-key -namespace=elaunira/production kv/cloudflare/r2/e2e-tests) \
$(bao kv get -field=account-id -namespace=elaunira/production kv/cloudflare/r2/e2e-tests)MIT