
feat!: rearchitect proxy with Cloudflare Workers-compatible API #109

Draft
alukach wants to merge 78 commits into main from refactor/cf-workers-proxy

Conversation


@alukach alukach commented Feb 26, 2026

Important

This is not yet ready. I'm creating this PR to document the idea and its progress.

Note

This is a continuation of #108

What I'm changing

This PR performs a complete rebuild of the Source Data Proxy into a system of modular crates that can be composed to build proxies for various runtimes. This allows us to build for the Cloudflare Workers runtime by compiling to WASM.

Overall Architecture

I recommend that the proxy run in a multi-environment fashion, wherein clients could select a proxy that best fits their needs:

  • data.source.coop - A Workers-based deployment for general usage by users in non-cloud or cross-cloud environments.
  • {region}.data.source.coop - Cloud-provider-specific deployments of the data proxy to support in-region access (avoiding egress fees and promoting high throughput). These can be built for and deployed to traditional runtime environments, such as AWS ECS Fargate (as we do today).

Why Workers?

I think that Cloudflare Workers is the ideal runtime environment for most Source Data access for the following reasons:

  • Serverless: No infrastructure or scaling requirements, reducing operational burden.
  • No cold start: Workers are more akin to browser tabs than separate applications: they share a process but run in isolated contexts [1]. This means they start much faster than serverless options like AWS Lambda, with a cold-start time lower than the TLS handshake that takes place during connection setup (i.e., effectively no cold start) [2].
  • No wall-clock timeout: Workers have only a CPU-time limit, not a wall-clock timeout. This means a Worker can serve a download that takes many minutes, hours, or even days.
  • Globally distributed: Workers run in Cloudflare's CDN datacenters, so a client connects to a server that is likely much closer to its physical location, reducing latency. The connection is then tunneled over Cloudflare's data backbone [3] to the origin (e.g., an S3 bucket in an AWS region), which is most likely much faster than making that same connection over the public internet.
  • No egress fees: Workers do not charge based on egress. While each data source backend provider (e.g. AWS, GCP, Azure) will likely have its own egress fees, the proxy service will not add to those costs.
  • Reasonable pricing: $5/mo includes the first 10M requests, then $0.30 per additional million requests [4].

Performance Comparisons

Using AWS CloudShell, I downloaded a single 73MB file (cholmes/admin-boundaries/countries.parquet) from us-west-2 (Oregon), ap-south-1 (Mumbai), and eu-west-3 (Paris) and compared the results [5]. Across regions, the Cloudflare proxy significantly outperforms the Legacy Source proxy, primarily by reducing DNS, connection, and TLS handshake latency (often by 3-6x), which lowers TTFB by roughly 20-40% and improves throughput by up to ~25%. In-region (us-west-2), direct S3 remains marginally fastest, but Cloudflare adds only modest overhead and still materially outperforms Legacy. In cross-region scenarios (Paris and Mumbai), Cloudflare eliminates most of the connection and TLS penalties seen with Legacy and, in some cases, matches or slightly exceeds direct S3 total transfer performance due to optimized edge termination and backbone routing. Overall, Cloudflare removes the bulk of proxy-induced latency while delivering more consistent global performance than both Legacy and, at distance, even direct S3 access.

How I did it

Note

Almost all of this codebase was written by Claude Code via Opus 4.6.

The key challenge when working with Cloudflare Workers is avoiding the CPU time limit. This challenge is particularly apparent when dealing with large streams of data. Because Cloudflare Workers uses the V8 runtime, we must compile our system to WASM. Cloudflare Workers exposes request and response bodies as native JS ReadableStreams, surfaced in Rust as web_sys::ReadableStream. It is critical that these streams NOT be converted to a ByteStream, as doing so exhausts the CPU limit for any body larger than ~70MB. The system was therefore written so that each runtime environment defines its own stream format, and those streams are passed between incoming requests and the backend fileserver (or vice versa) untouched.
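The runtime-generic stream design described above can be sketched with a small trait. This is an illustrative, std-only sketch (all names here are hypothetical, not the PR's actual types): each runtime supplies its own native body type, and the proxy core forwards it without ever touching the bytes, so no per-chunk CPU work is billed to the Worker.

```rust
// Hypothetical marker trait: each runtime defines its own body type.
trait RuntimeBody {}

/// A generic forwarded response: the body is whatever native stream type
/// the runtime provides (a JS ReadableStream on Workers, a hyper Body on
/// a Tokio-based server, etc.).
struct ProxyResponse<B: RuntimeBody> {
    status: u16,
    body: B,
}

/// Stand-in for the Workers-native stream; in the real build this would
/// wrap web_sys::ReadableStream behind the same trait.
struct WorkersBody(&'static str);
impl RuntimeBody for WorkersBody {}

/// The core never inspects the bytes: it hands the native stream straight
/// back out, which is what keeps large transfers under the CPU limit.
fn forward<B: RuntimeBody>(upstream: ProxyResponse<B>) -> ProxyResponse<B> {
    ProxyResponse { status: upstream.status, body: upstream.body }
}

fn main() {
    let resp = forward(ProxyResponse { status: 200, body: WorkersBody("chunk") });
    assert_eq!(resp.status, 200);
    println!("{}", resp.body.0);
}
```

The design choice here is that the generic parameter keeps the stream opaque to the core crates; only the runtime adapter knows its concrete type.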

How to test it

The system is currently deployed at https://s3-proxy-rs.alukach.workers.dev with a subset of data. For experimentation, try accessing either the cholmes or harvard-lil buckets.

TODO

  • Can we make an allowlist exception on the Vercel Firewall for an ephemeral service like Cloudflare Workers to avoid rate-limiting?
  • Properly integrate with the Source API (currently, the Source backend has trouble listing buckets and bucket contents)
  • Re-add GitHub workflows
  • Implement IaC via Terraform [6]
  • Verify we want to brand proxy codebase with Source Cooperative
  • Define & document how a user installs the source coop CLI
  • Add documentation website

Related Issues

closes #1

Footnotes

  1. https://developers.cloudflare.com/workers/reference/how-workers-works/#isolates

  2. https://blog.cloudflare.com/eliminating-cold-starts-with-cloudflare-workers/

  3. https://blog.cloudflare.com/backbone2024/

  4. https://developers.cloudflare.com/workers/platform/pricing/#workers

  5. https://gist.github.com/alukach/416f5f588d0305034801369932e0ce40

  6. https://developers.cloudflare.com/workers/platform/infrastructure-as-code/#terraform

Replaced raw format! string interpolation of upload_id and part_number into query strings with url::form_urlencoded::Serializer::append_pair(), which properly percent-encodes special characters (&, =, etc.) in both UploadPart and CompleteMultipartUpload/AbortMultipartUpload URL construction.
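To show why this matters, here is a std-only sketch of the failure mode. The hand-rolled encoder below is purely illustrative of what append_pair does (the PR uses the url crate's form_urlencoded::Serializer, not this code): an upload_id containing & or = smuggles an extra query parameter when interpolated raw, but survives intact when percent-encoded.

```rust
/// Percent-encode a query component, keeping only RFC 3986 unreserved
/// characters. Illustrative stand-in for form_urlencoded encoding.
fn encode_component(s: &str) -> String {
    s.bytes()
        .map(|b| match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                (b as char).to_string()
            }
            _ => format!("%{:02X}", b),
        })
        .collect()
}

/// Build a single key=value query pair with both sides encoded.
fn query_pair(key: &str, value: &str) -> String {
    format!("{}={}", encode_component(key), encode_component(value))
}

fn main() {
    // A hostile upload_id that would inject a second parameter if
    // interpolated raw into the query string:
    let upload_id = "abc&partNumber=999";
    let raw = format!("uploadId={}", upload_id); // broken: parses as two pairs
    let safe = query_pair("uploadId", upload_id); // one correctly encoded pair
    assert_eq!(raw, "uploadId=abc&partNumber=999");
    assert_eq!(safe, "uploadId=abc%26partNumber%3D999");
    println!("{}", safe);
}
```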
Added validate_path_segment() that rejects values containing /, \, \0, .., ., or empty strings. Called before every format!("/path/{}", user_input) interpolation in get_bucket, get_role, get_credential, and get_temporary_credential.
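A plausible reconstruction of that validator, assuming the rules listed above (the exact semantics in the PR may differ): reject anything that could alter the path structure when interpolated into a "/path/{segment}" template.

```rust
/// Reject path segments that could escape their position in a URL path:
/// empty strings, separator or NUL characters, and dot segments.
/// (Hypothetical reconstruction; the PR's actual rules may differ.)
fn validate_path_segment(s: &str) -> Result<(), &'static str> {
    if s.is_empty() {
        return Err("empty path segment");
    }
    if s.contains('/') || s.contains('\\') || s.contains('\0') {
        return Err("path segment contains a separator or NUL");
    }
    if s == "." || s == ".." {
        return Err("path segment is a dot segment");
    }
    Ok(())
}

fn main() {
    assert!(validate_path_segment("my-bucket").is_ok());
    assert!(validate_path_segment("../roles").is_err()); // traversal attempt
    assert!(validate_path_segment("..").is_err());
    assert!(validate_path_segment("").is_err());
    println!("ok");
}
```

Calling this before every interpolation centralizes the check, rather than trusting each of get_bucket, get_role, get_credential, and get_temporary_credential to sanitize its own input.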
Changed all four tracing::debug! calls in dispatch_operation from url = %fwd.url (which logged the full presigned URL, including auth signatures in query params) to path = fwd.url.path() (which logs only the URL path: bucket and key, no credentials). The multipart backend_url log on line 418 was left as-is since that URL doesn't contain presigned auth params.
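A std-only sketch of what that logging change keeps and drops. The real code just calls url::Url::path(); the manual parsing below (a hypothetical helper, not the PR's code) only demonstrates which portion of a presigned URL is safe to log.

```rust
/// Return only the path portion of a URL, dropping the query string
/// (which carries X-Amz-Signature and friends on presigned URLs).
/// Illustrative stand-in for url::Url::path().
fn loggable_path(presigned: &str) -> &str {
    // Strip "scheme://host" by finding the first '/' after the authority...
    let after_scheme = presigned.split("://").nth(1).unwrap_or(presigned);
    let path_start = after_scheme.find('/').unwrap_or(after_scheme.len());
    let path = &after_scheme[path_start..];
    // ...then drop everything from '?' onward, where the signature lives.
    path.split('?').next().unwrap_or(path)
}

fn main() {
    let url = "https://bucket.s3.amazonaws.com/cholmes/countries.parquet?X-Amz-Signature=deadbeef";
    // Only "/cholmes/countries.parquet" would reach the logs.
    assert_eq!(loggable_path(url), "/cholmes/countries.parquet");
    println!("{}", loggable_path(url));
}
```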
@alukach force-pushed the refactor/cf-workers-proxy branch from ad208aa to 3b78af5 on February 27, 2026 at 04:50
alukach and others added 16 commits February 27, 2026 15:52
## Summary
- Adds a comprehensive VitePress documentation site in `docs/` covering
authentication, configuration, deployment, architecture, and extension
points
- Organized into user-facing guide (accessing data) and admin-facing
sections (deploying/configuring the proxy)
- Styled to match docs.source.coop visual identity: IBM Plex Sans body,
Cascadia Mono headings, warm off-white/teal-gray color scheme

## Test plan
- [ ] `cd docs && pnpm install && pnpm docs:dev` — site builds and
serves locally
- [ ] Navigate all sidebar links — no broken links
- [ ] Mermaid diagrams render correctly
- [ ] Light and dark themes both render properly
- [ ] Code examples are syntactically valid

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@alukach force-pushed the refactor/cf-workers-proxy branch from 143dfc2 to bbd7377 on March 2, 2026 at 18:20
@alukach force-pushed the refactor/cf-workers-proxy branch from 68c3746 to 8ea93f8 on March 2, 2026 at 20:01
alukach and others added 7 commits March 2, 2026 12:10
…koff

Reject non-HTTPS OIDC issuer URLs per the OIDC spec to prevent MITM
attacks. Cache failed JWKS fetches for 30s to avoid hammering broken
endpoints on repeated STS requests.
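The two hardening measures in this commit can be sketched std-only. The shapes below are assumptions for illustration (the real code's types and cache differ): the issuer check runs before any network call, and a failed JWKS fetch is remembered for 30 seconds so repeated STS requests don't hammer a broken endpoint.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// How long a failed JWKS fetch is remembered (per the commit message).
const FAILURE_TTL: Duration = Duration::from_secs(30);

/// Minimal negative cache keyed by issuer URL. Hypothetical shape.
struct JwksCache {
    failed_at: HashMap<String, Instant>,
}

impl JwksCache {
    fn new() -> Self {
        Self { failed_at: HashMap::new() }
    }

    /// OIDC requires the issuer identifier to use the https scheme;
    /// anything else is rejected to prevent MITM of the discovery flow.
    fn validate_issuer(issuer: &str) -> Result<(), &'static str> {
        if issuer.starts_with("https://") {
            Ok(())
        } else {
            Err("OIDC issuer must use https")
        }
    }

    /// True if a recent failure means the fetch should be skipped.
    fn in_backoff(&self, issuer: &str, now: Instant) -> bool {
        self.failed_at
            .get(issuer)
            .is_some_and(|t| now.duration_since(*t) < FAILURE_TTL)
    }

    fn record_failure(&mut self, issuer: &str, now: Instant) {
        self.failed_at.insert(issuer.to_string(), now);
    }
}

fn main() {
    assert!(JwksCache::validate_issuer("http://evil.example").is_err());
    assert!(JwksCache::validate_issuer("https://issuer.example").is_ok());

    let mut cache = JwksCache::new();
    let t0 = Instant::now();
    cache.record_failure("https://issuer.example", t0);
    // Still inside the 30s window: skip the fetch.
    assert!(cache.in_backoff("https://issuer.example", t0 + Duration::from_secs(10)));
    // Window expired: allow a retry.
    assert!(!cache.in_backoff("https://issuer.example", t0 + Duration::from_secs(31)));
    println!("ok");
}
```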

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move canonical request details to debug level and stop logging
expected/provided signatures entirely. Add access key and token
context to sealed token unsealing failures for easier debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Major refactor required
