Implementing only a little bit of S3
S1 is a lightweight S3-compatible API that implements a core subset of S3 operations for reading data from Google Cloud Storage (GCS). It implements the following S3 APIs:
- GetBucketLocation - Returns the region where the bucket resides
- ListObjects - Lists objects in a bucket with support for filtering
- GetObject - Retrieves objects from a bucket
- SelectObjectContent (S3 Select) - Enables SQL queries on S3 objects for data filtering and transformation
```
GET /{bucket}?location
```

Returns the AWS region for the bucket (always `eu-west-2`).
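As a quick illustration, any S3 client can call this endpoint. Here is a hedged sketch with boto3, assuming a local S1 instance on port 8080 and a hypothetical bucket name; the dummy credentials only satisfy the client library, since S1 reads from GCS rather than AWS:

```python
# Sketch: query S1's GetBucketLocation endpoint via boto3.
# The endpoint URL, bucket name, and credentials are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8080",  # assumed local S1 instance
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
    region_name="eu-west-2",
)

location = s3.get_bucket_location(Bucket="my-bucket")  # hypothetical bucket
print(location["LocationConstraint"])  # S1 always reports eu-west-2
```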
```
GET /{bucket}?delimiter={delimiter}&prefix={prefix}&max-keys={max-keys}&marker={marker}
```
Lists objects in a bucket (see the boto3 sketch below). Supported query parameters:

- `prefix` - Limits the response to keys that begin with the specified prefix
- `delimiter` - Character used to group keys
- `max-keys` - Maximum number of keys to return (default: 1000)
- `marker` - Key to start with when listing objects
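A minimal boto3 sketch of these parameters, under the same local-endpoint assumption as above (the bucket name and prefix are hypothetical):

```python
# Sketch: list objects through S1's ListObjects endpoint via boto3.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8080",  # assumed local S1 instance
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
    region_name="eu-west-2",
)

response = s3.list_objects(
    Bucket="my-bucket",   # hypothetical bucket
    Prefix="datasets/",   # only keys beginning with this prefix
    Delimiter="/",        # group keys by "directory"
    MaxKeys=100,          # cap the number of keys returned
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```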
```
GET /{bucket}/{object}
```
Retrieves an object from the bucket.
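Since this route is plain HTTP, a sketch with `requests` is enough, assuming a local S1 instance on port 8080 that does not enforce request signing; the bucket and key are hypothetical:

```python
# Sketch: fetch a blob directly via GET /{bucket}/{object}.
import requests

resp = requests.get("http://localhost:8080/my-bucket/datasets/prices.csv")
resp.raise_for_status()
data = resp.content  # raw bytes of the object
```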
```
POST /{bucket}/{object}?select&select-type=2
```
Performs SQL queries on objects stored in S3. The request body should contain XML with:
- SQL expression
- Input serialization format (Parquet only)
- Output serialization format (CSV or JSON)
Note: The SQL API only supports Parquet files. For accessing other file types (CSV, JSON, etc.), use the GetObject endpoint.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<SelectObjectContentRequest xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Expression>SELECT * FROM S3Object WHERE price > 100</Expression>
    <ExpressionType>SQL</ExpressionType>
    <InputSerialization>
        <Parquet/>
    </InputSerialization>
    <OutputSerialization>
        <JSON/>
    </OutputSerialization>
</SelectObjectContentRequest>
```

- Input Formats: Parquet only
- Output Formats: CSV, JSON
- SQL Operations:
  - SELECT with column specification or wildcard (`*`)
  - Basic WHERE clause filtering
  - Queries against the `S3Object` alias
Important: The SQL API (SelectObjectContent) only supports Parquet files. For other file formats like CSV or JSON, use the GetObject API for blob access.
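For completeness, here is a hedged sketch of the query from the XML example above issued through boto3's `select_object_content`, again assuming a local S1 instance and a hypothetical Parquet object:

```python
# Sketch: run an S3 Select query against S1 via boto3.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8080",  # assumed local S1 instance
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
    region_name="eu-west-2",
)

response = s3.select_object_content(
    Bucket="my-bucket",                 # hypothetical bucket
    Key="datasets/prices.parquet",      # hypothetical Parquet object
    Expression="SELECT * FROM S3Object WHERE price > 100",
    ExpressionType="SQL",
    InputSerialization={"Parquet": {}},  # Parquet is the only input format
    OutputSerialization={"JSON": {}},
)

# Results arrive as an event stream; Records events carry the rows.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```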
The implementation uses:
- FastAPI for the web framework
- Storage abstraction layer supporting both Google Cloud Storage and the local filesystem (a sketch follows this list)
- LRU caching for improved read performance using Python's `functools.lru_cache`
- XML parsing for S3 Select request handling
- Parquet support for SQL queries (other formats available via GetObject)
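The abstraction layer itself isn't shown in this README, but a two-backend design along these lines would match the description above; class and method names here are illustrative assumptions, not the actual source:

```python
# Sketch: one storage interface, two backends (local filesystem and GCS).
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    @abstractmethod
    def read_blob(self, bucket: str, key: str) -> bytes: ...

    @abstractmethod
    def list_blobs(self, bucket: str, prefix: str = "") -> list[str]: ...


class LocalStorage(StorageBackend):
    """Serves blobs from a directory tree rooted at base_path."""

    def __init__(self, base_path: str = "/data"):
        self.base = Path(base_path)

    def read_blob(self, bucket: str, key: str) -> bytes:
        return (self.base / bucket / key).read_bytes()

    def list_blobs(self, bucket: str, prefix: str = "") -> list[str]:
        root = self.base / bucket
        return sorted(
            str(p.relative_to(root))
            for p in root.rglob("*")
            if p.is_file() and str(p.relative_to(root)).startswith(prefix)
        )


class GCSStorage(StorageBackend):
    """Reads blobs from Google Cloud Storage.

    The google-cloud-storage client honours STORAGE_EMULATOR_HOST
    automatically when that environment variable is set.
    """

    def __init__(self, project: str = "PROJECT"):
        from google.cloud import storage  # lazy import; optional dependency

        self.client = storage.Client(project=project)

    def read_blob(self, bucket: str, key: str) -> bytes:
        return self.client.bucket(bucket).blob(key).download_as_bytes()

    def list_blobs(self, bucket: str, prefix: str = "") -> list[str]:
        return [b.name for b in self.client.list_blobs(bucket, prefix=prefix)]
```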
Start the service with:

```bash
python src/main.py
```

The service will start on port 8080, or on the port specified in the `PORT` environment variable.
S1 supports two storage backends:
The default backend uses Google Cloud Storage (GCS). When the `STORAGE_EMULATOR_HOST` environment variable is set, it connects to a storage emulator instead, for testing purposes.
S1 can also use the local filesystem as a storage backend, which is useful for testing or development environments.
The storage backend is configured using environment variables (a minimal reader for these is sketched after the list):

- `STORAGE_BACKEND` - Set to `gcs` (default) or `local` to choose the backend
- `STORAGE_CACHE_SIZE` - LRU cache size for blob content (default: 128)
- `LOCAL_STORAGE_PATH` - Base path for local filesystem storage (default: `/data`)
- `GCS_PROJECT` - GCS project name (default: `PROJECT`)
- `STORAGE_EMULATOR_HOST` - GCS emulator host for testing
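A sketch of reading this configuration at startup; the variable names and defaults mirror the list above, while the surrounding code is illustrative:

```python
# Sketch: resolve S1's configuration from the environment.
import os

STORAGE_BACKEND = os.environ.get("STORAGE_BACKEND", "gcs")      # "gcs" or "local"
STORAGE_CACHE_SIZE = int(os.environ.get("STORAGE_CACHE_SIZE", "128"))
LOCAL_STORAGE_PATH = os.environ.get("LOCAL_STORAGE_PATH", "/data")
GCS_PROJECT = os.environ.get("GCS_PROJECT", "PROJECT")
# STORAGE_EMULATOR_HOST is read directly by the GCS client library when set.
```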
S1 implements LRU (Least Recently Used) caching for blob content reads. This significantly improves performance when the same objects are accessed multiple times. The cache size can be configured using the `STORAGE_CACHE_SIZE` environment variable.
The caching layer operates transparently for both GCS and local filesystem backends, making S1 an effective caching layer for systems like Opteryx.
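As a sketch of the idea (the function name is an assumption, and the local filesystem read stands in for whichever backend is configured):

```python
# Sketch: cache blob reads with functools.lru_cache, keyed on (bucket, key).
import os
from functools import lru_cache
from pathlib import Path

CACHE_SIZE = int(os.environ.get("STORAGE_CACHE_SIZE", "128"))


@lru_cache(maxsize=CACHE_SIZE)
def cached_read(bucket: str, key: str) -> bytes:
    # Stand-in for the configured backend's read; repeated reads of the
    # same (bucket, key) are served from the cache, not from storage.
    return (Path("/data") / bucket / key).read_bytes()
```

Because the decorated function is the single path to blob content, both backends benefit equally, and `cached_read.cache_info()` can be used to inspect hit and miss counts.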