2 changes: 2 additions & 0 deletions .gitignore
@@ -103,3 +103,5 @@ target/
.pytest_cache/

libs/*.whl

gcsfs/tests/perf/microbenchmarks/__run__
2 changes: 1 addition & 1 deletion .isort.cfg
@@ -1,3 +1,3 @@
[settings]
profile = black
known_third_party = aiohttp,click,decorator,fsspec,fuse,google,google_auth_oauthlib,pytest,requests,setuptools
known_third_party = aiohttp,click,conftest,decorator,fsspec,fuse,google,google_auth_oauthlib,numpy,prettytable,psutil,pytest,requests,resource_monitor,setuptools,yaml
2 changes: 1 addition & 1 deletion cloudbuild/e2e-tests-cloudbuild.yaml
@@ -129,7 +129,7 @@ steps:

pip install --upgrade pip > /dev/null
# Install testing libraries explicitly, as they are not in setup.py
pip install pytest pytest-timeout pytest-subtests pytest-asyncio fusepy google-cloud-storage > /dev/null
pip install pytest pytest-timeout pytest-subtests pytest-asyncio fusepy google-cloud-storage psutil PyYAML > /dev/null
pip install -e . > /dev/null

echo '--- Preparing test environment on VM ---'
5 changes: 5 additions & 0 deletions environment_gcsfs.yaml
@@ -13,12 +13,17 @@ dependencies:
- google-auth-oauthlib
- google-cloud-core
- google-cloud-storage
- numpy
- grpcio
- pytest
- pytest-benchmark
- pytest-timeout
- pytest-asyncio
- pytest-subtests
- psutil
- ptable
- requests
- ujson
- pyyaml
- pip:
- git+https://github.com/fsspec/filesystem_spec
2 changes: 2 additions & 0 deletions gcsfs/extended_gcsfs.py
@@ -144,6 +144,7 @@ def _open(
path,
mode="rb",
block_size=None,
cache_type="readahead",
cache_options=None,
acl=None,
consistency=None,
@@ -163,6 +164,7 @@
path,
mode,
block_size=block_size or self.default_block_size,
cache_type=cache_type,
cache_options=cache_options,
consistency=consistency or self.consistency,
metadata=metadata,
6 changes: 6 additions & 0 deletions gcsfs/tests/conftest.py
@@ -66,6 +66,12 @@

params = dict()

BUCKET_NAME_MAP = {
    "regional": TEST_BUCKET,
    "zonal": TEST_ZONAL_BUCKET,
    "hns": TEST_HNS_BUCKET,
}


def stop_docker(container):
cmd = shlex.split('docker ps -a -q --filter "name=%s"' % container)
192 changes: 192 additions & 0 deletions gcsfs/tests/perf/microbenchmarks/README.md
@@ -0,0 +1,192 @@
# GCSFS Microbenchmarks

## Introduction

This document describes the microbenchmark suite for `gcsfs`. These benchmarks are designed to measure the performance of various I/O operations under different conditions. They are built using `pytest` and the `pytest-benchmark` plugin to provide detailed performance metrics for single-threaded, multi-threaded, and multi-process scenarios.
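
For readers new to `pytest-benchmark`, each benchmark boils down to handing a callable to the `benchmark` fixture, which runs it repeatedly and reports timing statistics. The snippet below is a minimal, generic sketch using a local file; the function and file names are placeholders and not part of this suite.

```python
# Minimal pytest-benchmark usage: the `benchmark` fixture times the callable
# over several rounds and reports min/mean/stddev, operations per second, etc.
def read_file(path):
    with open(path, "rb") as f:
        return f.read()


def test_read_local_file(benchmark, tmp_path):
    path = tmp_path / "data.bin"
    path.write_bytes(b"\0" * 1024 * 1024)  # 1 MiB of dummy data
    data = benchmark(read_file, path)      # returns read_file's result
    assert len(data) == 1024 * 1024
```
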
> **Review comment (Member):** Style: comments are harder with these long lines; recommend sticking to 80-character limits.
>
> I would also test against `concurrent.interpreters`, since it is now available in Python 3.14, and also against free-threaded builds if all the upstream dependencies support them.
>
> **Reply (Contributor Author):** `concurrent.interpreters` is something I will take up in future PRs.

## Prerequisites

Before running the benchmarks, ensure you have installed the project's dependencies for performance testing. This can be done by running the following command from the root of the repository:
```bash
pip install -r gcsfs/tests/perf/microbenchmarks/requirements.txt
```

> **Review comment (Member):** Many are using tools like `uv` and others to specify the dependencies and the run command in a more declarative way, rather than as descriptive README text like this. Might be worth thinking about.
>
> **Reply (Contributor Author):** I have added the dependencies to the environment file as well, so a conda run works out of the box; for non-conda scenarios we will have to live with `requirements.txt` for now. This is something we can revisit in the future.

This will install `pytest`, `pytest-benchmark`, and other necessary dependencies.
For more information on `pytest-benchmark`, you can refer to its official documentation. [1]

## Read Benchmarks

The read benchmarks are located in `gcsfs/tests/perf/microbenchmarks/read/` and are designed to test read performance with various configurations.

### Parameters

The read benchmarks are defined by the `ReadBenchmarkParameters` class in `read/parameters.py`. Key parameters include:

* `name`: The name of the benchmark configuration.
* `num_files`: The number of files to use; this is always `num_processes` × `num_threads`.
* `pattern`: Read pattern, either sequential (`seq`) or random (`rand`).
* `num_threads`: Number of threads for multi-threaded tests.
* `num_processes`: Number of processes for multi-process tests.
* `block_size_bytes`: The block size for gcsfs file buffering. Defaults to `16MB`.
* `chunk_size_bytes`: The size of each read operation. Defaults to `16MB`.
* `file_size_bytes`: The total size of each file.
* `rounds`: The total number of pytest-benchmark rounds for each parameterized test. Defaults to `10`.


To ensure that the results are stable and not skewed by outliers, each benchmark is run for a set number of rounds.
By default this is 10 rounds, but it can be configured via the `rounds` parameter if needed. This helps provide a more accurate and reliable performance profile.
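
As a rough mental model, a read-benchmark configuration can be pictured as the dataclass below. This is only a sketch derived from the parameter list above; the field types and defaults are assumptions, not the actual `ReadBenchmarkParameters` definition in `read/parameters.py`.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class ReadBenchmarkParamsSketch:
    name: str
    pattern: str                      # "seq" or "rand"
    num_threads: int = 1
    num_processes: int = 1
    num_files: int = 1                # always num_processes * num_threads
    block_size_bytes: int = 16 * MB   # gcsfs file buffering
    chunk_size_bytes: int = 16 * MB   # size of each read() call
    file_size_bytes: int = 128 * MB
    rounds: int = 10                  # pytest-benchmark rounds per test
```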

### Configurations

The base configurations in `read/configs.yaml` are simplified to just `read_seq` and `read_rand`. Decorators are then used to generate a full suite of test cases by creating variations for parallelism, file sizes, and bucket types.
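
One plausible shape for such a fan-out decorator is sketched below, reusing the parameter sketch from the previous section; it is purely illustrative and makes no claim about the actual decorators in the `read/` package.

```python
import dataclasses


def with_thread_counts(*thread_counts):
    # Decorator sketch: expand a function returning base cases into one
    # variant per requested thread count.
    def decorate(make_cases):
        def expanded():
            cases = []
            for base in make_cases():
                for t in thread_counts:
                    cases.append(
                        dataclasses.replace(
                            base,
                            name=f"{base.name}_threads_{t}",
                            num_threads=t,
                            num_files=base.num_processes * t,
                        )
                    )
            return cases
        return expanded
    return decorate
```

Applied as `@with_thread_counts(1, 4, 16)` over a function returning the base cases, this would yield one variation per thread count.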

The benchmarks are split into three main test functions based on the execution model:

* `test_read_single_threaded`: Measures baseline performance of read operations.
* `test_read_multi_threaded`: Measures performance with multiple threads.
* `test_read_multi_process`: Measures performance using multiple processes, each with its own set of threads.
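
To make the execution models concrete, the sketch below shows the kind of per-file read loop a multi-threaded case performs: each thread reads one file in `chunk_size_bytes` chunks, either sequentially or at shuffled offsets. It is illustrative only; the actual test bodies live in the `read/` package.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def read_one_file(gcs, path, params):
    # Open with the configured buffering and read the whole file chunk by chunk.
    with gcs.open(path, "rb", block_size=params.block_size_bytes) as f:
        offsets = list(range(0, params.file_size_bytes, params.chunk_size_bytes))
        if params.pattern == "rand":
            random.shuffle(offsets)
        for offset in offsets:
            f.seek(offset)
            f.read(params.chunk_size_bytes)


def read_all_files(gcs, file_paths, params):
    # One thread per file; in the single-process case num_files == num_threads.
    with ThreadPoolExecutor(max_workers=params.num_threads) as pool:
        list(pool.map(lambda p: read_one_file(gcs, p, params), file_paths))
```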

### Running Benchmarks with `pytest`

You can use `pytest` to run the benchmarks directly.
The `GCSFS_BENCHMARK_FILTER` environment variable is useful for filtering tests by name.

**Examples:**

Run all read benchmarks:
```bash
pytest gcsfs/tests/perf/microbenchmarks/read/
```

Run specific benchmark configurations by setting the `GCSFS_BENCHMARK_FILTER` environment variable, which expects a comma-separated list of configuration names.
This is useful for targeting specific configurations defined in `read/configs.yaml`.

For example, to run only the multi-process sequential and random reads:
```bash
export GCSFS_BENCHMARK_FILTER="read_seq_multi_process, read_rand_multi_process"
pytest gcsfs/tests/perf/microbenchmarks/read/
```
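
A minimal sketch of how a comma-separated `GCSFS_BENCHMARK_FILTER` value can be applied when selecting parameterized cases; the real selection logic lives in the benchmark suite, and the whitespace handling here is an assumption.

```python
import os


def select_cases(all_cases):
    raw = os.environ.get("GCSFS_BENCHMARK_FILTER", "")
    wanted = {name.strip() for name in raw.split(",") if name.strip()}
    if not wanted:
        return all_cases  # no filter set: keep every configuration
    return [case for case in all_cases if case.name in wanted]
```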

## Function-level Fixture: `gcsfs_benchmark_read_write`

A function-level `pytest` fixture named `gcsfs_benchmark_read_write` (defined in `conftest.py`) is used to set up and tear down the environment for the benchmarks.

### Setup and Teardown

* **Setup**: Before a benchmark function runs, this fixture creates the specified number of files with the configured size in a temporary directory within the test bucket. It uses `os.urandom()` to write data in chunks to avoid high memory usage.
* **Teardown**: After the benchmark completes, the fixture recursively deletes the temporary directory and all the files created during the setup phase.

Here is how the fixture is used in a test:

```python
@pytest.mark.parametrize(
    "gcsfs_benchmark_read_write",
    single_threaded_cases,
    indirect=True,
    ids=lambda p: p.name,
)
def test_read_single_threaded(benchmark, gcsfs_benchmark_read_write):
    gcs, file_paths, params = gcsfs_benchmark_read_write
    # ... benchmark logic ...
```
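
The setup and teardown described above can be pictured roughly as follows. This is a sketch only: the `gcs` filesystem fixture, the temporary-directory naming, and the `bucket_name` attribute are assumptions, and the real fixture lives in the benchmark `conftest.py`.

```python
import os

import pytest

WRITE_CHUNK = 16 * 1024 * 1024  # write in chunks to keep memory usage bounded


@pytest.fixture
def gcsfs_benchmark_read_write(request, gcs):  # `gcs`: a GCSFileSystem fixture (assumed)
    params = request.param
    tmp_dir = f"{params.bucket_name}/benchmark-tmp"  # bucket/dir naming is an assumption
    file_paths = []
    for i in range(params.num_files):
        path = f"{tmp_dir}/file_{i}.bin"
        with gcs.open(path, "wb") as f:
            written = 0
            while written < params.file_size_bytes:
                n = min(WRITE_CHUNK, params.file_size_bytes - written)
                f.write(os.urandom(n))
                written += n
        file_paths.append(path)
    yield gcs, file_paths, params
    gcs.rm(tmp_dir, recursive=True)  # teardown: remove the directory and all created files
```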

### Environment Variables
To run the benchmarks, you need to configure your environment.
The orchestrator script (`run.py`) sets the following variables for you, but if you are running `pytest` directly, you will need to export them yourself.

* `GCSFS_TEST_BUCKET`: The name of a regional GCS bucket.
* `GCSFS_ZONAL_TEST_BUCKET`: The name of a zonal GCS bucket.
* `GCSFS_HNS_TEST_BUCKET`: The name of an HNS-enabled GCS bucket.

You must also set the following environment variables to ensure that the benchmarks run against the live GCS API and that experimental features are enabled.

```bash
export STORAGE_EMULATOR_HOST="https://storage.googleapis.com"
export GCSFS_EXPERIMENTAL_ZB_HNS_SUPPORT="true"
```
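
For orientation, the bucket-type names used by the benchmarks line up with these variables roughly as below; the direct environment lookups are an assumption, while the real suite uses the `BUCKET_NAME_MAP` added in `gcsfs/tests/conftest.py`.

```python
import os

# Sketch: resolving a bucket type ("regional", "zonal", "hns") to a bucket name.
BUCKETS_BY_TYPE = {
    "regional": os.environ.get("GCSFS_TEST_BUCKET"),
    "zonal": os.environ.get("GCSFS_ZONAL_TEST_BUCKET"),
    "hns": os.environ.get("GCSFS_HNS_TEST_BUCKET"),
}
```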

## Orchestrator Script (`run.py`)

An orchestrator script, `run.py`, is provided to simplify running the benchmark suite. It wraps `pytest`, sets up the necessary environment variables, and generates a summary report.

### Parameters

The script accepts several command-line arguments:

* `--group`: The benchmark group to run (e.g., `read`).
* `--config`: The name of a specific benchmark configuration to run (e.g., `read_seq`).
* `--regional-bucket`: Name of the Regional GCS bucket.
* `--zonal-bucket`: Name of the Zonal GCS bucket.
* `--hns-bucket`: Name of the HNS GCS bucket.
* `--log`: Set to `true` to enable `pytest` console logging.
* `--log-level`: Sets the log level (e.g., `INFO`, `DEBUG`).

**Note:** You must provide at least one bucket name (`--regional-bucket`, `--zonal-bucket`, or `--hns-bucket`).

Run the script with `--help` to see all available options:
```bash
python gcsfs/tests/perf/microbenchmarks/run.py --help
```
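
The flag surface corresponds roughly to an `argparse` setup like the sketch below; it is illustrative only, and the defaults and argument wiring in the real `run.py` may differ.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="GCSFS microbenchmark orchestrator (sketch)"
    )
    parser.add_argument("--group", help="Benchmark group to run, e.g. 'read'")
    parser.add_argument("--config", help="Specific configuration name, e.g. 'read_seq'")
    parser.add_argument("--regional-bucket", help="Regional GCS bucket name")
    parser.add_argument("--zonal-bucket", help="Zonal GCS bucket name")
    parser.add_argument("--hns-bucket", help="HNS-enabled GCS bucket name")
    parser.add_argument("--log", help="Set to 'true' to enable pytest console logging")
    parser.add_argument("--log-level", default="INFO", help="Log level, e.g. INFO or DEBUG")
    return parser
```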

### Examples

Here are some examples of how to use the orchestrator script from the root of the `gcsfs` repository:

Run all available benchmarks against a regional bucket with default settings. This is the simplest way to trigger all tests across all groups (e.g., read, write):
```bash
python gcsfs/tests/perf/microbenchmarks/run.py --regional-bucket your-regional-bucket
```

Run only the `read` group benchmarks against a regional bucket with the default 128MB file size:
```bash
python gcsfs/tests/perf/microbenchmarks/run.py --group read --regional-bucket your-regional-bucket
```

Run only the single-threaded sequential read benchmark with 256MB and 512MB file sizes:
```bash
python gcsfs/tests/perf/microbenchmarks/run.py \
--group read \
--config "read_seq" \
--regional-bucket your-regional-bucket
```

Run all read benchmarks against both a regional and a zonal bucket:
```bash
python gcsfs/tests/perf/microbenchmarks/run.py \
--group read \
--regional-bucket your-regional-bucket \
--zonal-bucket your-zonal-bucket
```

### Script Output

The script will create a timestamped directory in `gcsfs/tests/perf/microbenchmarks/__run__/` containing the JSON and CSV results, and it will print a summary table to the console.

#### JSON File (`results.json`)

The `results.json` file will contain a structured representation of the benchmark results.
The exact content can vary depending on the pytest-benchmark version and the tests run, but it typically includes:
* `machine_info`: Details about the system where the benchmarks were run (e.g., Python version, OS, CPU).
* `benchmarks`: A list of individual benchmark results, each containing:
  * `name`: The name of the benchmark test.
  * `stats`: Performance statistics such as min, max, mean, stddev, rounds, iterations, ops (operations per second), and the q1/q3 quartiles.
  * `options`: Configuration options used for the benchmark (e.g., min_rounds, max_time).
  * `extra_info`: Any additional information associated with the benchmark.
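
A short sketch of pulling a few of these fields out of `results.json` for ad-hoc inspection; the `<run-dir>` placeholder stands for the timestamped directory created by the script.

```python
import json

# Replace <run-dir> with the timestamped directory created by run.py.
with open("gcsfs/tests/perf/microbenchmarks/__run__/<run-dir>/results.json") as f:
    results = json.load(f)

print(results["machine_info"].get("python_version"))
for bench in results["benchmarks"]:
    stats = bench["stats"]
    print(bench["name"], stats["min"], stats["mean"], stats["stddev"], stats["ops"])
```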

#### CSV File (`results.csv`)
The CSV file provides a detailed performance profile of gcsfs operations, allowing for analysis of how factors like threading, process parallelism, and access patterns affect I/O throughput.
It is a summarized view of the results in the JSON file; for each test run it records detailed performance statistics, including:
* Minimum, maximum, mean, and median execution times in seconds.
* Standard deviation and percentile values (p90, p95, p99) for timing.
* The maximum throughput achieved, measured in megabytes per second (MB/s).
* The maximum CPU and memory usage observed during the test.


#### Summary Table
The script also prints a summary table like the one below for a quick glance at the results.

| Bucket Type | Group | Pattern | Files | Threads | Processes | File Size (MB) | Chunk Size (MB) | Block Size (MB) | Min Latency (s) | Mean Latency (s) | Max Throughput (MB/s) | Max CPU (%) | Max Memory (MB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| regional | read | seq | 1 | 1 | 1 | 128.00 | 16.00 | 16.00 | 0.6391 | 0.7953 | 200.2678 | 0.26 | 507 |
| regional | read | rand | 1 | 1 | 1 | 128.00 | 16.00 | 16.00 | 0.6537 | 0.7843 | 195.8066 | 5.6 | 510 |