Adds GCSFS Microbenchmarks #722
base: main

Changes from all commits: 2c27513, 378ba11, c251094, e95aeda, 895ce22, c6a5406
`.gitignore` (`@@ -103,3 +103,5 @@ target/`): two entries added after `.pytest_cache/`:

* `libs/*.whl`
* `gcsfs/tests/perf/microbenchmarks/__run__`
`.isort.cfg` (`@@ -1,3 +1,3 @@`, `[settings]`, `profile = black`): the `known_third_party` list is updated.

* Before: `aiohttp,click,decorator,fsspec,fuse,google,google_auth_oauthlib,pytest,requests,setuptools`
* After: `aiohttp,click,conftest,decorator,fsspec,fuse,google,google_auth_oauthlib,numpy,prettytable,psutil,pytest,requests,resource_monitor,setuptools,yaml`
New file (192 lines):

# GCSFS Microbenchmarks

## Introduction

This document describes the microbenchmark suite for `gcsfs`. These benchmarks are designed to measure the performance of various I/O operations under different conditions. They are built using `pytest` and the `pytest-benchmark` plugin to provide detailed performance metrics for single-threaded, multi-threaded, and multi-process scenarios.

## Prerequisites

Before running the benchmarks, ensure you have installed the project's dependencies for performance testing. This can be done by running the following command from the root of the repository:
```bash
pip install -r gcsfs/tests/perf/microbenchmarks/requirements.txt
```

> **Member:** Many are using tools like…
>
> **Contributor (Author):** I have added the dependencies to the env file as well, so a conda run works out of the box; for non-conda scenarios we will have to live with `requirements.txt` for now. This is something we can revisit in the future.

This will install `pytest`, `pytest-benchmark`, and other necessary dependencies.
For more information on `pytest-benchmark`, refer to its official documentation. [1]

## Read Benchmarks

The read benchmarks are located in `gcsfs/tests/perf/microbenchmarks/read/` and are designed to test read performance with various configurations.

### Parameters

The read benchmarks are defined by the `ReadBenchmarkParameters` class in `read/parameters.py`. Key parameters include:

* `name`: The name of the benchmark configuration.
* `num_files`: The number of files to use; this is always `num_processes` x `num_threads`.
* `pattern`: Read pattern, either sequential (`seq`) or random (`rand`).
* `num_threads`: Number of threads for multi-threaded tests.
* `num_processes`: Number of processes for multi-process tests.
* `block_size_bytes`: The block size for gcsfs file buffering. Defaults to `16MB`.
* `chunk_size_bytes`: The size of each read operation. Defaults to `16MB`.
* `file_size_bytes`: The total size of each file.
* `rounds`: The total number of pytest-benchmark rounds for each parameterized test. Defaults to `10`.

To ensure that the results are stable and not skewed by outliers, each benchmark is run for a set number of rounds. By default this is 10, but it can be configured via the `rounds` parameter if needed. This helps provide a more accurate and reliable performance profile.

### Configurations

The base configurations in `read/configs.yaml` are simplified to just `read_seq` and `read_rand`. Decorators are then used to generate a full suite of test cases by creating variations for parallelism, file sizes, and bucket types.
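
To inspect the base configurations yourself, the file can be loaded with PyYAML (which is among the benchmark dependencies). This is only a minimal sketch; the exact schema of `configs.yaml` is defined by the benchmark suite and is not reproduced here.

```python
# Minimal sketch: dump the raw contents of the base read configuration file.
# The exact structure of configs.yaml is defined by the benchmark suite.
import yaml

with open("gcsfs/tests/perf/microbenchmarks/read/configs.yaml") as f:
    base_configs = yaml.safe_load(f)

print(base_configs)  # expected to include the read_seq and read_rand entries
```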

The benchmarks are split into three main test functions based on the execution model (see the example selection below):

* `test_read_single_threaded`: Measures baseline performance of read operations.
* `test_read_multi_threaded`: Measures performance with multiple threads.
* `test_read_multi_process`: Measures performance using multiple processes, each with its own set of threads.
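
Because these are ordinary `pytest` test functions, the standard `-k` selector can also be used to run a single execution model. This complements the `GCSFS_BENCHMARK_FILTER` mechanism described below, which filters by configuration name rather than by test function. For example:

```bash
# Run only the single-threaded read benchmarks using pytest's built-in -k filter.
pytest gcsfs/tests/perf/microbenchmarks/read/ -k test_read_single_threaded
```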

### Running Benchmarks with `pytest`

You can use `pytest` to run the benchmarks directly.
The `GCSFS_BENCHMARK_FILTER` environment variable is useful for filtering tests by configuration name.

**Examples:**

Run all read benchmarks:
```bash
pytest gcsfs/tests/perf/microbenchmarks/read/
```

Run specific benchmark configurations by setting the `GCSFS_BENCHMARK_FILTER` environment variable, which expects comma-separated configuration names.
This is useful for targeting specific configurations defined in `read/configs.yaml`.

For example, to run only the multi-process sequential and random reads:
```bash
export GCSFS_BENCHMARK_FILTER="read_seq_multi_process, read_rand_multi_process"
pytest gcsfs/tests/perf/microbenchmarks/read/
```

## Function-level Fixture: `gcsfs_benchmark_read_write`

A function-level `pytest` fixture named `gcsfs_benchmark_read_write` (defined in `conftest.py`) is used to set up and tear down the environment for the benchmarks.

### Setup and Teardown

* **Setup**: Before a benchmark function runs, this fixture creates the specified number of files with the configured size in a temporary directory within the test bucket. It uses `os.urandom()` to write data in chunks to avoid high memory usage (see the illustrative sketch below).
* **Teardown**: After the benchmark completes, the fixture recursively deletes the temporary directory and all the files created during the setup phase.

Here is how the fixture is used in a test:

```python
@pytest.mark.parametrize(
    "gcsfs_benchmark_read_write",
    single_threaded_cases,
    indirect=True,
    ids=lambda p: p.name,
)
def test_read_single_threaded(benchmark, gcsfs_benchmark_read_write):
    gcs, file_paths, params = gcsfs_benchmark_read_write
    # ... benchmark logic ...
```
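
The real fixture lives in `conftest.py` and is not reproduced in this document. The following is only an illustrative sketch of the setup/teardown pattern described above, written against the public `gcsfs` and `pytest` APIs; the fixture name, temporary-path layout, and `params` attributes are assumptions that mirror the parameter list earlier in this document.

```python
import os
import pytest
import gcsfs


# Rough sketch only -- the real fixture in conftest.py is parameterized
# indirectly and may differ in structure and naming.
@pytest.fixture
def example_benchmark_files(request):
    params = request.param  # e.g. a ReadBenchmarkParameters-like object
    gcs = gcsfs.GCSFileSystem()
    tmp_dir = f"{os.environ['GCSFS_TEST_BUCKET']}/benchmark-tmp"

    # Setup: write each file in chunks of random bytes to keep memory usage flat.
    file_paths = []
    for i in range(params.num_files):
        path = f"{tmp_dir}/file_{i}.bin"
        with gcs.open(path, "wb", block_size=params.block_size_bytes) as f:
            remaining = params.file_size_bytes
            while remaining > 0:
                chunk = min(params.chunk_size_bytes, remaining)
                f.write(os.urandom(chunk))
                remaining -= chunk
        file_paths.append(path)

    yield gcs, file_paths, params

    # Teardown: recursively remove the temporary directory and its files.
    gcs.rm(tmp_dir, recursive=True)
```

Yielding a `(gcs, file_paths, params)` tuple matches the unpacking shown in the test example above.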

### Environment Variables

To run the benchmarks, you need to configure your environment.
The orchestrator script (`run.py`) sets these variables for you, but if you are running `pytest` directly, you will need to export them yourself.

* `GCSFS_TEST_BUCKET`: The name of a regional GCS bucket.
* `GCSFS_ZONAL_TEST_BUCKET`: The name of a zonal GCS bucket.
* `GCSFS_HNS_TEST_BUCKET`: The name of an HNS-enabled GCS bucket.
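
For example, a minimal set of exports might look like this (the bucket names below are placeholders; use buckets you control):

```bash
export GCSFS_TEST_BUCKET="your-regional-bucket"
export GCSFS_ZONAL_TEST_BUCKET="your-zonal-bucket"
export GCSFS_HNS_TEST_BUCKET="your-hns-bucket"
```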

You must also set the following environment variables to ensure that the benchmarks run against the live GCS API and that experimental features are enabled:

```bash
export STORAGE_EMULATOR_HOST="https://storage.googleapis.com"
export GCSFS_EXPERIMENTAL_ZB_HNS_SUPPORT="true"
```

## Orchestrator Script (`run.py`)

An orchestrator script, `run.py`, is provided to simplify running the benchmark suite. It wraps `pytest`, sets up the necessary environment variables, and generates a summary report.

### Parameters

The script accepts several command-line arguments:

* `--group`: The benchmark group to run (e.g., `read`).
* `--config`: The name of a specific benchmark configuration to run (e.g., `read_seq`).
* `--regional-bucket`: Name of the regional GCS bucket.
* `--zonal-bucket`: Name of the zonal GCS bucket.
* `--hns-bucket`: Name of the HNS GCS bucket.
* `--log`: Set to `true` to enable `pytest` console logging.
* `--log-level`: Sets the log level (e.g., `INFO`, `DEBUG`).

**Important Notes:**
* You must provide at least one bucket name (`--regional-bucket`, `--zonal-bucket`, or `--hns-bucket`).

Run the script with `--help` to see all available options:
```bash
python gcsfs/tests/perf/microbenchmarks/run.py --help
```

### Examples

Here are some examples of how to use the orchestrator script from the root of the `gcsfs` repository.

Run all available benchmarks against a regional bucket with default settings. This is the simplest way to trigger all tests across all groups (e.g., read, write):
```bash
python gcsfs/tests/perf/microbenchmarks/run.py --regional-bucket your-regional-bucket
```

Run only the `read` group benchmarks against a regional bucket with the default 128MB file size:
```bash
python gcsfs/tests/perf/microbenchmarks/run.py --group read --regional-bucket your-regional-bucket
```

Run only the single-threaded sequential read benchmark with 256MB and 512MB file sizes:
```bash
python gcsfs/tests/perf/microbenchmarks/run.py \
    --group read \
    --config "read_seq" \
    --regional-bucket your-regional-bucket
```

Run all read benchmarks against both a regional and a zonal bucket:
```bash
python gcsfs/tests/perf/microbenchmarks/run.py \
    --group read \
    --regional-bucket your-regional-bucket \
    --zonal-bucket your-zonal-bucket
```

### Script Output

The script creates a timestamped directory in `gcsfs/tests/perf/microbenchmarks/__run__/` containing the JSON and CSV results, and prints a summary table to the console.

#### JSON File (`results.json`)

The `results.json` file contains a structured representation of the benchmark results.
The exact content can vary depending on the pytest-benchmark version and the tests run, but it typically includes:
* `machine_info`: Details about the system where the benchmarks were run (e.g., Python version, OS, CPU).
* `benchmarks`: A list of individual benchmark results, each containing:
  * `name`: The name of the benchmark test.
  * `stats`: Performance statistics like min, max, mean, stddev, rounds, iterations, ops (operations per second), q1, q3 (quartiles).
  * `options`: Configuration options used for the benchmark (e.g., `min_rounds`, `max_time`).
  * `extra_info`: Any additional information associated with the benchmark.
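
As a quick way to inspect these results programmatically, the file can be read with the standard library. This is a minimal sketch that assumes the typical pytest-benchmark JSON layout described above; the path is an example, since `run.py` writes into a timestamped directory under `__run__/`.

```python
import json

# Example path only; run.py places results.json inside a timestamped
# directory under gcsfs/tests/perf/microbenchmarks/__run__/.
with open("results.json") as f:
    report = json.load(f)

print(report.get("machine_info", {}))
for bench in report.get("benchmarks", []):
    stats = bench["stats"]
    print(f'{bench["name"]}: mean={stats["mean"]:.4f}s, min={stats["min"]:.4f}s')
```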

#### CSV File (`results.csv`)
The CSV file provides a detailed performance profile of gcsfs operations, allowing for analysis of how factors such as threading, process parallelism, and access patterns affect I/O throughput.
This file is a summarized view of the results in the JSON file. For each test run, it records detailed performance statistics, including:
* Minimum, maximum, mean, and median execution times in seconds.
* Standard deviation and percentile values (p90, p95, p99) for timing.
* The maximum throughput achieved, measured in megabytes per second (MB/s).
* The maximum CPU and memory used during the test.

#### Summary Table
The script also prints a summary table like the one below for a quick glance at the results.

| Bucket Type | Group | Pattern | Files | Threads | Processes | File Size (MB) | Chunk Size (MB) | Block Size (MB) | Min Latency (s) | Mean Latency (s) | Max Throughput (MB/s) | Max CPU (%) | Max Memory (MB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| regional | read | seq | 1 | 1 | 1 | 128.00 | 16.00 | 16.00 | 0.6391 | 0.7953 | 200.2678 | 0.26 | 507 |
| regional | read | rand | 1 | 1 | 1 | 128.00 | 16.00 | 16.00 | 0.6537 | 0.7843 | 195.8066 | 5.6 | 510 |

> **Reviewer:** Style: comments are harder to read with these long lines; recommend sticking to 80-character limits.
>
> I would also test against `concurrent.interpreters`, since it is now available in Python 3.14, and also against free-threaded builds if all the upstream dependencies support them.

> **Author:** `concurrent.interpreters` is something I will take up in future PRs.