
Conversation

@hagenw
Member

@hagenw hagenw commented Oct 21, 2025

Closes #264

Adds num_workers to

  • Backend.copy_file()
  • Backend.get_archive()
  • Backend.get_file()
  • Backend.move_file()
  • Interface.copy_file()
  • Interface.get_file()
  • Interface.move_file()

I decided to add those to the methods and not to the Backend instantiation, as this makes them easier to use.
I further decided not to expose a chunk_size argument to the user, but to let each backend decide internally what the appropriate chunk size should be.

For the Artifactory and Filesystem backends we ignore num_workers and use a single worker. For Minio we support several workers to speed up the download, using a chunk size to split the file into num_workers parts, as sketched below.
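
To illustrate the splitting, here is a minimal sketch (the helper name and signature are illustrative, not the actual implementation):

def byte_ranges(src_size: int, num_workers: int) -> list[tuple[int, int]]:
    """Split src_size bytes into (offset, length) pairs, one per worker."""
    # Never use more workers than there are bytes
    num_workers = max(1, min(num_workers, src_size))
    chunk_size = src_size // num_workers
    ranges = []
    for i in range(num_workers):
        offset = i * chunk_size
        # The last worker also takes the remainder
        length = chunk_size if i < num_workers - 1 else src_size - offset
        ranges.append((offset, length))
    return ranges

# byte_ranges(10, 3) -> [(0, 3), (3, 3), (6, 4)]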

I created audeering/audmodel#35 to test the changes in audmodel.

Archive extraction

Even though I add num_workers to get_archive() here, it is only used during file download, not during file extraction. I propose we first investigate multi-threaded file extraction at audeering/audeer#184 and add it here in a separate pull request if we think it makes sense.

Keyboard interruption

A user can interrupt a download with Ctrl+C. This required some additional code for num_workers > 1, as the handling inside audeer.run_tasks() is not sufficient when writing to files: the file handler would wait until response.read() finished. With the help of Claude I found a solution that ensures we now always return immediately when the user presses Ctrl+C, independent of the number of workers. The pattern is sketched below.
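
In spirit it looks like this (a simplified sketch of the cancellation pattern, not the exact code in minio.py):

import signal
import threading

cancel_event = threading.Event()

def _handler(signum, frame):
    # Signal the workers instead of raising KeyboardInterrupt right away
    cancel_event.set()

previous_handler = signal.signal(signal.SIGINT, _handler)
try:
    # Run the download tasks;
    # each worker checks cancel_event between chunk reads
    # and raises KeyboardInterrupt when it is set
    ...
finally:
    # Always restore the previous SIGINT handler
    signal.signal(signal.SIGINT, previous_handler)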

Code to test user interruption
#!/usr/bin/env python
"""Test script to verify Ctrl+C handling in multi-worker downloads.

Usage:
    python test_interrupt.py

While the download is in progress, press Ctrl+C to test interrupt handling.
The download should stop quickly and clean up the partial file.
"""
import os
import audbackend

# Configure your MinIO/S3 backend
host = "s3.dualstack.eu-north-1.amazonaws.com"
repository = "audmodel-internal"
src_path = "/alm/audeering-omni/stage1_2/torch/7289b57d.zip"  # Large file for testing
version = "1.0.0"
dst_path = "./test_download.zip"

# Test with multiple workers
print("Testing multi-worker download (press Ctrl+C to interrupt)...")
print(f"Downloading {src_path} from {repository}...")

backend = audbackend.backend.Minio(host, repository)
interface = audbackend.interface.Maven(backend)

try:
    backend.open()
    interface.get_file(
        src_path=src_path,
        dst_path=dst_path,
        version=version,
        num_workers=5,  # Use multiple workers to test interrupt handling
        verbose=True,
    )
    print("Download completed successfully!")
except KeyboardInterrupt:
    print("\nDownload interrupted by user (Ctrl+C)")
    print(f"Partial file cleaned up: {not os.path.exists(dst_path)}")
finally:
    backend.close()
    # Clean up if file was fully downloaded
    if os.path.exists(dst_path):
        os.remove(dst_path)
        print("Test file removed")

Benchmarks

I tested the current implementation on the model 7289b57d-1.0.0 (4.2 GB) by running:

$ cd benchmarks
$ uv run --python 3.12 minio-parallel.py

on a server with 10 CPUs.

num_workers  num_iter  elapsed(avg)    elapsed(std)
          1        10  0:01:05.592122  0:00:04.613981
          2        10  0:00:23.792445  0:00:03.151314
          3        10  0:00:15.051508  0:00:00.020850
          4        10  0:00:12.270467  0:00:00.744683
          5        10  0:00:13.566350  0:00:00.284529
         10        10  0:00:13.096010  0:00:00.575895

Summary by Sourcery

Introduce a num_workers parameter to enable parallel downloads in the Minio backend, improve Ctrl+C interrupt handling with immediate cleanup, extend tests for multi-worker and interrupt scenarios, and add benchmarks.

New Features:

  • Add num_workers parameter to copy_file, get_file, and move_file methods in both Backend and Interface layers
  • Enable parallel chunked downloads in the Minio backend by splitting files into parts

Enhancements:

  • Introduce SIGINT signal handler and cancel_event to promptly abort downloads and clean up partial files
  • Pre-allocate destination files and distribute download tasks across threads for multi-worker mode
  • Ignore num_workers in Artifactory and Filesystem backends and default to single-worker behavior

Build:

  • Bump audeer dependency to >=2.3.1

Documentation:

  • Provide a benchmark script for measuring Minio download performance with varying numbers of workers

Tests:

  • Add multi-worker download tests to verify file consistency across different num_workers settings
  • Add tests for signal handler registration, cancel_event-triggered KeyboardInterrupt, and partial file cleanup on interruption

@sourcery-ai
Contributor

sourcery-ai bot commented Oct 21, 2025

Reviewer's Guide

This pull request extends the file transfer API to support parallel downloads by adding a num_workers parameter throughout the Backend and Interface layers. The Minio backend’s get_file implementation is rewritten to pre-allocate the target file, split the download into chunks, and dispatch them via audeer.run_tasks. A cancellation event and signal handler are introduced to ensure immediate cleanup on Ctrl+C. The change is covered by new tests for parallel downloads and interrupt handling, a benchmark script, and a dependency bump.

Sequence diagram for parallel file download with Minio backend

sequenceDiagram
    participant User
    participant Interface
    participant MinioBackend
    participant audeer
    User->>Interface: get_file(src_path, dst_path, version, num_workers>1)
    Interface->>MinioBackend: get_file(src_path, dst_path, num_workers)
    MinioBackend->>MinioBackend: pre-allocate dst_path
    MinioBackend->>audeer: run_tasks(_download_file, tasks, num_workers)
    audeer->>MinioBackend: _download_file(src_path, dst_path, pbar, cancel_event, offset, length)
    MinioBackend->>MinioBackend: write chunk to dst_path
    MinioBackend->>audeer: update progress bar
    Note over User,MinioBackend: If Ctrl+C pressed, cancel_event is set
    MinioBackend->>MinioBackend: cleanup partial file

Class diagram for updated Backend and Interface file transfer methods

classDiagram
    class Backend {
        +copy_file(src_path, dst_path, num_workers=1, validate=False, verbose=False)
        +get_file(src_path, dst_path, num_workers=1, validate=False, verbose=False)
        +move_file(src_path, dst_path, num_workers=1, validate=False, verbose=False)
        +get_archive(src_path, dst_root, tmp_root=None, num_workers=1, validate=False, verbose=False)
    }
    class MinioBackend {
        +_copy_file(src_path, dst_path, num_workers, verbose)
        +_get_file(src_path, dst_path, num_workers, verbose)
        +_move_file(src_path, dst_path, num_workers, verbose)
        +_download_file(src_path, dst_path, pbar, cancel_event, offset=0, length=None)
    }
    class ArtifactoryBackend {
        +_copy_file(src_path, dst_path, num_workers, verbose)
        +_get_file(src_path, dst_path, num_workers, verbose)
        +_move_file(src_path, dst_path, num_workers, verbose)
    }
    class FilesystemBackend {
        +_copy_file(src_path, dst_path, num_workers, verbose)
        +_get_file(src_path, dst_path, num_workers, verbose)
        +_move_file(src_path, dst_path, num_workers, verbose)
    }
    Backend <|-- MinioBackend
    Backend <|-- ArtifactoryBackend
    Backend <|-- FilesystemBackend
    class VersionedInterface {
        +copy_file(src_path, dst_path, version=None, num_workers=1, validate=False, verbose=False)
        +get_file(src_path, dst_path, version, num_workers=1, validate=False, verbose=False)
        +move_file(src_path, dst_path, version=None, num_workers=1, validate=False, verbose=False)
    }
    class UnversionedInterface {
        +copy_file(src_path, dst_path, num_workers=1, validate=False, verbose=False)
        +get_file(src_path, dst_path, num_workers=1, validate=False, verbose=False)
        +move_file(src_path, dst_path, num_workers=1, validate=False, verbose=False)
    }

File-Level Changes

Expose a num_workers parameter on all copy_file, get_file, and move_file methods
  • Added num_workers argument with default=1 to method signatures in Backend (base, Minio, Artifactory, Filesystem) and Interface (versioned, unversioned)
  • Propagate num_workers through base._copy_file, get_file, get_archive, and move_file implementations
  • Updated documentation and pyproject.toml dependency to require audeer >=2.3.1
audbackend/core/backend/base.py
audbackend/core/backend/minio.py
audbackend/core/backend/artifactory.py
audbackend/core/backend/filesystem.py
audbackend/core/interface/versioned.py
audbackend/core/interface/unversioned.py
pyproject.toml
Implement multi-threaded download in Minio backend with cancellation support
  • Rewrote MinioBackend._get_file to pre-allocate file, split into offsets, and schedule tasks via audeer.run_tasks
  • Extracted per-segment download logic into _download_file, reading with offset/length and updating a shared progress bar
  • Installed a SIGINT handler that sets a cancel_event for immediate termination and removes partial files on interrupt
audbackend/core/backend/minio.py
Add tests for parallel downloads and Ctrl+C interruption
  • Created helper create_file_exact_size for generating test files
  • Tested get_file with num_workers=1 vs 2 and verified identical outputs
  • Simulated SIGINT and verify cancel_event is set and _download_file raises KeyboardInterrupt
  • Ensured partial file cleanup on interrupt via monkeypatched download
tests/test_backend_minio.py
Add performance benchmark script for multi-worker download
  • Added benchmarks/minio-parallel.py to measure download times over various num_workers
  • Provided README entry with results formatting
benchmarks/minio-parallel.py
benchmarks/README.rst

Assessment against linked issues

Issue  Objective
#264   Add num_workers and chunk_size arguments to Backend.get_file() and Backend.get_archive(), allowing backends to implement parallel download or ignore the arguments if not applicable.
#264   Set default parameters: num_workers=1, chunk_size=None (let backend choose chunk size if None).
#264   Implement parallel download for MinIO backend using multiple workers, and ignore num_workers for other backends.


@codecov

codecov bot commented Oct 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (c6e6aac) to head (c3c3088).

Additional details and impacted files
Files with missing lines                   Coverage Δ
audbackend/core/backend/artifactory.py     100.0% <ø> (ø)
audbackend/core/backend/base.py            100.0% <100.0%> (ø)
audbackend/core/backend/filesystem.py      100.0% <ø> (ø)
audbackend/core/backend/minio.py           100.0% <100.0%> (ø)
audbackend/core/interface/unversioned.py   100.0% <ø> (ø)
audbackend/core/interface/versioned.py     100.0% <ø> (ø)

@hagenw hagenw marked this pull request as ready for review November 11, 2025 12:22
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • Consider refactoring num_workers into a backend-level setting (for example a constructor parameter) rather than threading it through every copy/get/move method to reduce duplication and keep signatures simpler.
  • Mutating the global SIGINT handler inside _get_file can interfere with other listeners or multithreaded contexts—consider a context manager or higher-level cancellation API instead of calling signal.signal directly.
  • Rather than writing directly to the final dst_path, download into a temp file and atomically rename on success to avoid leaving partial files if the process is killed unexpectedly.

## Individual Comments

### Comment 1
<location> `audbackend/core/backend/minio.py:305-308` </location>
<code_context>
+                # Create and run download tasks
+                tasks = []
+                chunk_size = src_size // num_workers
+                for i in range(num_workers):
+                    offset = i * chunk_size
+                    length = chunk_size if i < num_workers - 1 else src_size - offset
+                    tasks.append(
+                        ([src_path, dst_path, pbar, cancel_event, offset, length], {})
+                    )
</code_context>

<issue_to_address>
**suggestion:** Chunk calculation may result in zero-length chunks if num_workers exceeds file size.

Ensure that num_workers does not exceed src_size // chunk_size to prevent assigning zero-byte tasks. Alternatively, validate that each worker receives a non-zero chunk before task creation.

Suggested implementation:

```python
                # Create and run download tasks
                tasks = []
                # Ensure num_workers does not exceed src_size
                num_workers = min(num_workers, src_size) if src_size > 0 else 1
                chunk_size = src_size // num_workers

```

```python
                for i in range(num_workers):
                    offset = i * chunk_size
                    # Ensure each worker gets at least one byte if possible
                    length = chunk_size if i < num_workers - 1 else src_size - offset
                    if length > 0:
                        tasks.append(
                            ([src_path, dst_path, pbar, cancel_event, offset, length], {})
                        )

```
</issue_to_address>

### Comment 2
<location> `audbackend/core/backend/minio.py:334-336` </location>
<code_context>
+        chunk_size = 4 * 1024  # 4 KB
+
+        # Get the data stream
+        kwargs = {"offset": offset, "length": length} if length else {}
+        response = self._client.get_object(self.repository, src_path, **kwargs)
+
+        try:
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Passing length=None may result in unexpected behavior for get_object.

If offset is set but length is None, the current logic omits both, which may not fetch the correct data range. Recommend always including offset, and only exclude length when it is None.

```suggestion
        # Get the data stream
        kwargs = {"offset": offset}
        if length is not None:
            kwargs["length"] = length
        response = self._client.get_object(self.repository, src_path, **kwargs)
```
</issue_to_address>

### Comment 3
<location> `audbackend/core/backend/minio.py:343-349` </location>
<code_context>
+                if offset:
+                    f.seek(offset)
+
+                with pbar:
+                    while data := response.read(chunk_size):
+                        # Check if cancellation was requested
+                        if cancel_event and cancel_event.is_set():
+                            raise KeyboardInterrupt("Download cancelled by user")
+                        f.write(data)
+                        pbar.update(len(data))
+        finally:
+            response.close()
</code_context>

<issue_to_address>
**issue (bug_risk):** Using the progress bar as a context manager inside each thread may cause synchronization issues.

Simultaneous entry into the progress bar context by multiple threads can cause race conditions and inaccurate progress updates. Use thread-safe mechanisms for progress updates or manage the progress bar outside thread contexts.
</issue_to_address>
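
One possible pattern (illustrative only; `pbar` stands for a single shared progress bar created once by the coordinating thread, not the PR's actual API):

```python
import threading

# Shared lock guarding all progress updates
pbar_lock = threading.Lock()

def report_progress(pbar, nbytes: int) -> None:
    # Workers call this instead of entering the bar's context manager
    with pbar_lock:
        pbar.update(nbytes)
```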

### Comment 4
<location> `audbackend/core/backend/minio.py:315-316` </location>
<code_context>
+                audeer.run_tasks(self._download_file, tasks, num_workers=num_workers)
+        except KeyboardInterrupt:
+            # Clean up partial file
+            if os.path.exists(dst_path):
+                os.remove(dst_path)
+            raise
+        finally:
</code_context>

<issue_to_address>
**issue (bug_risk):** Removing the partial file on KeyboardInterrupt may race with other threads.

If threads are still writing when the file is removed, this could cause data corruption or errors. Use a lock or ensure all threads have finished before deleting the file.
</issue_to_address>
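
A sketch of the join-before-cleanup idea using only the standard library (illustrative; the PR dispatches tasks via audeer.run_tasks instead):

```python
import concurrent.futures
import os

def download_all(tasks, dst_path: str, num_workers: int) -> None:
    try:
        with concurrent.futures.ThreadPoolExecutor(num_workers) as pool:
            futures = [pool.submit(task) for task in tasks]
            for future in futures:
                future.result()  # re-raises worker exceptions
    except KeyboardInterrupt:
        # Leaving the `with` block waited for running workers
        # (they should observe a cancel event so this happens quickly),
        # hence removing the file cannot race with an active write
        if os.path.exists(dst_path):
            os.remove(dst_path)
        raise
```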

### Comment 5
<location> `benchmarks/README.rst:10` </location>
<code_context>
+Parallel file loading
+---------------------
+
+The ``Minio`` backend support parallel loading of files.
+It can be benchmarked with:
+
</code_context>

<issue_to_address>
**issue (typo):** Correct 'support' to 'supports' for subject-verb agreement.

Change to 'supports' for correct grammar.

```suggestion
The ``Minio`` backend supports parallel loading of files.
```
</issue_to_address>

### Comment 6
<location> `audbackend/core/backend/minio.py:212` </location>
<code_context>
    def _copy_file(
        self,
        src_path: str,
        dst_path: str,
        num_workers: int,
        verbose: bool,
    ):
        r"""Copy file on backend."""
        src_path = self.path(src_path)
        dst_path = self.path(dst_path)
        checksum = self._checksum(src_path)
        # `copy_object()` has a maximum size limit of 5GB.
        # We use 4.9GB to have some headroom
        if self._size(src_path) / 1024 / 1024 / 1024 >= 4.9:
            with tempfile.TemporaryDirectory() as tmp_dir:
                tmp_path = audeer.path(tmp_dir, os.path.basename(src_path))
                self._get_file(src_path, tmp_path, num_workers, verbose)
                self._put_file(tmp_path, dst_path, checksum, verbose)
        else:
            self._client.copy_object(
                self.repository,
                dst_path,
                minio.commonconfig.CopySource(self.repository, src_path),
                metadata=_metadata(checksum),
            )

</code_context>

<issue_to_address>
**suggestion (code-quality):** Simplify numeric comparison [×3] ([`simplify-numeric-comparison`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/simplify-numeric-comparison/))

```suggestion
        if self._size(src_path) >= 5261334937.6:
```
</issue_to_address>


@hagenw hagenw self-assigned this Nov 11, 2025
@hagenw hagenw requested a review from frankenjoe November 11, 2025 16:40
@hagenw
Member Author

hagenw commented Nov 11, 2025

When you look into the related audmodel pull request, the speedup is not as big there, but maybe I made an error.

@hagenw
Copy link
Member Author

hagenw commented Nov 13, 2025

Hey there - I've reviewed your changes - here's some feedback:

  • Consider refactoring num_workers into a backend-level setting (for example a constructor parameter) rather than threading it through every copy/get/move method to reduce duplication and keep signatures simpler.
  • Mutating the global SIGINT handler inside _get_file can interfere with other listeners or multithreaded contexts—consider a context manager or higher-level cancellation API instead of calling signal.signal directly.
  • Rather than writing directly to the final dst_path, download into a temp file and atomically rename on success to avoid leaving partial files if the process is killed unexpectedly.

These are valid points; my answers:

  • I started with that approach, as it was also implemented that way in the first proposal by @frankenjoe. But when integrating it in audmodel I realized that we would need to overwrite attributes to apply num_workers when downloading models: we usually instantiate the backend only once, but the user may then want to set num_workers per download. So I decided to make it part of the individual methods.
  • This might be correct, but as I did not know anything about SIGINT I stayed with the suggestion from Claude. Maybe I should spend more time here and come up with a proper solution?
  • Writing to a temp file sounds like a good suggestion; a sketch follows below. I could add it in this pull request or work on it in a follow-up one.
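
A minimal sketch of that pattern (names are illustrative; it assumes the download logic is passed in as a callable):

import os
import tempfile

def atomic_download(download, dst_path: str) -> None:
    # The temp file must live in the same directory (and filesystem)
    # as dst_path for the rename to be atomic
    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir)
    os.close(fd)
    try:
        download(tmp_path)  # write everything to the temp file first
        os.replace(tmp_path, dst_path)  # atomic rename on success
    finally:
        # On any failure the partial temp file is removed;
        # after a successful rename it no longer exists
        if os.path.exists(tmp_path):
            os.remove(tmp_path)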
