Skip to content

STRONGAYA/v6-tools-general

Repository files navigation

STRONG AYA's General Vantage6 tools

Test status Coverage Python 3.10+ Licence: Apache 2.0 STRONG AYA Algorithm Guideline Conformity: v1.0.3 Pending
Code style: black Linting: flake8 Type checking: mypy Security: bandit Security: safety

Purpose of this repository

This repository contains general functionalities and tools for the STRONG AYA project. They are designed to be used with the Vantage6 framework for federated analytics and learning and are intended to facilitate and simplify the development of Vantage6 algorithms.

The code in this repository is available as a Python library here on GitHub or through direct reference with pip.

Structure of the repository

The various functions are organised in different sections, consisting of:

  • General Statistics: Functions for calculating general statistics such as mean, median, and standard deviation;
  • Privacy Measures: Functions to enhance privacy protection such as sample size thresholding and differential privacy;
  • Miscellaneous: Functions that do not necessarily fit into the other categories, such as data type setting, and data stratification.

Usage

The library provides functions that can be included in a Vantage6 algorithm as the algorithm developer sees fit. The functions are designed to be modular and can be used independently or in combination with other functions.

The library can be included in your Vantage6 algorithm by listing it in the requirements.txt and setup.py file of your algorithm.

Including the library in your Vantage6 algorithm

For the requirements.txt file, you can add the following line to the file:

git+https://github.com/STRONGAYA/v6-tools-general.git@v1.0.3

For the setup.py file, you can add the following line to the install_requires list:

        "vantage6-strongaya-general @ git+https://github.com/STRONGAYA/v6-tools-general.git@v1.0.3",

The algorithm's setup.py, particularly the install_requirements, section file should then look something like this:

from os import path
from codecs import open
from setuptools import setup, find_packages

# We are using a README.md, if you do not have this in your folder, simply replace this with a string.
here = path.abspath(path.dirname(__file__))
with open(path.join(here, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()
setup(
    name='v6-not-an-actual-algorithm',
    version="1.0.0",
    description='Fictive Vantage6 algorithm that performs general statistics computation.',
    long_description=long_description,
    long_description_content_type='text/markdown',
    url='https://github.com/STRONGAYA/v6-not-an-actual-algorithm',
    packages=find_packages(),
    python_requires='>=3.10',
    install_requires=[
        'vantage6-algorithm-tools',
        'numpy',
        'pandas',
        "vantage6-strongaya-general @ git+https://github.com/STRONGAYA/v6-tools-general.git@v1.0.3"
        # other dependencies
    ]
)

Central (aggregating) example

Example usage of various functions in a central (aggregating) section of a Vantage6 algorithm:

from vantage6.algorithm.tools.decorators import algorithm_client
from vantage6.algorithm.client import AlgorithmClient

# General federated algorithm functions
from vantage6_strongaya_general.miscellaneous import collect_organisation_ids, safe_log
from vantage6_strongaya_general.general_statistics import compute_aggregate_general_statistics


@algorithm_client
def central(client: AlgorithmClient, variables_to_analyse: dict, variables_to_stratify: dict = None,
            organisation_ids: list[int] = None) -> dict | None:
    """
    Central function to compute general statistics of specified variables from multiple organisations.

    Args:
        client (AlgorithmClient): The client to communicate with the vantage6 server.
        variables_to_analyse (dict): Dictionary of variables to analyse.
        variables_to_stratify (dict, optional): Dictionary of variables to stratify. Defaults to None.
        organisation_ids (list[int], optional): List of organisation IDs to include. Defaults to None.

    Returns:
        dict|None: A dictionary containing the aggregated EORTC QLQ scoring results,
        or None if the input structure is incorrect.
    """

    # Collect all organisations that participate in this collaboration unless specified
    organisation_ids = collect_organisation_ids(organisation_ids, client)

    # Create the subtask for general statistics
    safe_log("info", "Creating subtask to calculate general statistics.")

    input_ = {"method": "partial",
              "kwargs": {
                  "variables_to_analyse": variables_to_analyse,
                  "variables_to_stratify": variables_to_stratify}
              }

    task_general_statistics = client.task.create(input_, organisation_ids,
                                                 "General Statistics",
                                                 "This subtask determines the general statistics of specified variables.")

    # Wait for the node(s) to return the results of the subtask
    safe_log("info", f"Waiting for results of task {task_general_statistics.get('id')}")
    results_general_statistics = client.wait_for_results(task_general_statistics.get("id"))
    safe_log("info", f"Results of task {task_general_statistics.get('id')} obtained")

    # Aggregate the general statistics
    results_general_statistics = compute_aggregate_general_statistics(results_general_statistics)

    # Return the final results of the algorithm
    return results_general_statistics

Node or local (participating) example

Example usage of various functions in a node (participating) section of a Vantage6 algorithm:

import pandas as pd

from vantage6.algorithm.tools.util import info
from vantage6.algorithm.tools.decorators import data, algorithm_client
from vantage6.algorithm.client import AlgorithmClient

# General federated algorithm functions
from vantage6_strongaya_general.general_statistics import compute_local_general_statistics
from vantage6_strongaya_general.miscellaneous import apply_data_stratification, set_datatypes, safe_log
from vantage6_strongaya_general.privacy_measures import apply_sample_size_threshold, mask_unnecessary_variables,

apply_differential_privacy


@data(1)
@algorithm_client
def partial_general_statistics(client: AlgorithmClient, df: pd.DataFrame, variables_to_analyse: dict,
                               variables_to_stratify: dict = None) -> dict:
    """
    Execute the partial algorithm for general statistics computation.

    Args:
        client (AlgorithmClient): The client to communicate with the vantage6 server.
        df (pd.DataFrame): The DataFrame containing the data to be processed.
        variables_to_analyse (dict): Dictionary of variables to analyse.
        variables_to_stratify (dict, optional): Dictionary of variables to stratify. Defaults to None.

    Returns:
        dict: A dictionary containing the computed general statistics.
    """
    safe_log("info", "Executing partial algorithm computation of general statistics.")

    # Set datatypes for each variable
    df = set_datatypes(df, variables_to_analyse)

    # Apply stratification if necessary
    df = apply_data_stratification(df, variables_to_stratify)

    # Ensure that the sample size threshold is met
    df = apply_sample_size_threshold(client, df, variables_to_analyse)

    # Mask unnecessary variables by removal
    df = mask_unnecessary_variables(df, variables_to_analyse)

    # Apply differential privacy (Laplace mechanism as per default)
    df = apply_differential_privacy(df, variables_to_analyse, epsilon=1.0, return_type='dataframe')

    # Compute general statistics
    result = compute_local_general_statistics(df, variables_to_stratify)

    return result

The various functions are available through pip install for debugging and testing purposes. The library can be installed as follows:

pip install git+https://github.com/STRONGAYA/v6-tools-general.git

Testing

This repository includes a comprehensive testing framework to ensure the reliability and correctness of all functions, especially in federated scenarios.

Test Structure

tests/
├── conftest.py                           # Common fixtures and test utilities
├── unit/                                 # Unit tests for individual functions
│   ├── test_general_statistics.py        # Tests for statistical functions
│   ├── test_miscellaneous.py            # Tests for utility functions
│   └── test_privacy_measures.py         # Tests for privacy functions
├── integration/                          # Integration tests
│   └── test_stratification.py           # Data stratification workflows
├── empirical/                            # Empirical validation tests
│   └── test_federated_vs_centralised.py # Federated vs centralised comparisons
└── utils/                               # Test helper utilities
    └── test_helpers.py                  # Validation and comparison tools

Running Tests

Prerequisites

Install test dependencies:

pip install pytest pytest-cov pytest-mock hypothesis faker

Basic Test Execution

# Run all tests
pytest

# Run unit tests only
pytest tests/unit/

# Run integration tests only
pytest tests/integration/

# Run empirical tests only
pytest tests/empirical/

# Run with coverage report
pytest --cov=vantage6_strongaya_general --cov-report=html

# Run specific test module
pytest tests/unit/test_general_statistics.py

# Run with verbose output
pytest -v

Test Categories

  • Unit Tests: Test individual functions in isolation
  • Integration Tests: Test complete workflows and component interactions
  • Empirical Tests: Validate federated vs centralised mathematical equivalence
  • Performance Tests: Benchmark function performance with large datasets
  • Edge Case Tests: Test behaviour with unusual data distributions

Federated vs Centralised Validation

The test suite includes comprehensive empirical validation that federated statistical computations produce equivalent results to their centralised counterparts:

# Example: Testing federated statistics match centralised
def test_federated_equals_centralised():
    # Split data across organisations
    federated_data = split_by_organisation(test_data)

    # Compute federated results
    local_results = [compute_local_stats(org_data) for org_data in federated_data]
    federated_result = aggregate_results(local_results)

    # Compute centralised result
    centralised_result = compute_centralised_stats(combined_data)

    # Validate equivalence
    assert_federated_equals_centralised(federated_result, centralised_result)

Test Data

The test suite uses synthetic datasets that:

  • Cover various statistical distributions (normal, skewed, uniform)
  • Include edge cases (small samples, missing data, outliers)
  • Simulate realistic medical research scenarios
  • Test privacy-preserving mechanisms

Continuous Integration

Tests run automatically on every push and pull request via GitHub Actions:

  • Multiple Python versions (starting with 3.10)
  • Code coverage reporting
  • Performance benchmarking
  • Security scanning

Known Test Failures

Some empirical tests may occasionally fail due to the inherent mathematical differences between federated and centralised computations:

Empirical Tests (tests/empirical/):

  • test_single_organisation_equivalence: May fail due to division-by-zero issues in quantile calculations for single organisations. This doesn't affect multi-organisation federated scenarios.
  • test_mixed_distribution_quantiles: May fail when organisations have very different data distributions, as federated quantiles mathematically differ from centralised ones when internal distributions vary significantly.

What this means for usage:

  • These failures are mathematical expectations, not bugs
  • Federated quantiles with mixed distributions across organisations will naturally differ from centralised calculations
  • Single organisation scenarios work correctly in practice, but may have edge cases in quantile computation
  • All basic statistics (mean, count, min, max) maintain mathematical equivalence between federated and centralised approaches
  • Standard deviation allows for appropriate tolerance (±15% relative, ±0.5 absolute) due to federated computation characteristics

Unit and Integration Tests: All should pass consistently, as they test core functionality and realistic federated workflows.

Contributing to Tests

When contributing new functionality:

  1. Add unit tests for all new functions
  2. Add integration tests for complete workflows
  3. Add empirical tests for federated vs centralised scenarios
  4. Include edge case testing for robustness
  5. Update test data if needed for new scenarios
  6. Maintain acceptable degree of code coverage

Test Guidelines

  • Use descriptive test names that explain what is being tested
  • Include both positive and negative test cases
  • Test edge cases and error conditions
  • Use realistic synthetic data
  • Mock external dependencies (AlgorithmClient, environment variables)
  • Validate both structure and values of results

Contributers

  • J. Hogenboom
  • V. Gouthamchand
  • A. Lobo Gomes

References

About

Tools for general computations and data handling in federated learning using Vantage6.

Topics

Resources

License

Stars

Watchers

Forks

Contributors 5

Languages