This repository contains general functionalities and tools for the STRONG AYA project. They are designed to be used with the Vantage6 framework for federated analytics and learning, and are intended to simplify the development of Vantage6 algorithms.
The code in this repository is available as a Python library on GitHub and can be referenced directly with pip.
The functions are organised into the following sections:
- General Statistics: Functions for calculating general statistics such as mean, median, and standard deviation;
- Privacy Measures: Functions to enhance privacy protection such as sample size thresholding and differential privacy;
- Miscellaneous: Functions that do not necessarily fit into the other categories, such as data type setting and data stratification.
The library provides functions that can be included in a Vantage6 algorithm as the algorithm developer sees fit. The functions are designed to be modular and can be used independently or in combination with other functions.
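For instance, a helper from the miscellaneous module can be used on its own, outside a full Vantage6 task. The following is a minimal sketch only; the structure of the variable specification dictionary is an assumption and may differ from what the library actually expects.

```python
# Minimal standalone sketch; the variable specification structure below is an
# assumption and may differ from what the library actually expects.
import pandas as pd

from vantage6_strongaya_general.miscellaneous import safe_log, set_datatypes

df = pd.DataFrame({"age": ["34", "45", "56"], "sex": ["male", "female", "male"]})

# Hypothetical variable specification
variables_to_analyse = {"age": {"datatype": "int"}, "sex": {"datatype": "str"}}

safe_log("info", "Setting datatypes for a local DataFrame.")
df = set_datatypes(df, variables_to_analyse)
```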
The library can be included in your Vantage6 algorithm by listing it in the requirements.txt and setup.py files of your algorithm.
For the requirements.txt file, add the following line:

```
git+https://github.com/STRONGAYA/v6-tools-general.git@v1.0.3
```
For the setup.py file, you can add the following line to the install_requires list:
"vantage6-strongaya-general @ git+https://github.com/STRONGAYA/v6-tools-general.git@v1.0.3",The algorithm's setup.py, particularly the install_requirements, section file should then look something like this:
```python
from os import path
from codecs import open

from setuptools import setup, find_packages

# We are using a README.md, if you do not have this in your folder, simply replace this with a string.
here = path.abspath(path.dirname(__file__))
with open(path.join(here, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='v6-not-an-actual-algorithm',
    version="1.0.0",
    description='Fictive Vantage6 algorithm that performs general statistics computation.',
    long_description=long_description,
    long_description_content_type='text/markdown',
    url='https://github.com/STRONGAYA/v6-not-an-actual-algorithm',
    packages=find_packages(),
    python_requires='>=3.10',
    install_requires=[
        'vantage6-algorithm-tools',
        'numpy',
        'pandas',
        "vantage6-strongaya-general @ git+https://github.com/STRONGAYA/v6-tools-general.git@v1.0.3"
        # other dependencies
    ]
)
```

Example usage of various functions in a central (aggregating) section of a Vantage6 algorithm:
```python
from vantage6.algorithm.tools.decorators import algorithm_client
from vantage6.algorithm.client import AlgorithmClient

# General federated algorithm functions
from vantage6_strongaya_general.miscellaneous import collect_organisation_ids, safe_log
from vantage6_strongaya_general.general_statistics import compute_aggregate_general_statistics


@algorithm_client
def central(client: AlgorithmClient, variables_to_analyse: dict, variables_to_stratify: dict = None,
            organisation_ids: list[int] = None) -> dict | None:
    """
    Central function to compute general statistics of specified variables from multiple organisations.

    Args:
        client (AlgorithmClient): The client to communicate with the vantage6 server.
        variables_to_analyse (dict): Dictionary of variables to analyse.
        variables_to_stratify (dict, optional): Dictionary of variables to stratify. Defaults to None.
        organisation_ids (list[int], optional): List of organisation IDs to include. Defaults to None.

    Returns:
        dict|None: A dictionary containing the aggregated general statistics,
                   or None if the input structure is incorrect.
    """
    # Collect all organisations that participate in this collaboration unless specified
    organisation_ids = collect_organisation_ids(organisation_ids, client)

    # Create the subtask for general statistics
    safe_log("info", "Creating subtask to calculate general statistics.")
    input_ = {"method": "partial_general_statistics",
              "kwargs": {
                  "variables_to_analyse": variables_to_analyse,
                  "variables_to_stratify": variables_to_stratify}
              }
    task_general_statistics = client.task.create(input_, organisation_ids,
                                                 "General Statistics",
                                                 "This subtask determines the general statistics of specified variables.")

    # Wait for the node(s) to return the results of the subtask
    safe_log("info", f"Waiting for results of task {task_general_statistics.get('id')}")
    results_general_statistics = client.wait_for_results(task_general_statistics.get("id"))
    safe_log("info", f"Results of task {task_general_statistics.get('id')} obtained")

    # Aggregate the general statistics
    results_general_statistics = compute_aggregate_general_statistics(results_general_statistics)

    # Return the final results of the algorithm
    return results_general_statistics
```

Example usage of various functions in a node (participating) section of a Vantage6 algorithm:
```python
import pandas as pd

from vantage6.algorithm.tools.util import info
from vantage6.algorithm.tools.decorators import data, algorithm_client
from vantage6.algorithm.client import AlgorithmClient

# General federated algorithm functions
from vantage6_strongaya_general.general_statistics import compute_local_general_statistics
from vantage6_strongaya_general.miscellaneous import apply_data_stratification, set_datatypes, safe_log
from vantage6_strongaya_general.privacy_measures import (apply_sample_size_threshold, mask_unnecessary_variables,
                                                         apply_differential_privacy)


@data(1)
@algorithm_client
def partial_general_statistics(client: AlgorithmClient, df: pd.DataFrame, variables_to_analyse: dict,
                               variables_to_stratify: dict = None) -> dict:
    """
    Execute the partial algorithm for general statistics computation.

    Args:
        client (AlgorithmClient): The client to communicate with the vantage6 server.
        df (pd.DataFrame): The DataFrame containing the data to be processed.
        variables_to_analyse (dict): Dictionary of variables to analyse.
        variables_to_stratify (dict, optional): Dictionary of variables to stratify. Defaults to None.

    Returns:
        dict: A dictionary containing the computed general statistics.
    """
    safe_log("info", "Executing partial algorithm computation of general statistics.")

    # Set datatypes for each variable
    df = set_datatypes(df, variables_to_analyse)

    # Apply stratification if necessary
    df = apply_data_stratification(df, variables_to_stratify)

    # Ensure that the sample size threshold is met
    df = apply_sample_size_threshold(client, df, variables_to_analyse)

    # Mask unnecessary variables by removal
    df = mask_unnecessary_variables(df, variables_to_analyse)

    # Apply differential privacy (Laplace mechanism as per default)
    df = apply_differential_privacy(df, variables_to_analyse, epsilon=1.0, return_type='dataframe')

    # Compute general statistics
    result = compute_local_general_statistics(df, variables_to_stratify)

    return result
```

The various functions are available through pip install for debugging and testing purposes.
The library can be installed as follows:
```
pip install git+https://github.com/STRONGAYA/v6-tools-general.git
```

This repository includes a comprehensive testing framework to ensure the reliability and correctness of all functions, especially in federated scenarios.
```
tests/
├── conftest.py                           # Common fixtures and test utilities
├── unit/                                 # Unit tests for individual functions
│   ├── test_general_statistics.py        # Tests for statistical functions
│   ├── test_miscellaneous.py             # Tests for utility functions
│   └── test_privacy_measures.py          # Tests for privacy functions
├── integration/                          # Integration tests
│   └── test_stratification.py            # Data stratification workflows
├── empirical/                            # Empirical validation tests
│   └── test_federated_vs_centralised.py  # Federated vs centralised comparisons
└── utils/                                # Test helper utilities
    └── test_helpers.py                   # Validation and comparison tools
```
Install test dependencies:
```
pip install pytest pytest-cov pytest-mock hypothesis faker
```

```
# Run all tests
pytest

# Run unit tests only
pytest tests/unit/

# Run integration tests only
pytest tests/integration/

# Run empirical tests only
pytest tests/empirical/

# Run with coverage report
pytest --cov=vantage6_strongaya_general --cov-report=html

# Run specific test module
pytest tests/unit/test_general_statistics.py

# Run with verbose output
pytest -v
```

- Unit Tests: Test individual functions in isolation
- Integration Tests: Test complete workflows and component interactions
- Empirical Tests: Validate federated vs centralised mathematical equivalence
- Performance Tests: Benchmark function performance with large datasets
- Edge Case Tests: Test behaviour with unusual data distributions
The test suite includes comprehensive empirical validation that federated statistical computations produce equivalent results to their centralised counterparts:
```python
# Example: Testing federated statistics match centralised
def test_federated_equals_centralised():
    # Split data across organisations
    federated_data = split_by_organisation(test_data)

    # Compute federated results
    local_results = [compute_local_stats(org_data) for org_data in federated_data]
    federated_result = aggregate_results(local_results)

    # Compute centralised result on the pooled data
    centralised_result = compute_centralised_stats(test_data)

    # Validate equivalence
    assert_federated_equals_centralised(federated_result, centralised_result)
```

The test suite uses synthetic datasets that (see the generation sketch after this list):
- Cover various statistical distributions (normal, skewed, uniform)
- Include edge cases (small samples, missing data, outliers)
- Simulate realistic medical research scenarios
- Test privacy-preserving mechanisms
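A hypothetical sketch of how synthetic data with these properties could be generated; the column names, distributions, and injected edge cases are illustrative assumptions rather than the repository's actual fixtures.

```python
# Hypothetical synthetic data generation; names and distributions are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

synthetic_df = pd.DataFrame({
    "age": rng.normal(loc=55, scale=12, size=n).round(),           # roughly normal
    "tumour_size_mm": rng.lognormal(mean=2.5, sigma=0.6, size=n),  # right-skewed
    "visit_interval_days": rng.uniform(low=30, high=120, size=n),  # uniform
    "sex": rng.choice(["male", "female"], size=n),
})

# Edge cases: sprinkle missing values and add an outlier
missing_idx = synthetic_df.sample(frac=0.05, random_state=0).index
synthetic_df.loc[missing_idx, "tumour_size_mm"] = np.nan
synthetic_df.loc[0, "age"] = 104
```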
Tests run automatically on every push and pull request via GitHub Actions:
- Multiple Python versions (starting with 3.10)
- Code coverage reporting
- Performance benchmarking
- Security scanning
Some empirical tests may occasionally fail due to the inherent mathematical differences between federated and centralised computations:
Empirical Tests (tests/empirical/):
- test_single_organisation_equivalence: May fail due to division-by-zero issues in quantile calculations for single organisations. This does not affect multi-organisation federated scenarios.
- test_mixed_distribution_quantiles: May fail when organisations have very different data distributions, as federated quantiles mathematically differ from centralised ones when internal distributions vary significantly.
What this means for usage:
- These failures are mathematical expectations, not bugs
- Federated quantiles with mixed distributions across organisations will naturally differ from centralised calculations, as illustrated in the sketch after this list
- Single organisation scenarios work correctly in practice, but may have edge cases in quantile computation
- All basic statistics (mean, count, min, max) maintain mathematical equivalence between federated and centralised approaches
- Standard deviation comparisons allow for a tolerance of ±15% relative or ±0.5 absolute due to federated computation characteristics
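To make the quantile caveat concrete, the sketch below averages per-organisation quantiles as a simplified stand-in for federated aggregation; this averaging is an assumption used only to illustrate the divergence, not a description of the library's actual aggregation method.

```python
# Illustrative only: naive averaging of per-organisation quantiles is an assumed
# stand-in for federated aggregation, used to show why mixed distributions diverge.
import numpy as np

rng = np.random.default_rng(42)

# Two organisations with very different internal distributions
org_a = rng.normal(loc=50, scale=5, size=1_000)
org_b = rng.exponential(scale=20, size=1_000)

# Centralised quantile over the pooled data
centralised_q3 = np.quantile(np.concatenate([org_a, org_b]), 0.75)

# Naive federated estimate: average of the per-organisation quantiles
federated_q3 = np.mean([np.quantile(org_a, 0.75), np.quantile(org_b, 0.75)])

print(f"Centralised Q3: {centralised_q3:.2f} vs naive federated Q3: {federated_q3:.2f}")
```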
Unit and Integration Tests: All should pass consistently, as they test core functionality and realistic federated workflows.
When contributing new functionality:
- Add unit tests for all new functions
- Add integration tests for complete workflows
- Add empirical tests for federated vs centralised scenarios
- Include edge case testing for robustness
- Update test data if needed for new scenarios
- Maintain an acceptable degree of code coverage
- Use descriptive test names that explain what is being tested
- Include both positive and negative test cases
- Test edge cases and error conditions
- Use realistic synthetic data
- Mock external dependencies (AlgorithmClient, environment variables), as sketched after this list
- Validate both structure and values of results
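A hypothetical sketch of the mocking and validation pattern with pytest-mock; `count_rows_per_variable` is a stand-in for whichever function is under test, not part of this library.

```python
# Hypothetical test sketch; count_rows_per_variable stands in for a real library function.
import pandas as pd


def count_rows_per_variable(client, df, variables):
    # Stand-in for a function that would normally also use the AlgorithmClient
    return {variable: int(df[variable].notna().sum()) for variable in variables}


def test_count_rows_per_variable(mocker):
    # Mock the client so no real vantage6 server is contacted
    mock_client = mocker.MagicMock(name="AlgorithmClient")

    df = pd.DataFrame({"age": [34, 45, None, 61], "sex": ["male", "female", "male", None]})
    result = count_rows_per_variable(mock_client, df, ["age", "sex"])

    # Validate both structure and values of the result
    assert set(result) == {"age", "sex"}
    assert result == {"age": 3, "sex": 3}
```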
- J. Hogenboom
- V. Gouthamchand
- A. Lobo Gomes