This repository contains Resource Description Framework (RDF) functionality and tools for the STRONG AYA project. They are designed for use with the Vantage6 framework for federated analytics and learning, and are intended to facilitate and simplify the development of Vantage6 algorithms. The SPARQL queries and RDF functions are designed to be used in conjunction with the Flyover and Triplifier tools.
The code in this repository is available as a Python library here on GitHub and can be installed by referencing the repository directly with pip.
The functions are organised into the following sections:
- RDF Data Collection: Functions to formulate and execute a SPARQL query on an RDF/SPARQL endpoint;
- Data Processing: Functions to process the output of an RDF/SPARQL endpoint (e.g. determine missing values, extract associated subclasses);
- Query Templates: SPARQL query templates used by the RDF data collection functions.
The library provides functions that can be included in a Vantage6 algorithm as the algorithm developer sees fit. The functions are designed to be modular and can be used independently or in combination with other functions.
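To give a feel for what the RDF data collection section does conceptually, the following is a minimal sketch of formulating and executing a SPARQL query against an RDF/SPARQL endpoint in plain Python. It uses the SPARQLWrapper package and a generic query purely for illustration; the endpoint URL matches the local example used later in this README, and none of this reflects the library's actual query templates or internals.

```python
# Minimal sketch: formulate and execute a SPARQL query on an RDF/SPARQL endpoint.
# SPARQLWrapper, the endpoint URL, and the query itself are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:7200/repositories/userRepo"  # e.g. a local RDF store

# A simple "single column"-style query that fetches one value per subject
QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?subject ?value
WHERE {
    ?subject rdf:value ?value .
}
LIMIT 100
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
response = sparql.query().convert()

# Flatten the SPARQL JSON results into a plain list of values
values = [binding["value"]["value"] for binding in response["results"]["bindings"]]
print(f"Retrieved {len(values)} values from the endpoint.")
```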
The library can be included in your Vantage6 algorithm by listing it in the requirements.txt and setup.py files of your algorithm.
For the requirements.txt file, you can add the following line:

```text
git+https://github.com/STRONGAYA/v6-tools-rdf.git@v1.0.1
```
For the setup.py file, you can add the following line to the install_requires list:
"vantage6-strongaya-rdf @ git+https://github.com/STRONGAYA/v6-tools-rdf.git@v1.0.1",The algorithm's setup.py, particularly the install_requirements, section file should then look something like this:
```python
from os import path
from codecs import open

from setuptools import setup, find_packages

# We are using a README.md; if you do not have this in your folder, simply replace this with a string.
here = path.abspath(path.dirname(__file__))
with open(path.join(here, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='v6-not-an-actual-algorithm',
    version="1.0.1",
    description='Fictive Vantage6 algorithm that performs general statistics computation.',
    long_description=long_description,
    long_description_content_type='text/markdown',
    url='https://github.com/STRONGAYA/v6-not-an-actual-algorithm',
    packages=find_packages(),
    python_requires='>=3.10',
    install_requires=[
        'vantage6-algorithm-tools',
        'numpy',
        'pandas',
        "vantage6-strongaya-rdf @ git+https://github.com/STRONGAYA/v6-tools-rdf.git@v1.0.1",
        # other dependencies
    ]
)
```

The functions included in this library focus on extracting RDF data from a SPARQL endpoint. It is not recommended to use these functions in the central (aggregating) section of a Vantage6 algorithm.
Example usage of the SPARQL data collection function in a node (participating) section of a Vantage6 algorithm:
```python
# General federated algorithm functions
from vantage6_strongaya_general.miscellaneous import safe_log
# RDF data collection function from this library
from vantage6_strongaya_rdf.collect_sparql_data import collect_sparql_data


def partial_general_statistics(variables_to_analyse: dict) -> dict:
    """
    Execute the partial algorithm for some modelling using RDF data.

    Args:
        variables_to_analyse (dict): Variables to analyse.

    Returns:
        dict: A dictionary containing the computed general statistics.
    """
    safe_log("info", "Executing partial algorithm for some modelling using RDF data.")

    # Collect the relevant data from the local RDF/SPARQL endpoint
    df = collect_sparql_data(variables_to_analyse, query_type="single_column",
                             endpoint="http://localhost:7200/repositories/userRepo")

    # Ensure that the desired privacy measures are applied

    # Do some modelling of the data
    # (placeholder: report the number of collected records as a trivial example statistic)
    result = {"sample_size": len(df)}

    return result
```
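For context, the central (aggregating) part of such an algorithm would typically only combine the partial results returned by the participating nodes and would not query RDF data itself. The sketch below is a hypothetical, simplified central function; its name, the result structure, and the aggregation logic are assumptions for illustration and not part of this library.

```python
def central_general_statistics(partial_results: list) -> dict:
    """
    Hypothetical central part: combine partial results from the nodes
    without touching any RDF/SPARQL endpoint.

    Args:
        partial_results (list): One result dictionary per participating node,
                                e.g. {"sample_size": 123}.

    Returns:
        dict: Aggregated statistics across all nodes.
    """
    # Sum a simple count across nodes as an illustrative aggregation step
    total_sample_size = sum(partial.get("sample_size", 0) for partial in partial_results)

    return {
        "number_of_nodes": len(partial_results),
        "total_sample_size": total_sample_size,
    }
```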
The various functions are available through pip install for debugging and testing purposes. The library can be installed as follows:
```bash
pip install git+https://github.com/STRONGAYA/v6-tools-rdf.git
```

This repository includes a comprehensive testing framework to ensure the reliability and correctness of all functions, especially whether RDF data is queryable when the library is run as a Docker container within a Vantage6 node.
```text
tests/
├── conftest.py                            # Common fixtures and test utilities
├── unit/                                  # Unit tests for individual functions
│   └── test_library_functions.py          # Tests for library functions
├── integration/                           # Integration tests
│   ├── test_vantage6_integration.py       # Data stratification workflows
│   └── test_rdf_algorithm_integration.py  # Vantage6 algorithm integration tests
├── mock_algorithm/                        # Mock Vantage6 algorithm to be used for Vantage6 integration testing
│   └── ...
└── data/                                  # Test data and configurations
    ├── additional_vantage6_*_config.yaml  # Additional Vantage6 component configurations
    ├── *.ttl                              # Triplified datasets for testing
    └── rdf_store.csv                      # RDF-store reference for the Vantage6 node
```
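As a rough illustration of how the triplified .ttl test data could be exercised, the sketch below loads a Turtle file into an in-memory graph and runs a SPARQL query against it. It uses rdflib and a placeholder file name; the actual test suite relies on the fixtures in conftest.py and, for the integration tests, on a real RDF store inside a Vantage6 node, so this is only a conceptual sketch.

```python
# Conceptual sketch: query a triplified test dataset in memory with rdflib.
# The file name 'example_dataset.ttl' is a placeholder, not an actual file in tests/data/.
from rdflib import Graph

graph = Graph()
graph.parse("tests/data/example_dataset.ttl", format="turtle")

# Count the triples that were loaded, as a basic sanity check
query = """
SELECT (COUNT(*) AS ?triple_count)
WHERE { ?subject ?predicate ?object . }
"""
for row in graph.query(query):
    print(f"Loaded {row.triple_count} triples from the test dataset.")
```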
Install test dependencies:
```bash
pip install pytest pytest-mock hypothesis faker
```

The tests can then be run as follows:

```bash
# Run all tests
pytest

# Run unit tests only
pytest tests/unit/

# Run integration tests only
pytest tests/integration/

# Run a specific test module
pytest tests/unit/test_library_functions.py

# Run with verbose output
pytest -v
```

The test suite covers the following categories:
- Unit Tests: Test individual functions in isolation
- Integration Tests: Test complete workflows and component interactions (e.g. whether data can be queried from the RDF store in a Vantage6 node)
- Edge Case Tests: Test behaviour with unusual data inputs
The test suite uses a synthetic dataset that was triplified using the Triplifier tool.
Tests run automatically on every push and pull request via GitHub Actions:
- Testing against multiple Python and Vantage6 versions (starting with Python 3.10, and Vantage6 4.11 and 4.12)
- Code coverage reporting
- Performance benchmarking
- Security scanning
When contributing new functionality:
- Add unit tests for all new functions
- Add integration tests for complete workflows
- Include edge case testing for robustness
- Ensure new query templates have corresponding tests
- Update the test data if needed for new scenarios; ensure that any new data is triplified.
- Ensure that the mock algorithm in tests/mock_algorithm covers the new functionality
- Use descriptive test names that explain what is being tested
- Include both positive and negative test cases and scenarios
- Test edge cases and error conditions
- Use realistic synthetic data
- Validate both structure and values of results (an example test illustrating these points is sketched after this list)
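As a rough illustration of these guidelines, the hypothetical test module below uses descriptive test names, covers a positive and a negative case, and validates both the structure and the values of a result. The helper count_missing_values is invented here purely for the example and is not a function of this library.

```python
# Hypothetical example tests; the helper 'count_missing_values' is defined here
# purely for illustration and is not part of this library.
import pandas as pd
import pytest


def count_missing_values(df: pd.DataFrame, column: str) -> int:
    """Return the number of missing values in a column (illustrative helper)."""
    if column not in df.columns:
        raise KeyError(f"Column '{column}' not present in the data.")
    return int(df[column].isna().sum())


def test_count_missing_values_returns_expected_count_for_column_with_gaps():
    # Positive case: realistic synthetic data with two missing entries
    df = pd.DataFrame({"tumour_stage": ["I", None, "III", None, "II"]})

    result = count_missing_values(df, "tumour_stage")

    # Validate both the type (structure) and the value of the result
    assert isinstance(result, int)
    assert result == 2


def test_count_missing_values_raises_for_unknown_column():
    # Negative case: asking for a column that does not exist should fail loudly
    df = pd.DataFrame({"tumour_stage": ["I", "II"]})

    with pytest.raises(KeyError):
        count_missing_values(df, "does_not_exist")
```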
- J. Hogenboom
- V. Gouthamchand