Tools for Resource Description Framework (RDF) data handling in federated learning using Flyover and Vantage6.

STRONGAYA/v6-tools-rdf

STRONG AYA's RDF Vantage6 tools

Test status | Python 3.10+ | Licence: Apache 2.0
Vantage6 4.11, 4.12 | Flyover version 2.0+ | STRONG AYA Algorithm Guideline Conformity: v1.0.1 Pending
Code style: black | Linting: flake8 | Type checking: mypy | Security: bandit | Security: safety

Purpose of this repository

This repository contains Resource Description Framework (RDF) functionalities and tools for the STRONG AYA project. They are designed for use with the Vantage6 framework for federated analytics and learning, and are intended to simplify the development of Vantage6 algorithms. The SPARQL queries and RDF functionalities are designed to be used in conjunction with the Flyover and Triplifier tools.

The code in this repository is available as a Python library here on GitHub or through direct reference with pip.

Structure of the repository

The various functions are organised in different sections, consisting of:

  • RDF Data Collection: Functions to formulate and execute a SPARQL query on an RDF/SPARQL endpoint;
  • Data Processing: Functions to process the output of an RDF/SPARQL endpoint (e.g. determine missing values, extract associated subclasses);
  • Query Templates: SPARQL query templates used by the RDF Data Collection functions.
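
To make the three sections above more concrete, the following is a minimal sketch of how a query template, a result parser and a missing-value check could fit together. It is not the library's implementation: the template text, the function names (build_query, parse_sparql_json, count_missing) and the example IRIs are all assumptions made for illustration. Only the shape of the response follows the standard SPARQL 1.1 JSON results format.

```python
import json
from string import Template

# Hypothetical template; the library's own query templates may look different.
SINGLE_COLUMN_TEMPLATE = Template(
    "SELECT ?patient ?value WHERE {\n"
    "  OPTIONAL { ?patient <$predicate> ?value . }\n"
    "}"
)


def build_query(predicate_iri: str) -> str:
    """Fill the template with the predicate IRI of the variable of interest."""
    # Assumption: the IRI comes from a trusted source; no validation is done here.
    return SINGLE_COLUMN_TEMPLATE.substitute(predicate=predicate_iri)


def parse_sparql_json(payload: str) -> list[dict]:
    """Flatten a SPARQL 1.1 JSON result document into a list of row dictionaries."""
    document = json.loads(payload)
    columns = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # Unbound variables are absent from a binding; represent them as None.
        rows.append({col: binding.get(col, {}).get("value") for col in columns})
    return rows


def count_missing(rows: list[dict], column: str) -> int:
    """Count the rows in which the given column has no value."""
    return sum(1 for row in rows if row[column] is None)


# Hand-written example response, standing in for what an endpoint would return.
example_response = json.dumps(
    {
        "head": {"vars": ["patient", "value"]},
        "results": {
            "bindings": [
                {
                    "patient": {"type": "uri", "value": "http://example.org/p1"},
                    "value": {"type": "literal", "value": "27"},
                },
                {"patient": {"type": "uri", "value": "http://example.org/p2"}},
            ]
        },
    }
)

rows = parse_sparql_json(example_response)
print(build_query("http://example.org/hasAge"))
print(rows, count_missing(rows, "value"))
```

Keeping the template, the parsing and the data checks in separate functions mirrors the modular layout described above, so each part can be tested without a running endpoint.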

Usage

The library provides functions that can be included in a Vantage6 algorithm as the algorithm developer sees fit. The functions are designed to be modular and can be used independently or in combination with other functions.

The library can be included in your Vantage6 algorithm by listing it in the algorithm's requirements.txt and setup.py files.

Including the library in your Vantage6 algorithm

For the requirements.txt file, you can add the following line to the file:

git+https://github.com/STRONGAYA/v6-tools-rdf.git@v1.0.1

For the setup.py file, you can add the following line to the install_requires list:

        "vantage6-strongaya-rdf @ git+https://github.com/STRONGAYA/v6-tools-rdf.git@v1.0.1",

The algorithm's setup.py file, particularly the install_requires section, should then look something like this:

from os import path
from setuptools import setup, find_packages

# We are using a README.md, if you do not have this in your folder, simply replace this with a string.
here = path.abspath(path.dirname(__file__))
with open(path.join(here, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='v6-not-an-actual-algorithm',
    version="1.0.1",
    description='Fictive Vantage6 algorithm that performs general statistics computation.',
    long_description=long_description,
    long_description_content_type='text/markdown',
    url='https://github.com/STRONGAYA/v6-not-an-actual-algorithm',
    packages=find_packages(),
    python_requires='>=3.10',
    install_requires=[
        'vantage6-algorithm-tools',
        'numpy',
        'pandas',
        # Trailing comma keeps the next entry from being concatenated to this string
        "vantage6-strongaya-rdf @ git+https://github.com/STRONGAYA/v6-tools-rdf.git@v1.0.1",
        # other dependencies
    ]
)

Central (aggregating) example

The functions included in this library focus on extracting RDF data from a SPARQL endpoint, which is only reachable from a node. Using these functions in the central (aggregating) section of a Vantage6 algorithm is therefore not recommended.
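
For contrast, the central section typically only combines what the nodes send back. The sketch below illustrates that idea with plain Python and no Vantage6 client calls. The shape of the partial results (a sum and a count per variable) and the function name aggregate_means are assumptions for illustration, not something this library prescribes.

```python
def aggregate_means(partial_results: list[dict]) -> dict:
    """Combine per-node sums and counts into a pooled mean per variable."""
    totals: dict[str, dict[str, float]] = {}
    for node_result in partial_results:
        for variable, stats in node_result.items():
            entry = totals.setdefault(variable, {"sum": 0.0, "count": 0})
            entry["sum"] += stats["sum"]
            entry["count"] += stats["count"]

    pooled = {}
    for variable, entry in totals.items():
        # Avoid dividing by zero when no node reported any observations.
        mean = entry["sum"] / entry["count"] if entry["count"] else None
        pooled[variable] = {"count": entry["count"], "mean": mean}
    return pooled


# Hypothetical results as two nodes might return them; no RDF access is needed here.
node_a = {"age": {"sum": 120.0, "count": 4}}
node_b = {"age": {"sum": 90.0, "count": 2}}

print(aggregate_means([node_a, node_b]))
```

Because only aggregates reach this step, the central section needs no SPARQL endpoint at all.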

Node or local (participating) example

Example usage of the SPARQL data collection function in a node (participating) section of a Vantage6 algorithm:

# General federated algorithm functions
from vantage6_strongaya_general.miscellaneous import safe_log
# RDF data collection function from this library
from vantage6_strongaya_rdf.collect_sparql_data import collect_sparql_data


def partial_general_statistics(variables_to_analyse: dict) -> dict:
    """
    Execute the partial algorithm for some modelling using RDF data.

    Args:
        variables_to_analyse (dict): Variables to analyse.

    Returns:
        dict: A dictionary containing the computed general statistics.
    """
    safe_log("info", "Executing partial algorithm for some modelling using RDF data.")

    # Collect the requested variables from the node's SPARQL endpoint
    df = collect_sparql_data(
        variables_to_analyse,
        query_type="single_column",
        endpoint="http://localhost:7200/repositories/userRepo",
    )

    # Ensure that the desired privacy measures are applied

    # Do some modelling of the data
    result = {}  # Placeholder: fill with the statistics computed from df

    return result

The various functions are available through pip install for debugging and testing purposes. The library can be installed as follows:

pip install git+https://github.com/STRONGAYA/v6-tools-rdf.git

Testing

This repository includes a comprehensive testing framework to ensure the reliability and correctness of all functions. In particular, it checks whether RDF data can be queried when the library runs as a Docker container within a Vantage6 node.

Test Structure

tests/
├── conftest.py                           # Common fixtures and test utilities
├── unit/                                 # Unit tests for individual functions
│   └── test_library_functions.py         # Tests for library functions
├── integration/                          # Integration tests
│   ├── test_vantage6_integration.py      # Data stratification workflows
│   └── test_rdf_algorithm_integration.py # Vantage6 algorithm integration tests
├── mock_algorithm/                       # Mock Vantage6 algorithm to be used for Vantage6 integration testing
│   └── ...
└── data/                                 # Test data and configurations
    ├── additional_vantage6_*_config.yaml # Additional Vantage6 component configurations
    ├── *.ttl                             # Triplified datasets for testing
    └── rdf_store.csv                     # RDF-store reference for the Vantage6 node

Running Tests

Prerequisites

Install test dependencies:

pip install pytest pytest-mock hypothesis faker

Basic Test Execution

# Run all tests
pytest

# Run unit tests only
pytest tests/unit/

# Run integration tests only
pytest tests/integration/

# Run specific test module
pytest tests/unit/test_library_functions.py

# Run with verbose output
pytest -v

Test Categories

  • Unit Tests: Test individual functions in isolation
  • Integration Tests: Test complete workflows and component interactions (whether data can be queried from the RDF-store in a Vantage6 node)
  • Edge Case Tests: Test behaviour with unusual data inputs

Test Data

The test suite uses a synthetic dataset that was triplified using the Triplifier tool.

Continuous Integration

Tests run automatically on every push and pull request via GitHub Actions:

  • Multiple Python and Vantage6 versions (starting with Python 3.10 and Vantage6 4.11 and 4.12)
  • Code coverage reporting
  • Performance benchmarking
  • Security scanning

Contributing to Tests

When contributing new functionality:

  1. Add unit tests for all new functions
  2. Add integration tests for complete workflows
  3. Include edge case testing for robustness
  4. Ensure new query templates have corresponding tests
  5. Update test data if needed for new scenarios, and ensure that any new data is triplified
  6. Ensure that the mock algorithm in tests/mock_algorithm covers the new functionality

Test Guidelines

  • Use descriptive test names that explain what is being tested
  • Include both positive and negative test cases and scenarios
  • Test edge cases and error conditions
  • Use realistic synthetic data
  • Validate both structure and values of results
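
As an illustration of these guidelines, the sketch below pairs a positive and a negative test case for a hypothetical helper, collect_ages, which is not part of this library. The endpoint call is replaced by a mock so the tests run without an RDF store. Only the standard library is used here; under pytest, the negative case would usually be written with pytest.raises.

```python
from unittest.mock import Mock


def collect_ages(run_query) -> list[int]:
    """Hypothetical function under test: turns query rows into integer ages."""
    rows = run_query("SELECT ?age WHERE { ?s ?p ?age }")
    return [int(row["age"]) for row in rows]


def test_collect_ages_returns_integer_list_for_valid_rows():
    # Positive case: realistic synthetic rows, checking structure and values.
    run_query = Mock(return_value=[{"age": "21"}, {"age": "34"}])
    result = collect_ages(run_query)
    assert isinstance(result, list)
    assert result == [21, 34]
    run_query.assert_called_once()


def test_collect_ages_raises_on_non_numeric_value():
    # Negative case: a non-numeric literal should surface as a ValueError.
    run_query = Mock(return_value=[{"age": "unknown"}])
    try:
        collect_ages(run_query)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError for a non-numeric age")


if __name__ == "__main__":
    test_collect_ages_returns_integer_list_for_valid_rows()
    test_collect_ages_raises_on_non_numeric_value()
```

Passing the query function in as an argument is a design choice made for this sketch: it keeps the function testable without patching module internals.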

Contributors

  • J. Hogenboom
  • V. Gouthamchand
