This project demonstrates Federated Learning (FL) with k-Means clustering using Scikit-learn. The aggregation follows the MiniBatch k-Means approach, where clients perform local training, and a central server aggregates their results to update global cluster centers.
Each FL round consists of:
- Local Training: Clients initialize with global centers and train MiniBatchKMeans on local data.
- Global Aggregation: The server collects cluster centers and counts from all clients, updates the global model, and redistributes it for the next round.
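Before the first round, the global centers are initialized as follows:
- Clients use k-means++ to choose their initial cluster centers.
- The server aggregates these initial centers by running one round of k-means over them to determine the global starting point.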
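The aggregation step of a round can be sketched as a count-weighted running mean, in the spirit of the MiniBatch k-Means update. This is a minimal sketch, not the project's actual aggregator: the function name `aggregate_centers`, the plain-list data layout, and the numbers are all assumptions made for illustration. It also assumes that cluster indices are aligned across clients, which is plausible because every client starts the round from the same global centers. The first-round k-means over client centers is not shown.

```python
from typing import List, Sequence, Tuple

Center = List[float]


def aggregate_centers(
    global_centers: Sequence[Sequence[float]],
    global_counts: Sequence[float],
    client_updates: Sequence[Tuple[Sequence[Sequence[float]], Sequence[float]]],
) -> Tuple[List[Center], List[float]]:
    """Count-weighted update of global centers from client (centers, counts) pairs.

    Hypothetical helper: each client update is (centers, counts), where
    counts[k] is the number of local points assigned to cluster k.
    """
    new_centers: List[Center] = []
    new_counts: List[float] = []
    for k, (center, count) in enumerate(zip(global_centers, global_counts)):
        # Rescale the current global center back to a sum over the points seen so far.
        weighted_sum = [value * count for value in center]
        total = count
        for client_centers, client_counts in client_updates:
            n_k = client_counts[k]
            # Add this client's contribution, weighted by its point count.
            weighted_sum = [s + c * n_k for s, c in zip(weighted_sum, client_centers[k])]
            total += n_k
        if total > 0:
            new_centers.append([s / total for s in weighted_sum])
        else:
            # No points were ever assigned to this cluster: keep the old center.
            new_centers.append(list(center))
        new_counts.append(total)
    return new_centers, new_counts


# Two clients, two clusters, 1-D data (made-up numbers for illustration).
centers, counts = aggregate_centers(
    global_centers=[[0.0], [10.0]],
    global_counts=[0, 0],
    client_updates=[
        ([[1.0], [9.0]], [10, 30]),
        ([[3.0], [11.0]], [30, 10]),
    ],
)
print(centers, counts)  # [[2.5], [9.5]] [40, 40]
```

Because the global counts are carried over between rounds, later client updates move the centers less as more points accumulate, which is the behavior expected from a mini-batch style update.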
DfAnalyzer is a library designed to capture provenance data, which includes:
- Prospective Provenance: The "recipe" of the trial, capturing configurations before execution.
- Retrospective Provenance: Logs and tracks execution results, including data transformations and FL training steps.
- Pull the DfAnalyzer Docker image:
docker pull nymeria0042/dfanalyzer
- Deploy the DfAnalyzer container:
cd dfanalyzer && docker compose up dfanalyzer
- Ensure DfAnalyzer is running in the background before starting trials.
- Create and activate a virtualenv with Python 3.8, then install the dependencies:
virtualenv venv --python=3.8
. venv/bin/activate
pip install -r requirements.txt
- Install dfa-lib-python:
cd dfanalyzer/dfa-lib/python && make install
- Run the prospective provenance script, which captures and records metadata about the design and configuration of the trials before they run:
python fed-clustering/utils/prospective_provenance.py
NVFlare is used to set up the federated learning infrastructure.
- Build the NVFlare image:
docker build -t nvflare-service .
- From the fed-clustering folder, run:
source start_trial.sh
- If versioning_control is enabled in utils/start_trial.py, a new branch is created under the trials/ folder, named with the user and the trial’s start timestamp. A hash is generated and stored in trial_info.json, and a commit containing this hash is created. The hash can be used to query the provenance database for records related exclusively to that trial.
- If versioning_control is disabled, the trial runs in the current folder and branch without creating a new branch or commit. A hash is still generated and stored in trial_info.json, so the trial can still be tracked and queried in the provenance database.
- Remember to reactivate the virtual environment.
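To illustrate how a trial hash can tie a run to its provenance records, the sketch below writes a hash to trial_info.json and reads it back. Everything in it is an assumption: the helper names, the JSON keys (`user`, `started_at`, `trial_hash`), and the choice of SHA-256 over the user and start timestamp are hypothetical and may differ from what utils/start_trial.py actually does.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_trial_info(folder: Path, user: str, started_at: datetime) -> str:
    """Write a hypothetical trial_info.json and return the trial hash."""
    stamp = started_at.strftime("%Y%m%dT%H%M%S")
    # Assumption: the hash is derived from the user and the start timestamp.
    trial_hash = hashlib.sha256(f"{user}-{stamp}".encode("utf-8")).hexdigest()
    info = {"user": user, "started_at": stamp, "trial_hash": trial_hash}
    (folder / "trial_info.json").write_text(json.dumps(info, indent=2))
    return trial_hash


def read_trial_hash(folder: Path) -> str:
    """Read the hash back, e.g. to filter provenance queries by trial."""
    return json.loads((folder / "trial_info.json").read_text())["trial_hash"]


# Demo in a throwaway folder so no real trial data is touched.
with tempfile.TemporaryDirectory() as tmp:
    trial_folder = Path(tmp)
    created = write_trial_info(
        trial_folder, "alice", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    )
    loaded = read_trial_hash(trial_folder)

print(created == loaded)
```

Storing the hash in a file that is also committed means the same identifier is available both from the Git history and from the working folder, whichever route is used to look up a trial.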
Prepare the data and the job configuration:
source prepare_data.sh
source prepare_job_config.sh
- Run:
nvflare provision
This creates a workspace/fed_clustering directory with the following structure:
workspace
└── fed_clustering
    ├── prod_01
    │   ├── admin@nvidia.com
    │   ├── server1
    │   ├── site-1
    │   ├── site-2
    │   ├── overseer
    │   └── compose.yaml
    └── resources
- Manually copy workspace/fed_clustering/prod_00/.env and workspace/fed_clustering/prod_00/compose.yaml to the new workspace/fed_clustering/prod_01/ folder.
- Navigate to prod_01 and launch the FL components:
docker compose up
- Manually copy jobs/sklearn_2_uniform to workspace/fed_clustering/prod_01/admin@nvidia.com/transfer/.
- Create dataset folders inside the containers:
docker exec -it site-1 mkdir -p /tmp/nvflare/dataset
- Copy data to each client:
docker cp /tmp/nvflare/dataset/des.csv site-1:/tmp/nvflare/dataset
- Repeat both steps for all sites.
- Get the local IP address:
hostname -I | awk '{print $1}'
- Add it to /etc/hosts. Open the file:
sudo vim /etc/hosts
and add the following line, replacing {IP} with the address from the previous step:
{IP} server1 overseer
Inside prod_01, start the FL admin panel:
./admin@nvidia.com/startup/fl_admin.sh
Log in with:
admin@nvidia.com
Check the status of the server or the clients:
check_status [server|client]
Submit the job:
submit_job sklearn_kmeans_2_uniform
To query the provenance database, open a MonetDB client session in the DfAnalyzer container:
docker exec -it dfanalyzer mclient -u monetdb -d dataflow_analyzer
The default password is monetdb.
Queries can then be submitted, for example:
SELECT client_id, silhouette_score FROM iClientValidation WHERE trial_id = {hash_trial};
To export the results, run:
docker exec -it dfanalyzer mclient -u monetdb -d dataflow_analyzer -i save_results.sql
This creates a results folder inside the dfanalyzer directory with .csv files for each predefined provenance table.
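Once the results have been exported, a quick sanity check is to count the rows in each .csv file. The sketch below is one way to do that with the standard library. The helper name `count_rows`, the comma delimiter, the absence of a header row, and the demo file name and contents are assumptions, since the exact layout written by save_results.sql is not described here.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict


def count_rows(results_dir: Path, delimiter: str = ",") -> Dict[str, int]:
    """Map each exported table name (file stem) to its number of non-empty rows."""
    summary: Dict[str, int] = {}
    for path in sorted(results_dir.glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle, delimiter=delimiter)
            # Blank lines come back as empty lists and are skipped.
            summary[path.stem] = sum(1 for row in reader if row)
    return summary


# Demo on a throwaway folder with made-up content standing in for the results folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "iClientValidation.csv").write_text("1,0.52\n2,0.48\n")
    summary = count_rows(demo_dir)

print(summary)  # {'iClientValidation': 2}
```

In practice the function would be pointed at the results folder inside the dfanalyzer directory; if the export uses another delimiter or includes a header line, the `delimiter` argument and the counts would need adjusting.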
This project demonstrates federated k-Means clustering using NVFlare, Scikit-learn, and DfAnalyzer. Provenance data is captured throughout, ensuring transparency and reproducibility of FL trials.
This project builds upon NVIDIA FLARE.