This repository contains the datasets and scripts used in the following paper:
Y. Tabatabaee, E. Wedell, M. Park, T. Warnow (2025). FastEnsemble: Scalable ensemble clustering on large networks. PLOS Complex Systems 2(10): e0000069 [preliminary version appeared at International Conference on Complex Networks and their Applications (CNA) 2024] DOI: 10.1371/journal.pcsy.0000069
For the experiments in this study, we generated a collection of artificial networks, such as Ring-of-Cliques networks, Erdos-Renyi (ER) graphs, and combinations of ER graphs with LFR graphs and Ring-of-Cliques networks. All of these datasets were generated using NetworkX. Additionally, we used a collection of 27 synthetic LFR graphs from Park et al. (2024) that were generated based on the properties of a collection of real networks and their Leiden clusterings at different resolutions. These datasets are available at the Illinois Data Bank.
This repository includes the new datasets generated for this study, as well as the output of different clustering methods in each experiment.
Each directory includes the datasets and results for one set of networks used in this study. Each subdirectory is a model condition for that dataset. Below is a description of each directory:
- lfr_training/: LFR algorithm design datasets with mixing parameters varying between `0.1` and `0.9`. The `default` model condition has 10,000 nodes and an average degree of 10. The model conditions named `d_[AVG-DEG]` have 10,000 nodes with an average degree of `[AVG-DEG]` (5 or 20), and the model conditions named `n_[NUM-NODES]` have an average degree of 10 with `[NUM-NODES]` nodes (1000 or 100,000).
- erdos_renyi/: Erdos-Renyi networks with 1000 nodes and density (`p`) varying between `0.001` and `0.1`.
- erdos_renyi_lfr/: Erdos-Renyi network of size 1000 with density (`p`) varying between `0.001` and `0.1`, connected to an LFR graph of size 1000.
- erdos_renyi_ring/: Erdos-Renyi network of size 1000 with density (`p`) varying between `0.001` and `0.1`, connected to a Ring-of-Cliques network with 100 cliques of size 10.
- tandon_et_al/: Reproduction of the 10,000-node LFR datasets from Tandon et al. (2019).
- tree_mod/: Tree-of-Cliques networks with number of nodes (`n`) varying between 90 and 5000 and cliques of size 10, used in the modularity experiments.
- ring_mod/: Ring-of-Cliques networks with number of nodes (`n`) varying between 90 and 10,000 and cliques of size 10, used in the modularity experiments.
- ring_cpm/ and ring_cpm_res/: Ring-of-Cliques networks with number of nodes (`n`) varying between 90 and 10,000 and cliques of size 10, used in the CPM experiments.
- park_et_al_CM/: 27 synthetic LFR graphs from Park et al. (2024) that were generated based on the properties of a collection of real networks and their Leiden clusterings at different resolutions. The subdirectories are named `[network-name]_[resolution]_lfr`, where the resolution field is either `modularity` or a value between `0.5` and `0.0001` (for Leiden-CPM).
- real_networks/: Real networks, including a network of Amazon products (amazon/) and the DBLP collaboration network (DBLP/).
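
The naming conventions above can be decoded programmatically when iterating over model conditions. The following is a minimal sketch, not part of the repository's scripts: `parse_condition` is a hypothetical helper, and it assumes that the numeric fields are written as plain digits (for example `d_5` or `n_1000`) and that the resolution field contains no underscore. The actual directory names should be checked before relying on it.

```python
import re


def parse_condition(name: str) -> dict:
    """Decode a model-condition directory name (hypothetical helper).

    Assumes the patterns `default`, `d_[AVG-DEG]`, `n_[NUM-NODES]`,
    and `[network-name]_[resolution]_lfr` described in this README.
    """
    if name == "default":
        # Default lfr_training condition: 10,000 nodes, average degree 10.
        return {"kind": "default", "nodes": 10_000, "avg_degree": 10}

    match = re.fullmatch(r"d_(\d+)", name)
    if match:
        # Average degree varies; node count stays at 10,000.
        return {"kind": "avg_degree", "nodes": 10_000,
                "avg_degree": int(match.group(1))}

    match = re.fullmatch(r"n_(\d+)", name)
    if match:
        # Node count varies; average degree stays at 10.
        return {"kind": "num_nodes", "nodes": int(match.group(1)),
                "avg_degree": 10}

    match = re.fullmatch(r"(.+)_([^_]+)_lfr", name)
    if match:
        # The network name may itself contain underscores; the resolution
        # is kept as a string ("modularity" or a CPM resolution value).
        network, resolution = match.groups()
        return {"kind": "park_et_al", "network": network,
                "resolution": resolution}

    raise ValueError(f"unrecognized model condition: {name!r}")
```

For instance, a made-up name such as `some_net_0.01_lfr` would be split into the network name `some_net` and the resolution `0.01`.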
Below is the description of files in each directory:
- network.dat: network edge-list
- community.dat: ground-truth community structure in the form of `node:membership`
- ecg.dat: result of clustering with ECG
- mu_dist.csv: distribution of mixing parameter values (mu) for the network/ground-truth communities
- leiden_mod.dat: result of clustering with Leiden-modularity
- fec_leiden_[PARAM].dat: result of clustering with FastEnsemble with Leiden-modularity or Leiden-CPM
- fec_nw_leiden_[PARAM].dat: result of clustering with FastEnsemble with Leiden-modularity or Leiden-CPM without weighting
- original_fastconsensus_louvain.clustering: result of clustering with FastConsensus
- strict_np[NUM-E]_leiden_mod.dat: result of clustering with strict consensus with [NUM-E] ensembles
- louvain.dat: result of clustering with the Louvain algorithm
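
As an illustration of how these files fit together, the sketch below reads an edge list and a membership file and computes a per-node mixing parameter, taken here as the fraction of a node's edges that leave its community (the usual LFR definition). This is not the script used in the paper. The function names are hypothetical, and the parsers assume one record per line with fields separated by whitespace, a colon, or a comma; the exact delimiters and the layout of mu_dist.csv should be verified against the actual files.

```python
import re
from collections import defaultdict


def read_edges(path):
    """Read an edge list, assuming two whitespace-separated node IDs per line."""
    edges = []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) >= 2:
                edges.append((parts[0], parts[1]))
    return edges


def read_membership(path):
    """Read node-to-community assignments, one `node membership` pair per line.

    The separator is assumed to be whitespace, a colon, or a comma.
    """
    membership = {}
    with open(path) as fh:
        for line in fh:
            parts = re.split(r"[\s:,]+", line.strip())
            if len(parts) >= 2:
                membership[parts[0]] = parts[1]
    return membership


def mixing_parameters(edges, membership):
    """Return, for each node, the fraction of its edges that cross communities.

    Listing every edge in both directions leaves the ratios unchanged,
    because numerator and denominator are doubled together.
    """
    degree = defaultdict(int)
    external = defaultdict(int)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        degree[u] += 1
        degree[v] += 1
        # An edge is internal only if both endpoints are assigned to the
        # same community; unassigned nodes are counted as external.
        internal = (u in membership and v in membership
                    and membership[u] == membership[v])
        if not internal:
            external[u] += 1
            external[v] += 1
    return {node: external[node] / degree[node] for node in degree}


# Example usage (paths are placeholders):
#   edges = read_edges("path/to/condition/network.dat")
#   truth = read_membership("path/to/condition/community.dat")
#   mu = mixing_parameters(edges, truth)
```

The same `read_membership` function could be tried on the clustering outputs (for example leiden_mod.dat), provided they follow the same two-column layout, which is an assumption here.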