71 changes: 32 additions & 39 deletions README.md
# distribution-optimization-py: Optimization-Based GMM Parameter Estimation

A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Here, GMM fitting is rephrased as an optimization problem and solved using a Hierarchic Memetic Strategy (HMS). This approach diverges from the traditional maximum-likelihood Expectation-Maximization (EM) approach: instead, it introduces a fitness function that combines the chi-square test, commonly used for comparing distributions, with a metric designed to estimate the overlapping area under the curves of the individual Gaussian components. This library offers a Python implementation of [DistributionOptimization](https://cran.r-project.org/web/packages/DistributionOptimization/index.html), with significant improvements.
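To make the fitness idea concrete, here is a minimal, illustrative sketch of the chi-square part of such a fitness function. The binning scheme, the helper names, and the signature are assumptions chosen for illustration; they are not the library's actual API:

```python
import math
import numpy as np

def norm_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def chi_square_statistic(data, weights, means, sigmas, n_bins=10):
    """Chi-square distance between the histogram of `data` and the
    expected bin counts under a candidate mixture (illustrative only;
    the library uses its own binning rules)."""
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    observed, _ = np.histogram(data, bins=edges)
    expected = np.zeros(n_bins)
    for w, mu, s in zip(weights, means, sigmas):
        # Probability mass each component places in every bin.
        cdf = np.array([norm_cdf(e, mu, s) for e in edges])
        expected += w * (cdf[1:] - cdf[:-1])
    expected *= len(data)
    expected = np.maximum(expected, 1e-12)  # guard against empty bins
    return float(np.sum((observed - expected) ** 2 / expected))
```

A candidate whose parameters match the data yields a small statistic, so an optimizer minimizing this quantity is driven toward well-fitting mixtures.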
## Installation

### Quick start

Clone the repository and install the dependencies with Poetry:
```bash
git clone https://github.com/agh-a2s/distribution-optimization.git
cd distribution-optimization
poetry install
```

Alternatively, install the package directly with pip:
```bash
pip install git+https://github.com/WojtAcht/distribution-optimization-py
```

### Experiment results

The results of the experiments can be found on the `ela-analysis` branch.

Create a synthetic dataset:

```python
import numpy as np

random_state = np.random.RandomState(seed=1)
textbook_data = np.concatenate(
    [
        random_state.normal(-1, 1.5, 350),
        random_state.normal(0, 1, 500),
        random_state.normal(3, 0.5, 150),
    ]
)
```
Fit a GMM using HMS and plot the results:

```python
from distribution_optimization_py.gaussian_mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, algorithm="HMS", random_state=42)
gmm.fit(textbook_data)
probabilities = gmm.predict_proba(textbook_data)
gmm.plot()
```

![GMM Plot](images/plot.png)

### Supported optimization algorithms

| Algorithm | Description                                     | Library         |
|-----------|-------------------------------------------------|-----------------|
| "HMS"     | Hierarchic Memetic Strategy                     | `pyhms`         |
| "CMA-ES"  | Covariance Matrix Adaptation Evolution Strategy | `pycma`         |
| "GA"      | DistributionOptimization evolutionary algorithm | `leap_ec`       |
| "DE"      | Differential Evolution (iL-SHADE)               | `pyade.ilshade` |

## Reproducing Experiments and Plots

### Generating Datasets

Synthetic datasets are available in the `results` directory. Their generation can be reproduced by running `poetry run python3 -m distribution_optimization_py.experiment`.

### Benchmarking Optimization Algorithms

To benchmark the optimization algorithms on GMM parameter estimation, run `poetry run python3 -m distribution_optimization_py.solver`.

### Reproducing Plots from the Paper

Fig. 1:
```bash
poetry run python3 -m distribution_optimization_py.plot_binning_scheme_difference
```
Fig. 2:
```bash
poetry run python3 -m distribution_optimization_py.experiment.plot
```
Fig. 3:
```bash
poetry run python3 -m distribution_optimization_py.solver.plot
```
Fig. 4:
```bash
poetry run python3 -m distribution_optimization_py.compare_results
```

All these scripts write their plots to the `images` directory.

### Relevant literature
- Lerch, F., Ultsch, A. & Lötsch, J. Distribution Optimization: An evolutionary algorithm to separate Gaussian mixtures. Sci Rep 10, 648 (2020). doi: [10.1038/s41598-020-57432-w](https://doi.org/10.1038/s41598-020-57432-w)
- Sawicki, J., Łoś, M., Smołka, M. & Alvarez-Aramberri, J. Using Covariance Matrix Adaptation Evolutionary Strategy to boost the search accuracy in hierarchic memetic computations. Journal of Computational Science 34, 48-54 (2019). doi: [10.1016/j.jocs.2019.04.005](https://doi.org/10.1016/j.jocs.2019.04.005)
155 changes: 155 additions & 0 deletions distribution_optimization_py/compare_results.py
import random

import numpy as np
import plotly.graph_objects as go
import plotly.io as pio
from distribution_optimization_py.datasets import DATASETS
from distribution_optimization_py.gaussian_mixture import GaussianMixture
from distribution_optimization_py.problem import GaussianMixtureProblem
from distribution_optimization_py.solver import GASolver, iLSHADESolver
from plotly.subplots import make_subplots

pio.kaleido.scope.mathjax = None

np.random.seed(42)
random.seed(42)

DATASET_NAME_TO_TITLE = {
    "truck_driving_data": "Truck Driving",
    "mixture3": "Mixture 3",
    "textbook_data": "Textbook Data",
    "iris_ica": "Iris ICA",
    "chromatogram_time": "Chromatogram Time",
    "atmosphere_data": "Atmosphere Data",
}

if __name__ == "__main__":
    datasets = [DATASETS[0], DATASETS[3], DATASETS[4]]
    fig = make_subplots(
        rows=3,
        cols=1,
        subplot_titles=[DATASET_NAME_TO_TITLE.get(d.name) for d in datasets],
        vertical_spacing=0.08,
    )

    for subplot_idx, dataset in enumerate(datasets, 1):
        data = dataset.data
        nr_of_modes = dataset.nr_of_modes
        # Baseline configuration: equal-width bins, Keating rule for the bin count.
        old_problem = GaussianMixtureProblem(
            data=data,
            nr_of_modes=nr_of_modes,
            bin_type="equal_width",
            bin_number_method="keating",
        )
        # New configuration: equal-probability bins, Mann-Wald rule.
        new_problem = GaussianMixtureProblem(
            data=data,
            nr_of_modes=nr_of_modes,
            bin_type="equal_probability",
            bin_number_method="mann_wald",
        )
        old_solver = GASolver()
        new_solver = iLSHADESolver()
        old_solutions = []
        new_solutions = []

        # Ten independent runs per configuration, with paired random seeds.
        for i in range(10):
            random_state = 42 + i
            old_solution = old_solver(old_problem, max_n_evals=10000, random_state=random_state)
            new_solution = new_solver(new_problem, max_n_evals=10000, random_state=random_state)
            old_solutions.append(old_solution.genome)
            new_solutions.append(new_solution.genome)

        all_solutions = old_solutions + new_solutions
        all_labels = [f"Keating-{i+1}" for i in range(10)] + [f"Mann-Wald-{i+1}" for i in range(10)]
        bins = 75

        gmms = [
            GaussianMixture(n_components=nr_of_modes, random_state=1).set_params(data, solution)
            for solution in all_solutions
        ]

        x = np.linspace(data.min(), data.max(), 1000)
        pdfs = [gmm.score_samples(x) for gmm in gmms]
        hist_data = np.histogram(data, bins=bins, density=True)

        data_color = "rgba(31, 119, 180, 0.3)"
        keating_color = "#ff7f0e"
        mann_wald_color = "#2ca02c"

        fig.add_trace(
            go.Bar(
                x=hist_data[1][:-1],
                y=hist_data[0],
                opacity=0.7,
                name="Empirical Data",
                marker_color=data_color,
                marker_line_width=0.2,
                legendgroup="data",
                showlegend=(subplot_idx == 1),
            ),
            row=subplot_idx,
            col=1,
        )

        for i, (pdf, label) in enumerate(zip(pdfs, all_labels)):
            dash_pattern = "dashdot"
            color = keating_color if "Keating" in label else mann_wald_color

            # Show each legend entry once, on the first subplot only.
            show_legend = subplot_idx == 1 and (
                (i == 0 and "Keating" in label) or (i == 10 and "Mann-Wald" in label)
            )
            legend_name = "GA + Keating" if "Keating" in label else "iLSHADE + Mann-Wald"

            fig.add_trace(
                go.Scatter(
                    x=x.flatten(),
                    y=pdf,
                    mode="lines",
                    name=legend_name if show_legend else None,
                    legendgroup=legend_name,
                    showlegend=show_legend,
                    line=dict(color=color, dash=dash_pattern, width=1.5),
                ),
                row=subplot_idx,
                col=1,
            )

    fig.update_layout(
        height=1200,
        width=1400,
        showlegend=True,
        template="plotly_white",
        font=dict(size=20),
        legend=dict(
            yanchor="top",
            y=0.99,
            xanchor="left",
            x=1.02,
            font=dict(size=18),
        ),
        margin=dict(l=60, r=100, t=60, b=60),
    )

    # Increase the font size of the subplot titles.
    for annotation in fig.layout.annotations:
        annotation.font.size = 22

    # Only the bottom subplot gets an x-axis title; only the middle one a y-axis title.
    for i in range(1, 4):
        fig.update_xaxes(title_text="X" if i == 3 else None, row=i, col=1)
        fig.update_yaxes(title_text="Density" if i == 2 else None, row=i, col=1)

    fig.write_image("./images/comparison_plot.eps", scale=2)
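The two binning schemes contrasted above (equal-width bins with the Keating rule vs. equal-probability bins with the Mann-Wald rule) differ in how the bin edges are placed. A minimal sketch of the two edge-placement strategies in plain NumPy (the bin-count rules themselves are not reproduced here, and these helper names are illustrative only):

```python
import numpy as np

def equal_width_edges(data, n_bins):
    # Bins of identical width spanning the data range.
    return np.linspace(data.min(), data.max(), n_bins + 1)

def equal_probability_edges(data, n_bins):
    # Edges at evenly spaced quantiles, so each bin holds
    # roughly the same number of points.
    return np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 10_000)
counts_ew, _ = np.histogram(data, bins=equal_width_edges(data, 10))
counts_ep, _ = np.histogram(data, bins=equal_probability_edges(data, 10))
# Equal-probability bins give nearly uniform counts, which keeps the
# expected count per bin well away from zero in the chi-square statistic;
# equal-width bins leave the tail bins almost empty.
```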
26 changes: 26 additions & 0 deletions distribution_optimization_py/experiment/__main__.py
from multiprocessing import Pool

from .method import run_experiment_for_nr_of_components

if __name__ == "__main__":
    components_to_test = list(range(2, 11))
    components_to_nr_of_datasets = {i: 100 for i in components_to_test}
    min_overlap_value = 0.0
    max_overlap_value = 0.49
    results_dir_name = "results"
    # One argument tuple per component count to test.
    arg_list = [
        (
            nr_of_components,
            components_to_nr_of_datasets[nr_of_components],
            min_overlap_value,
            max_overlap_value,
            results_dir_name,
        )
        for nr_of_components in components_to_test
    ]

    # Run the experiments in parallel across 10 worker processes.
    with Pool(10) as p:
        p.starmap(
            run_experiment_for_nr_of_components,
            arg_list,
        )