71 changes: 32 additions & 39 deletions README.md
# distribution-optimization-py: Optimization-Based GMM Parameter Estimation

A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Here, GMM fitting is rephrased as an optimization problem and solved using a Hierarchic Memetic Strategy (HMS). This approach diverges from the traditional maximum-likelihood Expectation-Maximization (EM) approach: instead, it introduces a fitness function that combines the chi-square test, commonly used for comparing distributions, with a metric designed to estimate the overlapping area under the curves of the individual Gaussian components. This library offers a Python implementation of [DistributionOptimization](https://cran.r-project.org/web/packages/DistributionOptimization/index.html), with significant improvements.
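To make the fitness idea concrete, here is a minimal, illustrative sketch of the chi-square part of such a fitness function. The binning scheme, the helper names, and the signature are assumptions chosen for illustration; they are not the library's actual API:

```python
import math
import numpy as np

def norm_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def chi_square_statistic(data, weights, means, sigmas, n_bins=10):
    """Chi-square distance between the histogram of `data` and the
    expected bin counts under a candidate mixture (illustrative only;
    the library uses its own binning rules)."""
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    observed, _ = np.histogram(data, bins=edges)
    expected = np.zeros(n_bins)
    for w, mu, s in zip(weights, means, sigmas):
        # Probability mass each component places in every bin.
        cdf = np.array([norm_cdf(e, mu, s) for e in edges])
        expected += w * (cdf[1:] - cdf[:-1])
    expected *= len(data)
    expected = np.maximum(expected, 1e-12)  # guard against empty bins
    return float(np.sum((observed - expected) ** 2 / expected))
```

A candidate whose parameters match the data yields a small statistic, so an optimizer minimizing this quantity is driven toward well-fitting mixtures.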
## Installation

### Quick start

Clone the repository and install the dependencies with Poetry:
```bash
git clone https://github.com/agh-a2s/distribution-optimization.git
cd distribution-optimization
poetry install
```

Alternatively, install the package directly with pip:
```bash
pip install git+https://github.com/WojtAcht/distribution-optimization-py
```

### Experiment results

The results of the experiments can be found on the `ela-analysis` branch.

Create a synthetic dataset:

```python
import numpy as np

random_state = np.random.RandomState(seed=1)
textbook_data = np.concatenate(
    [
        random_state.normal(-1, 1.5, 350),
        random_state.normal(0, 1, 500),
        random_state.normal(3, 0.5, 150),
    ]
)
```
Fit a GMM using HMS and plot the results:

```python
from distribution_optimization_py.gaussian_mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, algorithm="HMS", random_state=42)
gmm.fit(textbook_data)
probabilities = gmm.predict_proba(textbook_data)
gmm.plot()
```

![GMM Plot](images/plot.png)

### Supported optimization algorithms

| Algorithm | Description                                     | Library         |
|-----------|-------------------------------------------------|-----------------|
| "HMS"     | Hierarchic Memetic Strategy                     | `pyhms`         |
| "CMA-ES"  | Covariance Matrix Adaptation Evolution Strategy | `pycma`         |
| "GA"      | DistributionOptimization evolutionary algorithm | `leap_ec`       |
| "DE"      | Differential Evolution (iL-SHADE)               | `pyade.ilshade` |

## Reproducing Experiments and Plots

### Generating Datasets

Synthetic datasets are available in the `results` directory. Their generation can be reproduced by running `poetry run python3 -m distribution_optimization_py.experiment`.

### Benchmarking Optimization Algorithms

To benchmark the optimization algorithms on GMM parameter estimation, run `poetry run python3 -m distribution_optimization_py.solver`.

### Reproducing Plots from the Paper

Fig. 1:
```bash
poetry run python3 -m distribution_optimization_py.plot_binning_scheme_difference
```
Fig. 2:
```bash
poetry run python3 -m distribution_optimization_py.experiment.plot
```
Fig. 3:
```bash
poetry run python3 -m distribution_optimization_py.solver.plot
```
Fig. 4:
```bash
poetry run python3 -m distribution_optimization_py.compare_results
```

All these scripts write their plots to the `images` directory.

### Relevant literature
- Lerch, F., Ultsch, A. & Lötsch, J. Distribution Optimization: An evolutionary algorithm to separate Gaussian mixtures. Sci Rep 10, 648 (2020). doi: [10.1038/s41598-020-57432-w](https://doi.org/10.1038/s41598-020-57432-w)
- Sawicki, J., Łoś, M., Smołka, M. & Alvarez-Aramberri, J. Using Covariance Matrix Adaptation Evolutionary Strategy to boost the search accuracy in hierarchic memetic computations. Journal of Computational Science 34, 48-54 (2019). doi: [10.1016/j.jocs.2019.04.005](https://doi.org/10.1016/j.jocs.2019.04.005)
155 changes: 155 additions & 0 deletions distribution_optimization_py/compare_results.py
import random

import numpy as np
import plotly.graph_objects as go
import plotly.io as pio
from distribution_optimization_py.datasets import DATASETS
from distribution_optimization_py.gaussian_mixture import GaussianMixture
from distribution_optimization_py.problem import GaussianMixtureProblem
from distribution_optimization_py.solver import GASolver, iLSHADESolver
from plotly.subplots import make_subplots

pio.kaleido.scope.mathjax = None

np.random.seed(42)
random.seed(42)

DATASET_NAME_TO_TITLE = {
    "truck_driving_data": "Truck Driving",
    "mixture3": "Mixture 3",
    "textbook_data": "Textbook Data",
    "iris_ica": "Iris ICA",
    "chromatogram_time": "Chromatogram Time",
    "atmosphere_data": "Atmosphere Data",
}

if __name__ == "__main__":
    datasets = [DATASETS[0], DATASETS[3], DATASETS[4]]
    fig = make_subplots(
        rows=3,
        cols=1,
        subplot_titles=[DATASET_NAME_TO_TITLE.get(d.name) for d in datasets],
        vertical_spacing=0.08,
    )

    for subplot_idx, dataset in enumerate(datasets, 1):
        data = dataset.data
        nr_of_modes = dataset.nr_of_modes
        # Baseline configuration: equal-width bins, Keating rule for the bin count.
        old_problem = GaussianMixtureProblem(
            data=data,
            nr_of_modes=nr_of_modes,
            bin_type="equal_width",
            bin_number_method="keating",
        )
        # New configuration: equal-probability bins, Mann-Wald rule.
        new_problem = GaussianMixtureProblem(
            data=data,
            nr_of_modes=nr_of_modes,
            bin_type="equal_probability",
            bin_number_method="mann_wald",
        )
        old_solver = GASolver()
        new_solver = iLSHADESolver()
        old_solutions = []
        new_solutions = []

        # Ten independent runs per configuration, with paired random seeds.
        for i in range(10):
            random_state = 42 + i
            old_solution = old_solver(old_problem, max_n_evals=10000, random_state=random_state)
            new_solution = new_solver(new_problem, max_n_evals=10000, random_state=random_state)
            old_solutions.append(old_solution.genome)
            new_solutions.append(new_solution.genome)

        all_solutions = old_solutions + new_solutions
        all_labels = [f"Keating-{i+1}" for i in range(10)] + [f"Mann-Wald-{i+1}" for i in range(10)]
        bins = 75

        gmms = [
            GaussianMixture(n_components=nr_of_modes, random_state=1).set_params(data, solution)
            for solution in all_solutions
        ]

        x = np.linspace(data.min(), data.max(), 1000)
        pdfs = [gmm.score_samples(x) for gmm in gmms]
        hist_data = np.histogram(data, bins=bins, density=True)

        data_color = "rgba(31, 119, 180, 0.3)"
        keating_color = "#ff7f0e"
        mann_wald_color = "#2ca02c"

        fig.add_trace(
            go.Bar(
                x=hist_data[1][:-1],
                y=hist_data[0],
                opacity=0.7,
                name="Empirical Data",
                marker_color=data_color,
                marker_line_width=0.2,
                legendgroup="data",
                showlegend=(subplot_idx == 1),
            ),
            row=subplot_idx,
            col=1,
        )

        for i, (pdf, label) in enumerate(zip(pdfs, all_labels)):
            dash_pattern = "dashdot"
            color = keating_color if "Keating" in label else mann_wald_color

            # Show each legend entry once, on the first subplot only.
            show_legend = subplot_idx == 1 and (
                (i == 0 and "Keating" in label) or (i == 10 and "Mann-Wald" in label)
            )
            legend_name = "GA + Keating" if "Keating" in label else "iLSHADE + Mann-Wald"

            fig.add_trace(
                go.Scatter(
                    x=x.flatten(),
                    y=pdf,
                    mode="lines",
                    name=legend_name if show_legend else None,
                    legendgroup=legend_name,
                    showlegend=show_legend,
                    line=dict(color=color, dash=dash_pattern, width=1.5),
                ),
                row=subplot_idx,
                col=1,
            )

    fig.update_layout(
        height=1200,
        width=1400,
        showlegend=True,
        template="plotly_white",
        font=dict(size=20),
        legend=dict(
            yanchor="top",
            y=0.99,
            xanchor="left",
            x=1.02,
            font=dict(size=18),
        ),
        margin=dict(l=60, r=100, t=60, b=60),
    )

    # Increase the font size of the subplot titles.
    for annotation in fig.layout.annotations:
        annotation.font.size = 22

    # Only the bottom subplot gets an x-axis title; only the middle one a y-axis title.
    for i in range(1, 4):
        fig.update_xaxes(title_text="X" if i == 3 else None, row=i, col=1)
        fig.update_yaxes(title_text="Density" if i == 2 else None, row=i, col=1)

    fig.write_image("./images/comparison_plot.eps", scale=2)
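The two binning schemes contrasted above (equal-width bins with the Keating rule vs. equal-probability bins with the Mann-Wald rule) differ in how the bin edges are placed. A minimal sketch of the two edge-placement strategies in plain NumPy (the bin-count rules themselves are not reproduced here, and these helper names are illustrative only):

```python
import numpy as np

def equal_width_edges(data, n_bins):
    # Bins of identical width spanning the data range.
    return np.linspace(data.min(), data.max(), n_bins + 1)

def equal_probability_edges(data, n_bins):
    # Edges at evenly spaced quantiles, so each bin holds
    # roughly the same number of points.
    return np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 10_000)
counts_ew, _ = np.histogram(data, bins=equal_width_edges(data, 10))
counts_ep, _ = np.histogram(data, bins=equal_probability_edges(data, 10))
# Equal-probability bins give nearly uniform counts, which keeps the
# expected count per bin well away from zero in the chi-square statistic;
# equal-width bins leave the tail bins almost empty.
```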
26 changes: 26 additions & 0 deletions distribution_optimization_py/experiment/__main__.py
from multiprocessing import Pool

from .method import run_experiment_for_nr_of_components

if __name__ == "__main__":
    components_to_test = list(range(2, 11))
    components_to_nr_of_datasets = {i: 100 for i in components_to_test}
    min_overlap_value = 0.0
    max_overlap_value = 0.49
    results_dir_name = "results"
    # One argument tuple per component count to test.
    arg_list = [
        (
            nr_of_components,
            components_to_nr_of_datasets[nr_of_components],
            min_overlap_value,
            max_overlap_value,
            results_dir_name,
        )
        for nr_of_components in components_to_test
    ]

    # Run the experiments in parallel across 10 worker processes.
    with Pool(10) as p:
        p.starmap(
            run_experiment_for_nr_of_components,
            arg_list,
        )