Commit 153c2d9

Merge pull request #73 from cmccomb/codex/extend-pipeline-with-parameterized-preprocessing-steps
Add configurable preprocessing steps
2 parents 7351143 + 4e7812c commit 153c2d9

6 files changed: +1675 -16 lines changed

README.md

Lines changed: 69 additions & 0 deletions
@@ -94,6 +94,75 @@ Pipelines preserve the order of steps. Stateful steps such as PCA, SVD, or
 standardization automatically fit during training and reuse the same fitted
 state when you call `predict`.

+### Supported preprocessing steps
+
+The pipeline is intentionally modular so you can mix and match steps to
+mirror popular `AutoML` defaults:
+
+- **Scaling** – standard, min–max, or robust scaling via `ScaleParams`.
+- **Imputation** – mean, median, or most-frequent replacement with
+  `ImputeParams`.
+- **Categorical encoders** – ordinal or one-hot encoding with optional dummy
+  drop.
+- **Power transforms** – per-column log or Box-Cox transforms with automatic
+  shifting for strictly positive domains.
+- **Column filters** – select or exclude features using `ColumnSelector`
+  helpers.
+
+Each state stores the fitted statistics (e.g., medians, category mappings) so
+that the same transformation can be applied consistently during inference.
+
+### Example: AutoGluon-style defaults
+
+```rust, no_run
+use automl::settings::{
+    CategoricalEncoderParams, CategoricalEncoding, ColumnFilterParams, ColumnSelector,
+    ImputeParams, ImputeStrategy, PowerTransform, PowerTransformParams, PreprocessingPipeline,
+    PreprocessingStep, RobustScaleParams, ScaleParams, ScaleStrategy,
+};
+
+let pipeline = PreprocessingPipeline::new()
+    // Fill numeric columns with the median before scaling.
+    .add_step(PreprocessingStep::Impute(ImputeParams {
+        strategy: ImputeStrategy::Median,
+        selector: ColumnSelector::Include(vec![0, 1, 2]),
+    }))
+    // Apply a robust scaler to guard against outliers (similar to AutoGluon).
+    .add_step(PreprocessingStep::Scale(ScaleParams {
+        strategy: ScaleStrategy::Robust(RobustScaleParams::default()),
+        selector: ColumnSelector::Include(vec![0, 1, 2]),
+    }))
+    // One-hot encode categorical columns and drop the reference level.
+    .add_step(PreprocessingStep::EncodeCategorical(CategoricalEncoderParams {
+        selector: ColumnSelector::Include(vec![3, 4]),
+        encoding: CategoricalEncoding::one_hot(true),
+    }))
+    // Optionally keep only the engineered features.
+    .add_step(PreprocessingStep::FilterColumns(ColumnFilterParams {
+        selector: ColumnSelector::Include(vec![0, 1, 2, 5, 6]),
+        retain_selected: true,
+    }));
+```
+
+### Example: caret-style log + standardization recipe
+
+```rust, no_run
+use automl::settings::{
+    ColumnSelector, PowerTransform, PowerTransformParams, PreprocessingPipeline,
+    PreprocessingStep, ScaleParams, ScaleStrategy, StandardizeParams,
+};
+
+let caret_like = PreprocessingPipeline::new()
+    .add_step(PreprocessingStep::PowerTransform(PowerTransformParams {
+        selector: ColumnSelector::Include(vec![0]),
+        transform: PowerTransform::Log { offset: 0.0 },
+    }))
+    .add_step(PreprocessingStep::Scale(ScaleParams {
+        strategy: ScaleStrategy::Standard(StandardizeParams::default()),
+        selector: ColumnSelector::All,
+    }));
+```
+
 ## Features

This crate has several features that add some additional methods.
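
Reviewer note on the README hunk above: the added text says each step stores fitted statistics so the same transformation is reused at inference. The sketch below illustrates that fit-then-reuse idea for median imputation followed by robust scaling. It does not use the `automl` crate. The names `RobustState`, `quantile`, `fit`, and `transform` are hypothetical, and defining robust scaling as `(x - median) / IQR` with linearly interpolated quartiles is an assumption; the crate's actual behavior may differ.

```rust
/// Hypothetical fitted state for one column: median and interquartile range.
#[derive(Debug, Clone, PartialEq)]
struct RobustState {
    median: f64,
    iqr: f64,
}

/// Linearly interpolated quantile of a sorted, non-empty slice (0.0 <= q <= 1.0).
fn quantile(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Fit on training data. NaN marks a missing value and is skipped.
/// Assumes the column has at least one observed value.
fn fit(column: &[f64]) -> RobustState {
    let mut observed: Vec<f64> = column.iter().copied().filter(|v| !v.is_nan()).collect();
    observed.sort_by(|a, b| a.total_cmp(b));
    let iqr = quantile(&observed, 0.75) - quantile(&observed, 0.25);
    RobustState {
        median: quantile(&observed, 0.5),
        // Fall back to 1.0 so a constant column does not divide by zero.
        iqr: if iqr == 0.0 { 1.0 } else { iqr },
    }
}

/// Apply the fitted state: impute missing values with the training median,
/// then center and scale with the training median and IQR.
fn transform(state: &RobustState, column: &[f64]) -> Vec<f64> {
    column
        .iter()
        .map(|&v| {
            let filled = if v.is_nan() { state.median } else { v };
            (filled - state.median) / state.iqr
        })
        .collect()
}

fn main() {
    let train = [1.0, 2.0, f64::NAN, 4.0, 5.0];
    let state = fit(&train);

    // New data is transformed with the training statistics, not its own.
    let unseen = [f64::NAN, 100.0];
    let scaled = transform(&state, &unseen);
    println!("{state:?} -> {scaled:?}");
}
```

The point of keeping `fit` and `transform` separate is that the statistics come only from the training column, so an outlier seen at prediction time (here `100.0`) cannot shift the median or IQR used to scale it.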

src/model/mod.rs

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ pub mod classification;
 pub mod clustering;
 mod comparison;
 pub mod error;
-mod preprocessing;
+pub mod preprocessing;
 pub mod regression;
pub mod supervised;
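
Reviewer note on the one-line change above: switching `mod preprocessing;` to `pub mod preprocessing;` makes the module nameable from outside its parent module. A minimal, self-contained sketch of that visibility rule follows. The module and function names (`model`, `internal`, `helper`, `describe`, `uses_internal`) are hypothetical stand-ins, and whether the crate's own `model` module is reachable from the crate root is not shown in this diff.

```rust
mod model {
    // Private module: only code inside `model` can name it.
    mod internal {
        pub fn helper() -> &'static str {
            "internal"
        }
    }

    // Public module: reachable from outside as `model::preprocessing`.
    pub mod preprocessing {
        pub fn describe() -> &'static str {
            "preprocessing"
        }
    }

    // The private module can still be used indirectly through a public function.
    pub fn uses_internal() -> &'static str {
        internal::helper()
    }
}

fn main() {
    println!("{}", model::preprocessing::describe());
    println!("{}", model::uses_internal());
    // model::internal::helper(); // error[E0603]: module `internal` is private
}
```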
