844 changes: 844 additions & 0 deletions devel/html/cpo/index.html

Large diffs are not rendered by default.

Binary file added devel/html/fonts/FontAwesome.otf
Binary file not shown.
Binary file removed devel/html/fonts/fontawesome-webfont.eot
Binary file not shown.
3,320 changes: 2,668 additions & 652 deletions devel/html/fonts/fontawesome-webfont.svg
Binary file modified devel/html/fonts/fontawesome-webfont.ttf
Binary file not shown.
Binary file modified devel/html/fonts/fontawesome-webfont.woff
Binary file not shown.
Binary file removed devel/html/fonts/fontawesome-webfont.woff2
Binary file not shown.
7 changes: 5 additions & 2 deletions devel/html/js/highlight.pack.js

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions devel/html/js/jquery-1.10.2.min.js

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -27,6 +27,7 @@ pages:
- Advanced:
- 'Configuration': 'configureMlr.md'
- 'Wrapped Learners': 'wrapper.md'
- 'Preprocessing Operators (CPO)': 'cpo.md'
- 'Imputation': 'impute.md'
- 'Bagging': 'bagging.md'
- 'Advanced Tuning': 'advanced_tune.md'
3 changes: 2 additions & 1 deletion r_packs_install.R
@@ -2,8 +2,9 @@ if (!require(devtools)) {
  install.packages("devtools")
  library(devtools)
}
install_github("berndbischl/ParamHelpers", ref = "paramSetSugar")
install_github("mlr-org/mlr")

install_github("mlr-org/mlrCPO")
print("LIBPATHS:")
print(.libPaths())

278 changes: 278 additions & 0 deletions src/cpo.Rmd
@@ -0,0 +1,278 @@
# Preprocessing Operators

Data preprocessing can be performed using the "[%mlrCPO]" ("Composable Preprocessing Operators") addon package for [%mlr].
[%mlrCPO] makes it easy to use a variety of preprocessing operations, to chain different operations, to integrate
preprocessing with mlr [&Learner]s, and to define custom preprocessing operations.

[%mlrCPO] provides the `%>>%`-operator, which is used as a piping operator: It chains different operations,
it applies an operation to a dataset,
and it attaches an operation to a [&Learner] to create an integrated preprocessing and model fitting pipeline.
This way, it is possible to quickly create natural looking pipelines that are very flexible and can even be
[tuned](tune.md) over.

This tutorial covers the basics of using [%mlrCPO] for preprocessing in combination with mlr [Learner](learner.md)s. For a more in-depth introduction, see the
[%mlrCPO] vignette, which can be opened with:
```{r, eval = FALSE}
vignette("a_1_getting_started", package = "mlrCPO")
```

The following requires the [%mlrCPO] package to be loaded:
```{r}
library("mlrCPO")
```

## CPO Objects

Different preprocessing operations are provided in the form of **[CPO Constructors](&mlrCPO::CPOConstructor)**,
which can be called like functions
to create **[CPO](&mlrCPO::CPO)** objects. These [CPO](&mlrCPO::CPO) objects are then used to apply the operation to a data set.
```{r}
cpoAddCols # a cpo constructor
```
```{r}
# create a CPO object that adds a new column
cpo = cpoAddCols(Sepal.Area = Sepal.Length * Sepal.Width)
```

[CPO](&mlrCPO::CPO) objects are central to [%mlrCPO], and they are very flexible. They can be applied to a
[data.frame](&base::data.frame) or a [&Task]:
```{r}
head(iris %>>% cpo)
```
```{r}
head(getTaskData(iris.task %>>% cpo))
```

[CPO](&mlrCPO::CPO)s can be [concatenated](&mlrCPO::composeCPO) to create new operations. The following example adds the `Sepal.Area` column and then scales
and centers all numeric columns:
```{r}
cpo %>>% cpoScale()
```

[CPO](&mlrCPO::CPO)s can be [fused with a Learner](&mlrCPO::attachCPO) to create a machine learning pipeline that performs
preprocessing on the training data
and also preprocesses the data that is fed to the resulting model for prediction.
```{r}
lrn = cpo %>>% makeLearner("classif.randomForest")
model = train(lrn, iris.task)
getFeatureImportance(model$learner.model$next.model)
```
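
To illustrate the prediction half of such a pipeline, the following sketch trains a pipeline and then predicts on rows that contain only the original `iris` columns; the attached [CPO](&mlrCPO::CPO) is expected to add `Sepal.Area` before the model sees the data. The learner `classif.rpart` is swapped in here only to keep dependencies light (it assumes the `rpart` package is installed), and the names `pipeline`, `pipeline.model` and the row selection are chosen for illustration.

```{r}
library("mlrCPO")

# attach the column-adding CPO to a learner (rpart is an illustrative choice)
pipeline = cpoAddCols(Sepal.Area = Sepal.Length * Sepal.Width) %>>%
  makeLearner("classif.rpart")
pipeline.model = train(pipeline, iris.task)

# the new data has only the original iris columns, without Sepal.Area
newdata = iris[c(1, 51, 101), ]
pipeline.pred = predict(pipeline.model, newdata = newdata)
as.data.frame(pipeline.pred)
```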

A list of all built-in [CPO](&mlrCPO::CPO)s can be retrieved using [`listCPO()`](&mlrCPO::listCPO), which returns a [data.frame](&base::data.frame) of names, categories, and descriptions.
```{r}
listCPO()
```

## Hyperparameters

[CPO](&mlrCPO::CPO) objects have hyperparameters that can be adjusted at creation, or later using [`setHyperPars()`](&setHyperPars). They are shown by the
[CPO Constructor](&mlrCPO::CPOConstructor) representation when printed, and can be given as parameters during construction.
```{r}
cpoScale
```
```{r}
do.center = cpoScale(scale = FALSE, center = TRUE)
```

The [`ParamSet`](&ParamHelpers::ParamSet) of a [CPO](&mlrCPO::CPO) can be inspected using [`getParamSet()`](&getParamSet), but it is also shown when verbosely printing a [CPO](&mlrCPO::CPO) using `!`.
```{r}
!do.center # note the 'scale.' prefix
```
```{r}
do.scale = setHyperPars(do.center,
  scale.scale = TRUE, scale.center = FALSE)
do.scale
```

These hyperparameters even survive CPO [composition](&mlrCPO::composeCPO) and [attachment](&mlrCPO::attachCPO) to [&Learner]s:
```{r}
cpo = cpoScale() %>>% cpoPca()
lrn = cpo %>>% makeLearner("classif.logreg")
print(lrn)
```

When composing many [CPO](&mlrCPO::CPO)s, the [`ParamSet`](&ParamHelpers::ParamSet) of the combined [CPO](&mlrCPO::CPO) can become quite cluttered. To prevent name clashes, it is possible
to change the prefix of the hyperparameters of a given [CPO](&mlrCPO::CPO) using the *[ID](&mlrCPO::getCPOId)*. It can be set during construction, or by using [`setCPOId()`](&mlrCPO::setCPOId).

```{r}
combined = cpoScale(scale = TRUE, center = FALSE, id = "scale") %>>%
  cpoScale(scale = FALSE, center = TRUE, id = "center")
getParamSet(combined)
```
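
The ID can also be changed after construction. A minimal sketch, assuming the default `cpoScale()` parameters `center` and `scale`; the ID `"first"` and the variable name `renamed` are arbitrary:

```{r}
library("mlrCPO")

# give an existing CPO a new ID after construction
renamed = setCPOId(cpoScale(), "first")

getCPOId(renamed)
# the hyperparameter names now carry the new prefix
getParamIds(getParamSet(renamed))
```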

Another possibility is to change what parameters are "exported" by the [CPO](&mlrCPO::CPO). A parameter that is not exported cannot be changed
after construction. The `export` parameter given during construction can be a [character](&base::character) vector of the parameters to export.
```{r}
center = cpoScale(scale = FALSE, center = TRUE, export = "center")
!center
```

## Affecting Only Some Features

It is possible to set up a [CPO](&mlrCPO::CPO) so that it only affects certain columns of a given dataset. This is done with a few
parameters during construction that begin with the prefix "`affect.`". The following example only scales and centers columns
that begin with "Sepal".

```{r}
cpo = cpoScale(affect.pattern = "^Sepal")
head(iris %>>% cpo)
```
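
Columns can also be selected by name instead of by pattern. The following sketch uses the `affect.names` parameter to scale only the two `Petal` columns and then checks that a `Sepal` column is left as it was; the variable name `petal.scaled` is chosen for illustration.

```{r}
library("mlrCPO")

# scale and center only the two named columns
petal.scaled = iris %>>% cpoScale(affect.names = c("Petal.Length", "Petal.Width"))

# the Petal columns are now centered around zero
colMeans(petal.scaled[c("Petal.Length", "Petal.Width")])
# the Sepal columns are not affected
all.equal(petal.scaled$Sepal.Length, iris$Sepal.Length)
```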

## `CPOTrained`: Retrafo and Inverter

Manipulating data for preprocessing itself is relatively easy. A challenge comes when one wants to integrate preprocessing
into a machine-learning pipeline: The same preprocessing steps that are performed on the [training data](train.md)
need to be performed on the new [prediction data](predict.md). However, the transformation performed for prediction often needs
information from the training step.
For example, if training entails performing [PCA](&mlrCPO::cpoPca),
then for prediction, the data must not undergo another PCA; instead, it needs
to be rotated by the rotation matrix found by the training PCA. The process of obtaining the rotation matrix is called
"training" the [CPO](&mlrCPO::CPO), and the object that contains the trained information is a **[`CPOTrained`](&mlrCPO::CPOTrained)** object; it can be accessed using
the [`retrafo()`](&mlrCPO::retrafo) function on the transformed data. When a [CPO](&mlrCPO::CPO) has an effect
on the *target* columns of a Task, two [`CPOTrained`](&mlrCPO::CPOTrained) objects are generated: One, as before, is used on new prediction data before
doing prediction with a model. The other is used on predictions made with that model, to map the prediction back to the space
of the original target column. This inverting [`CPOTrained`](&mlrCPO::CPOTrained) can be accessed using [`inverter()`](&mlrCPO::inverter) on transformed data.
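
The PCA case described above can be sketched as follows. The split of `iris` into "training" and "new" rows is hypothetical and only serves to show that the new rows are transformed with the information obtained from the training rows.

```{r}
library("mlrCPO")

# hypothetical split into "training" rows and "new" rows
train.rows = c(1:40, 51:90, 101:140)
train.data = iris[train.rows, ]
new.data = iris[-train.rows, ]

# training: the PCA is computed on the training rows
train.trafo = train.data %>>% cpoPca()

# the CPOTrained object is retrieved from the transformed data
retr = retrafo(train.trafo)

# prediction: the new rows are rotated using the rotation matrix found in training
new.trafo = new.data %>>% retr
head(new.trafo)
```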

The process of using [`CPOTrained`](&mlrCPO::CPOTrained) correctly can be a bit involved, but [%mlrCPO] automates it when a [CPO](&mlrCPO::CPO) is attached to a
[&Learner] object (see the [following section](#cpo-learner)). The [`CPOTrained`](&mlrCPO::CPOTrained) objects are explained in more detail in the [%mlrCPO] vignette.
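
For a [CPO](&mlrCPO::CPO) that changes the target column, the manual workflow might look like the following sketch. It assumes the target-transforming `cpoLogTrafoRegr()` from [%mlrCPO] and the `invert()` function for applying the inverter to a prediction; the regression task on `Petal.Width`, its ID, and the choice of `regr.lm` are made up for illustration.

```{r}
library("mlrCPO")

# regression task on iris, predicting Petal.Width (chosen for illustration)
regr.task = makeRegrTask("iris.regr", iris, target = "Petal.Width")

# log-transform the target column; the transformed task carries an inverter
log.task = regr.task %>>% cpoLogTrafoRegr()
inv = inverter(log.task)

# the model is trained, and predicts, on the log scale
log.model = train(makeLearner("regr.lm"), log.task)
log.pred = predict(log.model, log.task)

# map the predictions back to the original scale of Petal.Width
orig.pred = invert(inv, log.pred)
head(as.data.frame(orig.pred))
```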

## CPO Learner

When attaching a [CPO](&mlrCPO::CPO) to a Learner using the `%>>%`-operator, the complete preprocessing pipeline is integrated by [%mlrCPO], so there is no need to
worry about keeping [`CPOTrained`](&mlrCPO::CPOTrained) objects. The resulting **[`CPOLearner`](&mlrCPO::CPOLearner)** inherits the hyperparameters both from the [CPO](&mlrCPO::CPO) *and* the [&Learner]. This way,
the function of a [CPO](&mlrCPO::CPO) can be *[tuned](tune.md)* together with parameters of a [&Learner] itself.

When a [`CPOLearner`](&mlrCPO::CPOLearner) is trained on some data, it is possible to get information about the effect of an attached [CPO](&mlrCPO::CPO) by
inspecting the [`CPOTrained`](&mlrCPO::CPOTrained) object created during training. It can be retrieved from a model using [`retrafo()`](&mlrCPO::retrafo) and inspected
using [`getCPOTrainedState()`](&mlrCPO::getCPOTrainedState). The following example retrieves the PCA rotation matrix trained when fitting a [`CPOLearner`](&mlrCPO::CPOLearner) to [`iris.task`](&iris.task).

```{r}
lrn = cpoPca() %>>% makeLearner("classif.randomForest")
model = train(lrn, iris.task)

retr = retrafo(model)
state = getCPOTrainedState(retr)
state$control$rotation
```

## Tuning
Tuning [CPO](&mlrCPO::CPO) hyperparameters works exactly like [tuning Learner hyperparameters](tune.md), since the [CPO](&mlrCPO::CPO)'s parameters are attached naturally to a [&Learner]'s parameters when a [`CPOLearner`](&mlrCPO::CPOLearner)
is formed.

```{r}
(clrn = cpoFilterFeatures(export = c("method", "abs")) %>>% makeLearner("classif.knn"))
```
```{r}
getParamIds(getParamSet(clrn))
```
```{r}
ps = makeParamSet(
  makeDiscreteParam(
    "filterFeatures.method",
    values = list("anova.test", "variance", "chi.squared")),
  makeIntegerParam(
    "filterFeatures.abs",
    lower = 1, upper = 8),
  makeIntegerParam(
    "k",
    lower = 1, upper = 10))

tuneParams(clrn, pid.task, cv5, par.set = ps,
  control = makeTuneControlRandom(budget = 10),
  show.info = FALSE)
```

## Special CPOs

### NULLCPO

Under certain circumstances it can be useful to represent the operation of *no preprocessing*. This is done using the [`NULLCPO`](&mlrCPO::NULLCPO) object. If it is [applied](&mlrCPO::applyCPO) to data, [attached](&mlrCPO::attachCPO) to a [&Learner] or [composed](&mlrCPO::composeCPO) with another [CPO](&mlrCPO::CPO), the result is not modified.

```{r}
identical(iris %>>% NULLCPO, iris)
identical(cpoPca() %>>% NULLCPO, cpoPca())
identical(NULLCPO %>>% makeLearner("classif.logreg"), makeLearner("classif.logreg"))
```

### CPO Multiplexer
The multiplexer makes it possible to combine many [CPO](&mlrCPO::CPO)s into one, with an extra `selected.cpo` parameter that chooses between them.

```{r}
cpm = cpoMultiplex(list(cpoScale, cpoPca))
!cpm
```
```{r}
head(iris %>>% setHyperPars(cpm, selected.cpo = "scale"))
```
```{r}
head(iris %>>% setHyperPars(cpm, selected.cpo = "pca"))
```

Every [CPO](&mlrCPO::CPO)'s hyperparameters are exported:
```{r}
head(iris %>>% setHyperPars(cpm, selected.cpo = "scale", scale.center = FALSE))
```

This makes it possible to [tune](tune.md) over many different [CPO](&mlrCPO::CPO) configurations at once.
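
As a rough sketch of what such tuning could look like, the multiplexer can be attached to a [&Learner] and `selected.cpo` treated as an ordinary discrete tuning parameter. The learner `classif.rpart`, the resampling strategy `cv3`, the grid search, and the variable names prefixed with `mplx.` are illustrative choices, not recommendations.

```{r}
library("mlrCPO")

# multiplexer attached to a learner (rpart is an illustrative choice)
mplx.lrn = cpoMultiplex(list(cpoScale, cpoPca)) %>>% makeLearner("classif.rpart")

# search space: which of the two preprocessing operations to use
mplx.ps = makeParamSet(
  makeDiscreteParam("selected.cpo", values = c("scale", "pca")))

mplx.res = tuneParams(mplx.lrn, iris.task, cv3, par.set = mplx.ps,
  control = makeTuneControlGrid(),
  show.info = FALSE)

# the preprocessing operation that performed best in resampling
mplx.res$x
```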

### CBind CPO
[`cpoCbind`](&mlrCPO::cpoCbind) applies [`cbind`](&base::cbind) to the results of multiple [CPO](&mlrCPO::CPO)s. This makes it possible to build [CPO](&mlrCPO::CPO)s that perform different operations on data and paste the results next to each other.

```{r}
cbnd = cpoCbind(scaled = cpoScale(), pca = cpoPca())
head(iris %>>% cbnd)
```
It is even possible to build complex DAGs of preprocessing operators. In the following example, [`cpoCbind`](&mlrCPO::cpoCbind) recognizes that [`cpoFilterVariance`](&mlrCPO::cpoFilterVariance) comes
before both [`cpoScale`](&mlrCPO::cpoScale) *and* [`cpoPca`](&mlrCPO::cpoPca) and performs filtering only once.
The original data is pasted next to the scaled and PCA'd data by having a [`NULLCPO`](&mlrCPO::NULLCPO) slot
which does not change any data.
```{r}
flt = cpoFilterVariance(abs = 2, export = "abs")
cbnd = cpoCbind(scale = flt %>>% cpoScale(), pca = flt %>>% cpoPca(), NULLCPO)
head(getTaskData(iris.task %>>% cbnd))
```
The order of operations can be inspected in a crude ASCII graph when looking at the verbose printout of `cbnd`. The output of `variance` is fed into both `pca` and `scale`.
```{r}
!cbnd
```
The parameters of the internal [CPO](&mlrCPO::CPO)s are exported and can be manipulated and [tuned](tune.md).
```{r}
getParamSet(cbnd)
```

## Custom CPOs

Even though [CPO](&mlrCPO::CPO)s are very flexible and can be combined in many ways, it may be necessary to create completely custom CPOs.
Custom CPOs can be created using the [`makeCPO()`](&mlrCPO::makeCPO) function (and similar related functions).
Its most important arguments are `cpo.train` and `cpo.retrafo`, both of which are functions.
In principle, a [CPO](&mlrCPO::CPO) needs a function that "trains" a control object depending on the data (`cpo.train`),
and another function that uses this control object, and new data, to perform the preprocessing operation (`cpo.retrafo`).
The `cpo.train`-function must return a "control" object which contains all information about how to transform a given dataset.
`cpo.retrafo` takes a (potentially new!) dataset *and* the "control" object returned by `cpo.train`, and transforms the new data according to plan.
See [%mlrCPO] vignettes or [`help(makeCPO)`](&mlrCPO::makeCPO) for a more thorough description of how to create custom CPOs.
```{r}
names(formals(makeCPO)) # see help(makeCPO) for explanation of arguments
```

```{r}
constFeatRem = makeCPO("constFeatRem",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    names(Filter(function(x) {  # names of columns to keep
      length(unique(x)) > 1
    }, data))
  },
  cpo.retrafo = function(data, control) {
    data[control]
  })

!constFeatRem
```
This [CPO](&mlrCPO::CPO) can be used on the [`head()`](&utils::head) of the [`iris`](&datasets::iris) dataset. Since the "Species" entry for the first six rows of [`iris`](&datasets::iris) is constant, it is removed
by this [CPO](&mlrCPO::CPO).
```{r}
head(iris)
```
```{r}
head(iris) %>>% constFeatRem()
```
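
The role of the "control" object becomes clearer when a custom [CPO](&mlrCPO::CPO) is applied to new data. The following sketch defines a hypothetical `centerNumerics` CPO, following the same `makeCPO()` pattern as above, that stores the column means of the training data and subtracts them from whatever data it is later given. The CPO name and the split of `iris` into training and new rows are made up for illustration.

```{r}
library("mlrCPO")

centerNumerics = makeCPO("centerNumerics",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    # control object: means of all numeric columns of the training data
    vapply(Filter(is.numeric, data), mean, numeric(1))
  },
  cpo.retrafo = function(data, control) {
    # subtract the stored training means from the (possibly new) data
    for (col in names(control)) {
      data[[col]] = data[[col]] - control[[col]]
    }
    data
  })

# "train" the CPO on the first 100 rows
centered = iris[1:100, ] %>>% centerNumerics()

# apply the trained CPO to the remaining rows: they are shifted by the
# training means, not by their own means
centered.new = iris[101:150, ] %>>% retrafo(centered)
head(centered.new)
```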