mlr-archive · mb706 · Aug 24, 2017 · Sep 6, 2017
diff --git a/devel/html/cpo/index.html b/devel/html/cpo/index.html
diff --git a/devel/html/fonts/FontAwesome.otf b/devel/html/fonts/FontAwesome.otf
diff --git a/devel/html/fonts/fontawesome-webfont.eot b/devel/html/fonts/fontawesome-webfont.eot
diff --git a/devel/html/fonts/fontawesome-webfont.svg b/devel/html/fonts/fontawesome-webfont.svg
diff --git a/devel/html/fonts/fontawesome-webfont.ttf b/devel/html/fonts/fontawesome-webfont.ttf
diff --git a/devel/html/fonts/fontawesome-webfont.woff b/devel/html/fonts/fontawesome-webfont.woff
diff --git a/devel/html/fonts/fontawesome-webfont.woff2 b/devel/html/fonts/fontawesome-webfont.woff2
diff --git a/devel/html/js/highlight.pack.js b/devel/html/js/highlight.pack.js
diff --git a/devel/html/js/jquery-1.10.2.min.js b/devel/html/js/jquery-1.10.2.min.js
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -27,6 +27,7 @@ pages:
 - Advanced:
     - 'Configuration': 'configureMlr.md'
     - 'Wrapped Learners': 'wrapper.md'
+    - 'Preprocessing Operators (CPO)': 'cpo.md'
     - 'Imputation': 'impute.md'
     - 'Bagging': 'bagging.md'
     - 'Advanced Tuning': 'advanced_tune.md'

diff --git a/r_packs_install.R b/r_packs_install.R
@@ -2,8 +2,9 @@ if (!require(devtools)) {
   install.packages("devtools")
   library(devtools)
 }
+install_github("berndbischl/ParamHelpers", ref = "paramSetSugar")
 install_github("mlr-org/mlr")
-
+install_github("mlr-org/mlrCPO")
 print("LIBPATHS:")
 print(.libPaths())
 

diff --git a/src/cpo.Rmd b/src/cpo.Rmd
@@ -0,0 +1,278 @@
+# Preprocessing Operators
+
+Data preprocessing can be performed using the "[%mlrCPO]" ("Composable Preprocessing Operators") addon package for [%mlr].
+[%mlrCPO] makes it easy to use a variety of preprocessing operations, to chain different operations, to integrate
+preprocessing with mlr [&Learner]s, and to define custom preprocessing operations.
+
+[%mlrCPO] provides the `%>>%`-operator, which is used as a piping operator: It chains different operations,
+it applies an operation to a dataset,
+and it attaches an operation to a [&Learner] to create an integrated preprocessing and model fitting pipeline.
+This way, it is possible to quickly create natural looking pipelines that are very flexible and can even be
+[tuned](tune.md) over.
+
+This tutorial covers the basics of using [%mlrCPO] for preprocessing in combination with mlr [Learner](learner.md)s. For a more in-depth introduction, see the
+[%mlrCPO] vignette, which can be opened using
+```{r, eval = FALSE}
+vignette("a_1_getting_started", package = "mlrCPO")
+```
+
+The following requires the [%mlrCPO] package to be loaded:
+```{r}
+library("mlrCPO")
+```
+
+## CPO Objects
+
+Different preprocessing operations are provided in the form of **[CPO Constructors](&mlrCPO::CPOConstructor)**,
+which can be called like functions
+to create **[CPO](&mlrCPO::CPO)** objects. These [CPO](&mlrCPO::CPO) objects are then used to apply the operation to a data set.
+```{r}
+cpoAddCols  # a cpo constructor
+```
+```{r}
+# create a CPO object that adds a new column
+cpo = cpoAddCols(Sepal.Area = Sepal.Length * Sepal.Width) 
+```
+
+[CPO](&mlrCPO::CPO) objects are central to [%mlrCPO], and they are very flexible. They can be applied to a
+[data.frame](&base::data.frame) or a [&Task]:
+```{r}
+head(iris %>>% cpo)
+```
+```{r}
+head(getTaskData(iris.task %>>% cpo))
+```
+
+[CPO](&mlrCPO::CPO)s can be [concatenated](&mlrCPO::composeCPO) to create new operations. The following example adds the `Sepal.Area` column and then scales
+and centers all numeric columns:
+```{r}
+cpo %>>% cpoScale()
+```
+
+[CPO](&mlrCPO::CPO)s can be [fused with a Learner](&mlrCPO::attachCPO) to create a machine learning pipeline that performs
+preprocessing on the training data
+and also preprocesses the data that is fed to the resulting model for prediction.
+```{r}
+lrn = cpo %>>% makeLearner("classif.randomForest")
+model = train(lrn, iris.task)
+getFeatureImportance(model$learner.model$next.model)
+```
+
+A list of all built-in [CPO](&mlrCPO::CPO)s can be retrieved using [`listCPO()`](&mlrCPO::listCPO), which returns a [data.frame](&base::data.frame) of names, categories, and descriptions.
+```{r}
+listCPO()
+```
+
+## Hyperparameters
+
+[CPO](&mlrCPO::CPO) objects have hyperparameters that can be adjusted at creation, or later using [`setHyperPars()`](&setHyperPars). They are shown by the
+[CPO Constructor](&mlrCPO::CPOConstructor) representation when printed, and can be given as parameters during construction.
+```{r}
+cpoScale
+```
+```{r}
+do.center = cpoScale(scale = FALSE, center = TRUE)
+```
+
+The [`ParamSet`](&ParamHelpers::ParamSet) of a [CPO](&mlrCPO::CPO) can be inspected using [`getParamSet()`](&getParamSet), but it is also shown when verbosely printing a [CPO](&mlrCPO::CPO) using `!`.
+```{r}
+!do.center  # note the 'scale.' prefix
+```
+```{r}
+do.scale = setHyperPars(do.center,
+  scale.scale = TRUE, scale.center = FALSE)
+do.scale
+```
+
+These hyperparameters even survive CPO [composition](&mlrCPO::composeCPO) and [attachment](&mlrCPO::attachCPO) to [&Learner]s:
+```{r}
+cpo = cpoScale() %>>% cpoPca()
+lrn = cpo %>>% makeLearner("classif.logreg")
+print(lrn)
+```
+
+When composing many [CPO](&mlrCPO::CPO)s, the [`ParamSet`](&ParamHelpers::ParamSet) of the combined [CPO](&mlrCPO::CPO) can become quite cluttered. To prevent name clashes, it is possible
+to change the prefix of the hyperparameters of a given [CPO](&mlrCPO::CPO) using the *[ID](&mlrCPO::getCPOId)*. It can be set during construction, or by using [`setCPOId()`](&mlrCPO::setCPOId).
+
+```{r}
+combined = cpoScale(scale = TRUE, center = FALSE, id = "scale") %>>%
+  cpoScale(scale = FALSE, center = TRUE, id = "center")
+getParamSet(combined)
+```
+
+Another possibility is to change what parameters are "exported" by the [CPO](&mlrCPO::CPO). A parameter that is not exported cannot be changed
+after construction. The `export` parameter given during construction can be a [character](&base::character) vector of the parameters to export.
+```{r}
+center = cpoScale(scale = FALSE, center = TRUE, export = "center")
+!center
+```
+
+## Affecting Only Some Features
+
+It is possible to set up a [CPO](&mlrCPO::CPO) so that it only affects certain columns of a given dataset. This is done with a few
+parameters during construction that begin with the prefix "`affect.`". The following example only scales and centers columns
+that begin with "Sepal".
+
+```{r}
+cpo = cpoScale(affect.pattern = "^Sepal")
+head(iris %>>% cpo)
+```
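+
+To check which columns a pattern such as `"^Sepal"` selects, the same regular expression can be tried on the column names in base R.
+The following is only a sketch of the selection idea and does not use [%mlrCPO]; the names `sel` and `scaled` are made up for this illustration.
+
+```{r}
+# names of the columns that start with "Sepal"
+sel = grep("^Sepal", names(iris), value = TRUE)
+sel
+
+# center and scale only the selected columns, leave all others untouched
+scaled = iris
+scaled[sel] = lapply(iris[sel], function(x) (x - mean(x)) / sd(x))
+head(scaled)
+```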
+
+## `CPOTrained`: Retrafo and Inverter
+
+Manipulating data for preprocessing itself is relatively easy. A challenge comes when one wants to integrate preprocessing
+into a machine-learning pipeline: The same preprocessing steps that are performed on the [training data](train.md)
+need to be performed on the new [prediction data](predict.md). However, the transformation performed for prediction often needs
+information from the training step.
+For example, if training entails performing [PCA](&mlrCPO::cpoPca),
+then for prediction, the data must not undergo another PCA. Instead, it needs
+to be rotated by the rotation matrix found by the training PCA. The process of obtaining the rotation matrix is called
+"training" the [CPO](&mlrCPO::CPO), and the object that contains the trained information is a **[`CPOTrained`](&mlrCPO::CPOTrained)** object; it can be accessed using
+the [`retrafo()`](&mlrCPO::retrafo) function on the transformed data. When a [CPO](&mlrCPO::CPO) has an effect
+on the *target* columns of a Task, two [`CPOTrained`](&mlrCPO::CPOTrained) objects are generated: One, as before, is used on new prediction data before
+doing prediction with a model. The other is used on predictions made with that model, to map the prediction back to the space
+of the original target column. This inverting [`CPOTrained`](&mlrCPO::CPOTrained) can be accessed using [`inverter()`](&mlrCPO::inverter) on transformed data.
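+
+The PCA example can be sketched in base R, without [%mlrCPO], to show what "training" and re-applying a transformation means.
+The split into `train.data` and `new.data` and the `control` list are assumptions made for this illustration; they are not how [%mlrCPO] stores its state internally.
+
+```{r}
+# split the iris features into "training" rows and "new" rows
+feats = iris[1:4]
+train.idx = seq(1, nrow(feats), by = 2)
+train.data = feats[train.idx, ]
+new.data = feats[-train.idx, ]
+
+# "training": find the centering vector and rotation matrix on the training data
+pca = prcomp(train.data, center = TRUE, scale. = FALSE)
+control = list(center = pca$center, rotation = pca$rotation)
+
+# "retrafo": re-use the stored center and rotation on new data, without a second PCA
+# (this is the same computation that predict(pca, new.data) performs)
+new.rotated = sweep(as.matrix(new.data), 2, control$center) %*% control$rotation
+head(new.rotated)
+```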
+
+The process of using [`CPOTrained`](&mlrCPO::CPOTrained) correctly can be a bit involved, but [%mlrCPO] automates it when a [CPO](&mlrCPO::CPO) is attached to a
+[&Learner] object; see the [following section](#cpo-learner). The [`CPOTrained`](&mlrCPO::CPOTrained) objects are explained in more detail in the [%mlrCPO] vignette.
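+
+The idea behind the inverting step can also be sketched in base R. The example below assumes a log-transformation of the target of the `cars` dataset, chosen only for illustration;
+the function `invert` is a hypothetical stand-in for what an inverter does, not an [%mlrCPO] function.
+
+```{r}
+# transform the target before training: model log(dist) instead of dist
+train.data = transform(cars, dist = log(dist))
+fit = lm(dist ~ speed, data = train.data)
+
+# predictions made by this model live on the log scale ...
+pred.log = predict(fit, newdata = data.frame(speed = c(10, 20)))
+
+# ... so the "inverter" maps them back to the scale of the original target
+invert = function(prediction) exp(prediction)
+invert(pred.log)
+```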
+
+## CPO Learner
+
+When attaching a [CPO](&mlrCPO::CPO) to a Learner using the `%>>%`-operator, the complete preprocessing pipeline is integrated by [%mlrCPO], so there is no need to
+worry about keeping [`CPOTrained`](&mlrCPO::CPOTrained) objects. The resulting **[`CPOLearner`](&mlrCPO::CPOLearner)** inherits the hyperparameters both from the [CPO](&mlrCPO::CPO) *and* the [&Learner]. This way,
+the function of a [CPO](&mlrCPO::CPO) can be *[tuned](tune.md)* together with parameters of a [&Learner] itself.
+
+When a [`CPOLearner`](&mlrCPO::CPOLearner) is trained on some data, it is possible to get information about the effect of an attached [CPO](&mlrCPO::CPO) by
+inspecting the [`CPOTrained`](&mlrCPO::CPOTrained) object created during training. It can be retrieved from a model using [`retrafo()`](&mlrCPO::retrafo) and inspected
+using [`getCPOTrainedState()`](&mlrCPO::getCPOTrainedState). The following example retrieves the PCA rotation matrix trained when fitting a [`CPOLearner`](&mlrCPO::CPOLearner) to [`iris.task`](&iris.task).
+
+```{r}
+lrn = cpoPca() %>>% makeLearner("classif.randomForest")
+model = train(lrn, iris.task)
+
+retr = retrafo(model)
+state = getCPOTrainedState(retr)
+state$control$rotation
+```
+
+## Tuning
+Tuning [CPO](&mlrCPO::CPO) hyperparameters works exactly like [tuning Learner hyperparameters](tune.md), since the [CPO](&mlrCPO::CPO)'s parameters are attached naturally to a [&Learner]'s parameters when a [`CPOLearner`](&mlrCPO::CPOLearner)
+is formed.
+
+```{r}
+(clrn = cpoFilterFeatures(export = c("method", "abs")) %>>% makeLearner("classif.knn"))
+```
+```{r}
+getParamIds(getParamSet(clrn))
+```
+```{r}
+ps = makeParamSet(
+    makeDiscreteParam(
+        "filterFeatures.method",
+        values = list("anova.test", "variance", "chi.squared")),
+    makeIntegerParam(
+        "filterFeatures.abs",
+        lower = 1, upper = 8),
+    makeIntegerParam(
+        "k",
+        lower = 1, upper = 10))
+
+tuneParams(clrn, pid.task, cv5, par.set = ps,
+           control = makeTuneControlRandom(budget = 10),
+           show.info = FALSE)
+```
+
+## Special CPOs
+
+### NULLCPO
+
+Under certain circumstances it can be useful to represent the operation of *no preprocessing*. This is done using the [`NULLCPO`](&mlrCPO::NULLCPO) object. If it is [applied](&mlrCPO::applyCPO) to data, [attached](&mlrCPO::attachCPO) to a [&Learner] or [composed](&mlrCPO::composeCPO) with another [CPO](&mlrCPO::CPO), the result is not modified.
+
+```{r}
+identical(iris %>>% NULLCPO, iris)
+identical(cpoPca() %>>% NULLCPO, cpoPca())
+identical(NULLCPO %>>% makeLearner("classif.logreg"), makeLearner("classif.logreg"))
+```
+
+### CPO Multiplexer
+The multiplexer makes it possible to combine many [CPO](&mlrCPO::CPO)s into one, with an extra `selected.cpo` parameter that chooses between them.
+
+```{r}
+cpm = cpoMultiplex(list(cpoScale, cpoPca))
+!cpm
+```
+```{r}
+head(iris %>>% setHyperPars(cpm, selected.cpo = "scale"))
+```
+```{r}
+head(iris %>>% setHyperPars(cpm, selected.cpo = "pca"))
+```
+
+The hyperparameters of every [CPO](&mlrCPO::CPO) are exported:
+```{r}
+head(iris %>>% setHyperPars(cpm, selected.cpo = "scale", scale.center = FALSE))
+```
+
+This makes it possible to [tune](tune.md) over many different [CPO](&mlrCPO::CPO) configurations at once.
+
+### CBind CPO
+[`cpoCbind`](&mlrCPO::cpoCbind) applies [`cbind`](&base::cbind) to the results of multiple [CPO](&mlrCPO::CPO)s. This makes it possible to build [CPO](&mlrCPO::CPO)s that perform different operations on data and paste the results next to each other.
+
+```{r}
+cbnd = cpoCbind(scaled = cpoScale(), pca = cpoPca())
+head(iris %>>% cbnd)
+```
+It is even possible to build complex DAGs of preprocessing operators. In the following example, [`cpoCbind`](&mlrCPO::cpoCbind) recognizes that [`cpoFilterVariance`](&mlrCPO::cpoFilterVariance) comes
+before both [`cpoScale`](&mlrCPO::cpoScale) *and* [`cpoPca`](&mlrCPO::cpoPca) and performs filtering only once.
+The original data is pasted next to the scaled and PCA'd data by having a [`NULLCPO`](&mlrCPO::NULLCPO) slot
+which does not change any data.
+```{r}
+flt = cpoFilterVariance(abs = 2, export = "abs")
+cbnd = cpoCbind(scale = flt %>>% cpoScale(), pca = flt %>>% cpoPca(), NULLCPO)
+head(getTaskData(iris.task %>>% cbnd))
+```
+The order of operations can be inspected in a crude ASCII graph when looking at the verbose printout of `cbnd`. The output of `variance` is fed into both `pca` and `scale`.
+```{r}
+!cbnd
+```
+The parameters of the internal [CPO](&mlrCPO::CPO)s are exported and can be manipulated and [tuned](tune.md).
+```{r}
+getParamSet(cbnd)
+```
+
+## Custom CPOs
+
+Even though [CPO](&mlrCPO::CPO)s are very flexible and can be combined in many ways, it may be necessary to create completely custom CPOs.
+Custom CPOs can be created using the [`makeCPO()`](&mlrCPO::makeCPO) function (and similar related functions).
+Its most important arguments are `cpo.train` and `cpo.retrafo`, both of which are functions.
+In principle, a [CPO](&mlrCPO::CPO) needs a function that "trains" a control object depending on the data (`cpo.train`),
+and another function that uses this control object, and new data, to perform the preprocessing operation (`cpo.retrafo`).
+The `cpo.train`-function must return a "control" object which contains all information about how to transform a given dataset.
+`cpo.retrafo` takes a (potentially new!) dataset *and* the "control" object returned by `cpo.train`, and transforms the new data according to plan.
+See [%mlrCPO] vignettes or [`help(makeCPO)`](&mlrCPO::makeCPO) for a more thorough description of how to create custom CPOs.
+```{r}
+names(formals(makeCPO))  # see help(makeCPO) for explanation of arguments
+```
+
+```{r}
+constFeatRem = makeCPO("constFeatRem",
+  dataformat = "df.features",
+  cpo.train = function(data, target) {
+    names(Filter(function(x) {  # names of columns to keep
+      length(unique(x)) > 1
+    }, data))
+  },
+  cpo.retrafo = function(data, control) {
+    data[control]
+  })
+
+!constFeatRem
+```
+This [CPO](&mlrCPO::CPO) can be used on the [`head()`](&utils::head) of the [`iris`](&datasets::iris) dataset. Since the "Species" entry for the first six rows of [`iris`](&datasets::iris) is constant, it is removed
+by this [CPO](&mlrCPO::CPO).
+```{r}
+head(iris)
+```
+```{r}
+head(iris) %>>% constFeatRem()
+```
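+
+To see how the control object connects the two steps, the two functions can be written as ordinary R functions and called by hand.
+This is a sketch outside of [%mlrCPO]; the names `train.fun` and `retrafo.fun` are made up here, and `target` is passed as `NULL` because this operation does not use it.
+
+```{r}
+# the two functions from above, written as plain R functions
+train.fun = function(data, target) {
+  names(Filter(function(x) length(unique(x)) > 1, data))  # columns to keep
+}
+retrafo.fun = function(data, control) {
+  data[control]
+}
+
+# "train" on the first six rows: the control object is a vector of column names
+control = train.fun(head(iris), target = NULL)
+control
+
+# apply the stored column selection to different rows of the dataset
+head(retrafo.fun(iris[51:56, ], control))
+```
+Because the column selection is stored in `control`, the "Species" column is also dropped from the new rows, regardless of whether it is constant there.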