diff --git a/src/use_case_classification.Rmd b/src/use_case_classification.Rmd
new file mode 100644
index 00000000..2ca33fbb
--- /dev/null
+++ b/src/use_case_classification.Rmd
@@ -0,0 +1,152 @@
+# Classification: Predicting Survival on the Titanic
+
+A popular example of a classification task is the Titanic showcase. Passenger information such as name, age and fare is available, and the aim is to predict which kinds of passengers survived the sinking.
+
+We first load the titanic data set and the other packages needed for this use case.
+
+```{r, warning=FALSE, message=FALSE}
+library(titanic)
+library(mlr)
+library(BBmisc)
+```
+
+
+```{r}
+data = titanic_train
+head(data)
+```
+
+Our aim, as mentioned before, is to predict which passengers survived.
+
+We will work through the following steps:
+
+* preprocessing, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/preproc/index.html) and [here](http://mlr-org.github.io/mlr-tutorial/devel/html/impute/index.html)
+* define a learning task, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/task/index.html)
+* select a learning method, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/learner/index.html)
+* train the model, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/train/index.html)
+* predict the survival chance, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/predict/index.html)
+* validate the model, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/performance/index.html) and [here](http://mlr-org.github.io/mlr-tutorial/devel/html/resample/index.html)
+
+### Preprocessing
+
+First, the categorical columns are converted to factors.
+
+```{r}
+data[, c("Survived", "Pclass", "Sex", "SibSp", "Embarked")] = lapply(data[, c("Survived", "Pclass", "Sex", "SibSp", "Embarked")], as.factor)
+```
+
+Next, columns that are not useful for prediction are dropped.
+
+```{r}
+data = dropNamed(data, c("Cabin", "PassengerId", "Ticket", "Name"))
+```
+
+Finally, missing values are imputed. Age and Fare are numerical variables, so their missing values are replaced by the median. Embarked is a categorical variable whose missing values are encoded as empty strings; these are recoded to NA and imputed with the mode.
+
+```{r}
+data$Embarked[data$Embarked == ""] = NA  # empty strings mark missing values
+data$Embarked = droplevels(data$Embarked)  # remove the now unused "" level
+data = impute(data, cols = list(Age = imputeMedian(), Fare = imputeMedian(), Embarked = imputeMode()))
+data = data$data  # keep only the imputed data frame
+```
+
+### Create a task
+
+Let's first define our learning problem. To do so, we specify the data and the name of the target column we want to predict. Passengers who survived are labelled "1", so we set positive = "1".
+
+```{r}
+task = makeClassifTask(data = data, target = "Survived", positive = "1")
+```
+
+### Define a learner
+
+Next, a classification learner is selected. You can find an overview of all learners [here](http://mlr-org.github.io/mlr-tutorial/devel/html/integrated_learners/index.html).
+
+Since this is a classification problem, we need a learner suited to it. We use a random forest, as this method often works well off the shelf, and set it up to predict class probabilities in addition to class labels.
+
+```{r}
+lrn = makeLearner("classif.randomForest", predict.type = "prob")
+```
+
+### Fit the model
+
+To fit the model and afterwards predict, the data set is split into a training set and a test set. Here the odd-numbered rows form the training set and the even-numbered rows the test set.
+
+```{r}
+n = getTaskSize(task)
+trainSet = seq(1, n, by = 2)  # odd-numbered rows
+testSet = seq(2, n, by = 2)  # even-numbered rows
+```
+
+```{r}
+mod = train(learner = lrn, task = task, subset = trainSet)
+```
+
+### Predict
+
+To predict the survival chance, we call the predict function with the fitted model, the task and the test set generated above.
+
+```{r}
+pred = predict(mod, task, subset = testSet)
+```
+
+The quality of a model's predictions in mlr can be assessed with a number of different performance measures, which are listed [here](http://mlr-org.github.io/mlr-tutorial/devel/html/measures/index.html).
+
+For classification problems, accuracy, false positive rate and true positive rate are commonly used.
+
+```{r}
+performance(pred, measures = list(acc, fpr, tpr))
+```
+
+Another way to check the quality is to visualize the performance measures, which can be done with mlr's generation functions. To learn more about visualization, see [this](http://mlr-org.github.io/mlr-tutorial/devel/html/visualization/index.html) page.
+
+```{r}
+df = generateThreshVsPerfData(pred, list(fpr, tpr, acc))
+plotThreshVsPerf(df)
+plotROCCurves(df)
+```
+### Make predictions for a new data set
+
+As you might have seen, the titanic package also provides a second data set.
+
+```{r}
+test = titanic_test
+head(test)
+```
+
+It does not contain any survival information, but we can now use our fitted model to predict the survival probability for this data set.
+
+The same preprocessing steps as for the "data" data set have to be applied.
+
+```{r}
+test[, c("Pclass", "Sex", "SibSp", "Embarked")] = lapply(test[, c("Pclass", "Sex", "SibSp", "Embarked")], as.factor)
+
+test = dropNamed(test, c("Cabin", "PassengerId", "Ticket", "Name"))
+
+test = impute(test, cols = list(Age = imputeMedian(), Fare = imputeMedian()))
+test = test$data
+
+summarizeColumns(test)
+```
+
+You can use the task and learner that you have already created.
+
+```{r}
+task
+lrn
+```
+
+The training step is different now: we do not use a subset to fit the model, but all of the data.
+
+```{r}
+mod = train(learner = lrn, task = task)
+```
+
+For the prediction part, we use the new test data set.
+
+```{r}
+pred = predict(mod, newdata = test)
+pred
+```
+
+
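As a footnote to the Predict section: the three measures reported there (accuracy, true positive rate, false positive rate) can be derived by hand from the four confusion-matrix counts. The following is a minimal base-R sketch on made-up toy labels, not on Titanic results; the names `truth`, `predicted`, `acc_manual`, `tpr_manual` and `fpr_manual` are illustrative assumptions and are not part of mlr.

```r
# Toy labels, invented for illustration only; "1" is the positive class.
truth     = factor(c(1, 1, 1, 0, 0, 0, 0, 1), levels = c(0, 1))
predicted = factor(c(1, 0, 1, 0, 1, 0, 0, 1), levels = c(0, 1))

# The four confusion-matrix counts.
tp = sum(truth == "1" & predicted == "1")  # true positives
fp = sum(truth == "0" & predicted == "1")  # false positives
tn = sum(truth == "0" & predicted == "0")  # true negatives
fn = sum(truth == "1" & predicted == "0")  # false negatives

# Measures derived from the counts.
acc_manual = (tp + tn) / length(truth)  # share of correct predictions
tpr_manual = tp / (tp + fn)             # share of actual positives that were found
fpr_manual = fp / (fp + tn)             # share of actual negatives flagged as positive

print(c(acc = acc_manual, tpr = tpr_manual, fpr = fpr_manual))
```

On these toy labels the counts are tp = 3, fp = 1, tn = 3 and fn = 1, which gives an accuracy of 0.75, a true positive rate of 0.75 and a false positive rate of 0.25.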