diff --git a/python_scripts/parameter_tuning_grid_search.py b/python_scripts/parameter_tuning_grid_search.py
index 97778efc6..cb908e4b2 100644
--- a/python_scripts/parameter_tuning_grid_search.py
+++ b/python_scripts/parameter_tuning_grid_search.py
@@ -116,77 +116,92 @@
 # %% [markdown]
 # ## Tuning using a grid-search
 #
-# In the previous exercise we used one `for` loop for each hyperparameter to
-# find the best combination over a fixed grid of values. `GridSearchCV` is a
-# scikit-learn class that implements a very similar logic with less repetitive
-# code.
+# In the previous exercise (M3.01) we used two nested `for` loops (one for each
+# hyperparameter) to test different combinations over a fixed grid of
+# hyperparameter values. In each iteration of the loop, we used
+# `cross_val_score` to compute the mean score (averaged across the
+# cross-validation splits), and compared those mean scores to select the best
+# combination. `GridSearchCV` is a scikit-learn class that implements a very
+# similar logic with less repetitive code. The suffix `CV` refers to the
+# cross-validation it runs internally (instead of the `cross_val_score` call
+# we hard-coded).
 #
-# Let's see how to use the `GridSearchCV` estimator for doing such search. Since
-# the grid-search is costly, we only explore the combination learning-rate and
-# the maximum number of nodes.
+# The `GridSearchCV` estimator takes a `param_grid` parameter which defines all
+# hyperparameters and their associated values. The grid-search is in charge of
+# creating all possible combinations and testing them.
+#
+# The number of combinations is equal to the product of the number of values to
+# explore for each parameter. Thus, adding new parameters with their associated
+# values to be explored rapidly becomes computationally expensive. Because of
+# that, here we only explore the combination of the learning-rate and the
+# maximum number of leaf nodes, for a total of 4 x 3 = 12 combinations.
 
-# %%
 # %%time
 from sklearn.model_selection import GridSearchCV
 
 param_grid = {
-    "classifier__learning_rate": (0.01, 0.1, 1, 10),
-    "classifier__max_leaf_nodes": (3, 10, 30),
-}
+    "classifier__learning_rate": (0.01, 0.1, 1, 10),  # 4 possible values
+    "classifier__max_leaf_nodes": (3, 10, 30),  # 3 possible values
+}  # 12 unique combinations
 model_grid_search = GridSearchCV(model, param_grid=param_grid, n_jobs=2, cv=2)
 model_grid_search.fit(data_train, target_train)
 
 # %% [markdown]
-# Finally, we check the accuracy of our model using the test set.
+# You can access the best combination of hyperparameters found by the grid
+# search using the `best_params_` attribute.
 
 # %%
-accuracy = model_grid_search.score(data_test, target_test)
-print(
-    f"The test accuracy score of the grid-searched pipeline is: {accuracy:.2f}"
-)
-
-# %% [markdown]
-# ```{warning}
-# Be aware that the evaluation should normally be performed through
-# cross-validation by providing `model_grid_search` as a model to the
-# `cross_validate` function.
-#
-# Here, we used a single train-test split to evaluate `model_grid_search`. In
-# a future notebook will go into more detail about nested cross-validation, when
-# you use cross-validation both for hyperparameter tuning and model evaluation.
-# ```
+print(f"The best set of parameters is: {model_grid_search.best_params_}")
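+
+# %% [markdown]
+# The grid-search also stores the mean cross-validated score reached by this
+# best combination. A minimal sketch to display it, relying on the documented
+# `best_score_` attribute of `GridSearchCV`, could look as follows:
+
+# %%
+# Mean score over the internal cross-validation splits for `best_params_`
+print(f"The best mean CV score is: {model_grid_search.best_score_:.3f}")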
 
 # %% [markdown]
-# The `GridSearchCV` estimator takes a `param_grid` parameter which defines all
-# hyperparameters and their associated values. The grid-search is in charge
-# of creating all possible combinations and test them.
-#
-# The number of combinations are equal to the product of the number of values to
-# explore for each parameter (e.g. in our example 4 x 3 combinations). Thus,
-# adding new parameters with their associated values to be explored become
-# rapidly computationally expensive.
-#
-# Once the grid-search is fitted, it can be used as any other predictor by
-# calling `predict` and `predict_proba`. Internally, it uses the model with the
+# Once the grid-search is fitted, it can be used as any other estimator, i.e. it
+# has `predict` and `score` methods. Internally, it uses the model with the
 # best parameters found during `fit`.
 #
-# Get predictions for the 5 first samples using the estimator with the best
-# parameters.
+# Let's get the predictions for the first 5 samples using the estimator with
+# the best parameters:
 
 # %%
 model_grid_search.predict(data_test.iloc[0:5])
 
 # %% [markdown]
-# You can know about these parameters by looking at the `best_params_`
-# attribute.
+# Finally, we check the accuracy of our model using the test set.
 
 # %%
-print(f"The best set of parameters is: {model_grid_search.best_params_}")
+accuracy = model_grid_search.score(data_test, target_test)
+print(
+    f"The test accuracy score of the grid-search pipeline is: {accuracy:.2f}"
+)
 
 # %% [markdown]
-# The accuracy and the best parameters of the grid-searched pipeline are similar
+# The accuracy and the best parameters of the grid-search pipeline are similar
 # to the ones we found in the previous exercise, where we searched the best
-# parameters "by hand" through a double for loop.
+# parameters "by hand" through a double `for` loop.
+#
+# ## The need for a validation set
+#
+# In the previous section, the selection of the best hyperparameters was done
+# using the train set coming from the initial train-test split. Then, we
+# evaluated the generalization performance of our tuned model on the left-out
+# test set. This can be shown schematically as follows:
+#
+# ![Cross-validation tuning
+# diagram](../figures/cross_validation_train_test_diagram.png)
+#
+# ```{note}
+# This figure shows the particular case of a **K-fold** cross-validation
+# strategy using `n_splits=5` to further split the train set coming from a
+# train-test split. For each cross-validation split, the procedure trains a
+# model on all the red samples and evaluates the score of a given set of
+# hyperparameters on the green samples. The best combination of
+# hyperparameters `best_params` is selected based on those intermediate
+# scores.
+#
+# Then a final model is refitted using `best_params` on the concatenation of
+# the red and green samples and evaluated on the blue samples.
+#
+# The green samples are sometimes referred to as the **validation set** to
+# differentiate them from the final test set in blue.
+# ```
 #
 # In addition, we can inspect all results which are stored in the attribute
 # `cv_results_` of the grid-search. We filter some specific columns from these