4 changes: 4 additions & 0 deletions .github/workflows/build_jb.yml
@@ -2,6 +2,10 @@ on:
  pull_request:
    branches:
      - '*' # Pull requests to all branches
  push:
    branches:
      - main


jobs:
  build-jupyter-book:
10,145 changes: 9,882 additions & 263 deletions practicals_jn_book/week_4/finalbook.ipynb

Large diffs are not rendered by default.

63 changes: 37 additions & 26 deletions practicals_jn_book/week_4/finalbook.py
@@ -39,7 +39,7 @@
#
# First, import the csv file named 'FIFA_18_basic' as a pandas dataframe and declare it to a variable named 'df'. Next, inspect df using the .head(), .tail(), .describe(), and .info() methods in the IPython console.

# In[4]:
# In[3]:


path = os.getcwd()
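# The rest of this cell is collapsed in the diff. A minimal sketch of what it is
# expected to do, assuming the file is named 'FIFA_18_basic.csv' and pandas is
# imported as pd earlier in the script:
df = pd.read_csv(os.path.join(path, "FIFA_18_basic.csv"))
df.head()      # first five rows
df.tail()      # last five rows
df.describe()  # summary statistics for the numeric columns
df.info()      # dtypes and non-null counts per column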
@@ -71,9 +71,9 @@
#
# First, visualize the distribution of the 'overall' player rating score by plotting a normalized histogram of these scores. Normalized histograms plot the bin-frequencies in percentages rather than in absolute numbers.
#
# One way to do this is by calling the .histplot() function from the *seaborn* library (imported as **sns**). As keyword arguments you will need the 'overall' column of the dataframe, and you will have to specify norm_hist=True. Don't worry about labels, titles or fancy colors, as this is not the goal of today's lesson.
# One way to do this is by calling the .histplot() function from the *seaborn* library (imported as **sns**). As keyword arguments you will need the 'overall' column of the dataframe. Don't worry about labels, titles or fancy colors, as this is not the goal of today's lesson.

# In[3]:
# In[4]:


sns.histplot(df['overall'], color='blue', label='Overall Rating', edgecolor=None, alpha=0.4, kde=True, bins=49)
@@ -91,7 +91,7 @@
#
# Now, plot a pairplot that results in a 7x7 graph of the dataframe. Again, although we are sure there are some genuine Picassos among you, don't let the artist within distract you from the main goal of today's practical. There will be plenty of opportunities to demonstrate your plot customization skills in the final projects.

# In[4]:
# In[5]:


sns.pairplot(df, vars=['pac', 'sho', 'pas', 'dri', 'def', 'phy', 'overall'])
@@ -118,28 +118,28 @@
#
# We will tell you what these functions do later on, but first start with importing them. If you don't know exactly how this works, you can check the code below to see what to do.

# In[5]:
# In[39]:


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


# **Define X and y**
#
# First, we have to create an outcome or target variable called (by convention) `y`. This is the dependent variable we will predict in the regression model. To achieve this, we will extract all values of the column containing the dependent variable ('overall') and put them in a variable we call `y`.

# In[6]:
# In[24]:


y = df['overall'] # single square brackets, thus it is a pd.Series


# Next, we have to create our input variable called (by convention) `X`, which is made up of all the features or independent variables in our dataset. There are multiple ways to do this. The easiest way is by specifying all the column names of the features you want to select (remember double brackets). Declare the features to a variable called `X`.

# In[7]:
# In[25]:


X = df[['pac', 'sho', 'pas', 'dri', 'def', 'phy']] # double square brackets, thus it is a pd.DataFrame
@@ -155,7 +155,7 @@
# So no matter what kind of problem (classification or regression) you are facing, you will always need a training dataset and a test set. Therefore, we will be using the `train_test_split` function of sklearn. This function automatically creates a training and a test dataset by randomly drawing samples from the data. By convention, we again have a pretty standardized way of doing this, and it is pretty much the same for all machine learning problems you will encounter in this course.
# ```

# In[8]:
# In[26]:


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
@@ -209,16 +209,16 @@
#
# 2. Fit the model to your training-data by calling the `fit()` method on linreg. As arguments you will have to specify the `X_train` and `y_train` variables.

# In[21]:
# In[ ]:


linreg = LinearRegression()
linreg.fit(X_train, y_train);
linreg.fit(X_train, y_train)


# Now that we have a trained model, we will have to evaluate the performance of our model on the test set. To do so, we will first predict the values in the test set by calling the `predict()` method on linreg and specifying `X_test` as an argument. What happens here is that we used the trained model to predict `y` ('overall') for every row of data in the test set based on the independent variables of that row. Predict the values in the test set and declare them to a variable called `y_pred`.

# In[22]:
# In[11]:


y_pred = linreg.predict(X_test)
@@ -236,7 +236,7 @@
#
# Now print both the $R^2$ and RMSE to your console. How well did the model perform?

# In[25]:
# In[12]:


R2 = linreg.score(X_test, y_test)
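# The rest of this cell is collapsed in the diff. A minimal sketch of the RMSE
# computation, assuming the variable names used elsewhere in this script:
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)
print('R^2: %f' % R2)
print('RMSE: %f' % RMSE)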
@@ -252,7 +252,7 @@
#
# Construct a string that represents the complete regression equation. What is the most important feature?

# In[28]:
# In[13]:


coefs = linreg.coef_
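# The rest of this cell is collapsed in the diff. One hedged way to build the
# equation string from the fitted model above (the formatting is an assumption,
# not the notebook's actual solution):
intercept = linreg.intercept_
terms = " + ".join(f"{coef:.3f}*{name}" for coef, name in zip(coefs, X.columns))
equation = f"overall = {intercept:.3f} + {terms}"
print(equation)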
@@ -270,7 +270,7 @@
#
# First, instantiate this function by calling `polynomial_2 = PolynomialFeatures(degree=2)`. This will result in the construction of a 2nd-order function (of course, we could also specify degree=3 for a 3rd-order, or even degree=100 if you think that makes sense and have sufficient computing power to actually fit your data on that model).

# In[29]:
# In[14]:


polynomial_2 = PolynomialFeatures(degree=2)
@@ -280,28 +280,39 @@
#
# To achieve this, create a variable X_poly_train by calling `polynomial_2.fit_transform(X_train)`. Do the same for `X_test` and declare it to a variable called `X_poly_test`.

# In[30]:
# In[ ]:


X_poly_train = polynomial_2.fit_transform(X_train)
X_poly_test = polynomial_2.fit_transform(X_test)

print(X_train.shape, X_poly_train.shape)
for idx in range(X_poly_train.shape[1]):
    print(idx, X_poly_train[:, idx].mean().round(2), " +- ", X_poly_train[:, idx].std().round(2))

# What we did here is transform our features with a polynomial function. Now we can fit our linear regression model as if it is a polynomial model, pretty cool right? To do so, just redo step 5 - 7 in similar fashion with different variable names (as you want to compare both models in the end). Have you forgetten the steps already? I will give you a short recap:

# What we did here is transform our features with a polynomial function. Please check and figure out how we went from 6 features to 28 with a poly order of 2 (hint: 1 bias column + 6 original features + 6 squared terms + 15 pairwise products = 28)! Another thing to notice is that the first column of the transformed features is all 1s, because we need an intercept in our regression function. Secondly, you can also see that the polynomial transform increased the range of values in the features: squaring features and multiplying them together produces values that are much larger than the original features. This means that you always need to scale your features **after** applying the polynomial function, so that the features are on the same scale for the model fitting.
#
# Now we can fit our linear regression model as if it is a polynomial model, pretty cool right? To do so, just redo steps 5 - 7 in similar fashion with different variable names (as you want to compare both models in the end). Have you forgotten the steps already? I will give you a short recap:
#
# 1. instantiate a `LinearRegression()` model and declare it a variable called `linreg_poly`
# 2. fit linreg_poly to X_poly_train and y_train.
# 3. predict y using the trained model and declare the predictions to a variable called y_pred_poly
# 4. compute $R^2$ and RMSE for the linreg_poly model and declare them to R2_poly and RMSE_poly.

# In[31]:
# In[40]:


# Scale the polynomial features: fit the scaler on the training set only and reuse it
# for the test set, so no information from the test set leaks into the scaling.
scaler_poly = StandardScaler()
X_poly_train_scaled = scaler_poly.fit_transform(X_poly_train)
X_poly_test_scaled = scaler_poly.transform(X_poly_test)


linreg_poly = LinearRegression()
linreg_poly.fit(X_poly_train, y_train)
y_pred_poly = linreg_poly.predict(X_poly_test)
linreg_poly.fit(X_poly_train_scaled, y_train)
y_pred_poly = linreg_poly.predict(X_poly_test_scaled)

R2_poly = linreg_poly.score(X_poly_test, y_test)
R2_poly = linreg_poly.score(X_poly_test_scaled, y_test)
MSE_poly = mean_squared_error(y_test, y_pred_poly)
RMSE_poly = np.sqrt(MSE_poly)

@@ -311,7 +322,7 @@
# >Finally, it's time for the million-dollar question: which model performs the best? Print the $R^2$ and RMSE scores of both
# >the linear and quadratic regression model to the console and find out for yourself.

# In[33]:
# In[44]:


get_ipython().run_cell_magic('capture', '', "print('Linear Regression Model: \\n')\nprint('R^2: %f' %R2)\nprint('RMSE: %f' %RMSE)\nprint('----------')\nprint('Quadratic Regression Model: \\n')\nprint('R^2: %f' %R2_poly)\nprint('RMSE: %f' %RMSE_poly)\n")
@@ -327,7 +338,7 @@
#
# Start with reading the sklearn [documentation](https://scikit-learn.org/stable/modules/linear_model.html) to understand what the different types of regression entail:

# In[ ]:
# In[18]:


df = pd.read_csv(os.path.join(path, "FIFA_18_complete.csv"), index_col=0)
@@ -345,7 +356,7 @@
# You are not allowed to use someone's wage or release clause, nor use preferred position variables, images of a player's head or anything similar. Focus on someone's performance attributes.
# ```

# In[37]:
# In[19]:


df.head()
@@ -355,7 +366,7 @@
df.corr(numeric_only=True)


# In[38]:
# In[20]:


X_cols = ["special", "age", "overall", "potential", "pac",
@@ -365,7 +376,7 @@
y_col = ["eur_value"]


# In[39]:
# In[21]:


from sklearn.linear_model import Lasso, Ridge
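# The rest of this cell is collapsed in the diff. A minimal sketch of how Lasso and
# Ridge could be fitted on the features defined above (the alpha values, the split
# and the lack of scaling are assumptions, not the notebook's actual solution):
X = df[X_cols]
y = df[y_col].values.ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for name, model in [("Lasso", Lasso(alpha=1.0)), ("Ridge", Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    print(name, "test R^2:", model.score(X_test, y_test))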
12 changes: 6 additions & 6 deletions practicals_jn_book/week_5/finalbook.ipynb
@@ -560,7 +560,7 @@
"\n",
"Alternatively we can use label encoding. This is also part of the pandas library and will be our method of choice. To encode the patient labels stored in `y`, we have to do a couple of things. Remember the dtype of the `'class'` column in `df`? If not check back the inspection you did in step 1. Label encoding only works with categorical data, so we have to change the dtype of `y` to categorical. \n",
"\n",
"First call `.astype('category')` on `y` to change the dtype. Make sure to declare the result back to `y`. Next, use `y.cat.codes` to change the labels in `y` to numbers. Again, make to sure to declare the result back to `y`. Now print `y` to your console, did you succeed? What number represents `'Abnormal'` patients? And what number `'Normal'` patients. "
"First call `.astype('category')` on `y` to change the dtype. Make sure to declare the result back to `y`. Next, use `y.cat.codes` to change the labels in `y` to numbers. Again, make sure to declare the result back to `y`. Now print `y` to your console, did you succeed? What number represents `'Abnormal'` patients? And what number `'Normal'` patients. "
]
},
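A minimal sketch of the label-encoding step described above, assuming `y` holds the `'class'` column:

```python
y = y.astype('category')  # change the dtype to categorical
y = y.cat.codes           # map each category to an integer code
print(y.head())           # check which number represents 'Abnormal'
```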
{
@@ -593,7 +593,7 @@
"\n",
"in which $w_{1}$ and $w_{2}$ are coefficients of features $X_{1}$ and $X_{1}$ respectively, and $c$ is the model constant or intercept. As sport & movement scientists, not only are we interested in the accuracy of such a model, but also in the importance of individual features. How much does the `'sacral_slope'` for example contribute to the patient being `'Normal'` or `'Abnormal'`? As the features are measured on different scales, differences in coefficients do not neccesarily mean anything to us. \n",
"\n",
"Even if you are not interested in the individual weights (which is a questionable decision), scaling could be crucial. As you have learned, machine learning models often optimize using gradient descent. Gradient descent tries to find the lowest location/cost in a multidimensional space. If they variables are on a different scale, the multidimensional space changes, which makes the model optimize for the variables with the biggest range, which in turn can lead to suboptimal performance of the model. \n",
"Even if you are not interested in the individual weights (which is a questionable decision), scaling could be crucial. As you have learned, machine learning models often optimize using gradient descent. Gradient descent tries to find the lowest location/cost in a multidimensional space. If the variables are on a different scale, the multidimensional space changes, which makes the model optimize for the variables with the biggest range, which in turn can lead to suboptimal performance of the model. \n",
"\n",
"Therefore, we have to scale all the features to the same scale (note that: first of all, not all ML models require scaling, decision tree classification models & for example deliver interpretable results without scaling. Second of all, if you do not care about coefficients but only about accuracy, scaling is not really neccesary either. It is however best practice to do so). In this case, we will be using Logistic Regression, KNN **&** Decision Trees, for Logistic Regression and KNN we will use the scaled features, for Decision Trees the unscaled features.\n",
"\n",
Expand All @@ -619,7 +619,7 @@
"\n",
"Create variables called X_train, X_test, y_train, y_test by using the train_test_split function with X, y as arguments, a test_size of 30% and a random_state. Furthermore specify `stratify=y` to make sure `'Abnormal'` patients are present equally in both test and training set. \n",
"\n",
"Furthermore, Create variables called X_scaled_train, X_scaled_test, y_train, y_test by using the train_test_split function with X_scaled, y as arguments, and again a test_size of 30% and a random_state. We did not really have to create y_train and y_test again, as they are the same both time. However, as this is the output of the train_test_split function it is easiest to just save the same result again. "
"Furthermore, create variables called X_scaled_train, X_scaled_test, y_train, y_test in the right order by using the StandardScaler function."
]
},
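A minimal sketch of the split-then-scale workflow described above (the random_state value is an assumption):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler()
X_scaled_train = scaler.fit_transform(X_train)  # fit the scaler on the training set only
X_scaled_test = scaler.transform(X_test)        # reuse those parameters for the test set
```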
{
@@ -656,7 +656,7 @@
"\n",
"### Assignment 4: Model fitting\n",
"\n",
"Finally we came to the core of machine learning: Fitting a model to your data. Remember the common workflow from last week:\n",
"Finally, we are at the core of machine learning: Fitting a model to your data. Remember the common workflow from last week:\n",
"\n",
"```{note}\n",
"1. instantiate the model function (i.e. `LogisticRegression()`) and declare it a convenient variable (i.e. `log_reg`). Specify the neccesary keyword arguments here. \n",
@@ -823,7 +823,7 @@
"\n",
"Since `cross_validate` creates your train-test splits, you can not prevent data leakage with this function if you need to scale your data. If you include `X_scaled` in the `cross_validate` function, it has already learned based on the scaling parameters, and will split the data afterwards within the function. Although the functions of sklearn are often great, you lose the possibility to make small changes, like scaling after creating the train test split. \n",
"\n",
"Create your own function `scaled_cross_validation` that makes it possible to add models that need scaling, without any data leakages. Do not forget to create some nice docstrings in your function! Use your function to print the performance of the KNN and the decision tree classifiers.\n",
"Create your own function `scaled_cross_validation` that makes it possible to add models that need scaling, without any data leakages. Do not forget to create some nice docstrings in your function! Use your function to print the performance of the KNN and the logistic regression.\n",
"\n",
":::: {tip}\n",
"Use the `StratifiedKFold.split` and loop over each stratified fold to get the train and test indexes! \n",
@@ -1025,7 +1025,7 @@
"\n",
"For every model declare the results to 3 variables like the example below:\n",
"```python\n",
"tpr_logreg, tpr_logreg, threshold_log_reg = roc_curve(y_test, probs_log_reg[:,1])\n",
"fpr_logreg, tpr_logreg, threshold_log_reg = roc_curve(y_test, probs_log_reg[:,1])\n",
"```\n",
"\n",
"you specify `[:,1]` because we only need column 1 from the probability array. The first index (`0`) gives the probability for the label 0, in our case Normal, and the second index (`1`) the probability for the label 1 or Abnormal."