This project implements a neural network from scratch using NumPy to predict the probability that the home team wins an NFL game, given a snapshot of the game state.
No PyTorch, TensorFlow, or autograd is used; all computation, including backpropagation, is implemented directly with NumPy.
- Implementing a neural network from scratch
- Understanding forward and backward propagation
- Implementing gradient descent
- Preventing data leakage in datasets
- Building a complete machine learning pipeline
- How predictions connect to loss
- How gradients are computed via the chain rule
- How gradient descent actually updates parameters
- How data leakage happens and how to prevent it
- How to validate backprop with numerical gradient checking
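The last point above, numerical gradient checking, is simple enough to sketch up front: perturb each parameter by a small `eps` in both directions and compare the central-difference slope of the loss to the analytic gradient. A minimal helper (the name `numerical_grad` is ours, not from this codebase):

```python
import numpy as np

def numerical_grad(f, W, eps=1e-5):
    # central differences: (f(w + eps) - f(w - eps)) / (2 * eps) per entry
    g = np.zeros_like(W)
    for i in np.ndindex(W.shape):
        old = W[i]
        W[i] = old + eps
        loss_plus = f()
        W[i] = old - eps
        loss_minus = f()
        W[i] = old  # restore before moving on
        g[i] = (loss_plus - loss_minus) / (2 * eps)
    return g

# sanity check on a function with a known gradient: d/dW sum(W^2) = 2W
W = np.arange(6, dtype=float).reshape(2, 3)
g = numerical_grad(lambda: np.sum(W**2), W)
```

If the analytic gradients from backprop agree with this (up to roughly `eps**2` error), the chain-rule code is almost certainly correct.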
- Task: Predict the probability that the home team wins the game.
- Input: A snapshot of the game at a single moment in time.
- Output: A probability in the range [0, 1].
- Label: home_win = 1 if the home team eventually wins the game, otherwise 0.
The dataset is derived from NFL play-by-play data sourced from nflfastR via nflreadr.
Each row in the final dataset represents a game state snapshot from which we predict.
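Because every game contributes many snapshots, rows from the same game are strongly correlated, and a naive row-level random split would put snapshots of the same game in both train and test, leaking the outcome. One safeguard, sketched here with a hypothetical `game_ids` array, is to split at the game level:

```python
import numpy as np

def split_by_game(game_ids, test_frac=0.2, seed=0):
    # hold out whole games so no game's snapshots span train and test
    rng = np.random.default_rng(seed)
    games = np.unique(game_ids)
    rng.shuffle(games)
    n_test = max(1, int(len(games) * test_frac))
    test_games = games[:n_test]
    test_mask = np.isin(game_ids, test_games)
    return ~test_mask, test_mask

# toy ids: five games, two snapshots each
game_ids = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
train_mask, test_mask = split_by_game(game_ids)
```

The same idea applies to a validation split: always partition by game identifier, never by individual rows.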
| Feature | Description |
|---|---|
| score_diff | home_score minus away_score |
| seconds_remaining | Seconds remaining in the game |
| quarter | Game quarter (1 to 4) |
| down | Down (1 to 4) |
| yards_to_go | Yards needed for a first down |
| yardline_100 | Distance to the opponent's end zone, in yards |
| possession_is_home | 1 if home team has possession, else 0 |
| Label | Description |
|---|---|
| home_win | 1 if the home team wins the game, else 0 |
This project uses a single hidden layer neural network.
```
X (N, D)
  -> z1 = X @ W1 + b1
  -> h = tanh(z1)
  -> z2 = h @ W2 + b2
  -> yhat = sigmoid(z2)
  -> loss = binary_cross_entropy(yhat, y)
```
Where:
- N = number of samples
- D = number of input features
- W1, b1 = weights and biases for hidden layer
- W2, b2 = weights and biases for output layer
- tanh = hyperbolic tangent activation function
- sigmoid = logistic sigmoid activation function
- binary_cross_entropy = loss function for binary classification
- tanh introduces nonlinearity and has a simple derivative
- sigmoid maps logits to probabilities
- binary cross entropy pairs naturally with sigmoid
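The forward pass above translates almost line for line into NumPy. A minimal sketch with toy shapes (D = 7 matches the feature table; the hidden width H = 16 and the weight scale are arbitrary choices, not taken from this project):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1    # hidden pre-activation, shape (N, H)
    h = np.tanh(z1)     # hidden activation in (-1, 1)
    z2 = h @ W2 + b2    # output logit, shape (N, 1)
    yhat = sigmoid(z2)  # win probability in (0, 1)
    return h, yhat

rng = np.random.default_rng(0)
N, D, H = 5, 7, 16
X = rng.normal(size=(N, D))
W1 = 0.1 * rng.normal(size=(D, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.normal(size=(H, 1)); b2 = np.zeros(1)
h, yhat = forward(X, W1, b1, W2, b2)
```

Returning `h` alongside `yhat` matters: the backward pass reuses the hidden activations rather than recomputing them.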
Binary cross entropy:
L = -mean(y * log(yhat) + (1 - y) * log(1 - yhat))
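A direct implementation of this formula; the only subtlety is clipping `yhat` away from exactly 0 and 1 so the logs stay finite (the `eps` value is an arbitrary choice):

```python
import numpy as np

def binary_cross_entropy(yhat, y, eps=1e-12):
    # clip so log never sees exactly 0 or 1
    yhat = np.clip(yhat, eps, 1.0 - eps)
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

# example: confident correct predictions give a small loss
y = np.array([1.0, 0.0, 1.0])
loss = binary_cross_entropy(np.array([0.9, 0.1, 0.8]), y)
```

A useful reference point: predicting 0.5 for every sample gives a loss of log(2) ≈ 0.693, so any trained model should do better than that.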
Backpropagation is implemented manually using the chain rule.
Key identity used: dL/dz2 = (yhat - y) / N
From there:
- gradients flow backward to W2 and b2
- then through tanh using (1 - h²)
- then to W1 and b1
Every gradient is computed explicitly.
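The chain above can be sketched end to end: one forward pass, every gradient written out explicitly, then a single gradient-descent update. The shapes (H = 8), learning rate, and random toy data are assumptions for illustration, not values from this project:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(yhat, y, eps=1e-12):
    yhat = np.clip(yhat, eps, 1.0 - eps)
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

rng = np.random.default_rng(0)
N, D, H = 32, 7, 8
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=(N, 1)).astype(float)
W1 = 0.1 * rng.normal(size=(D, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.normal(size=(H, 1)); b2 = np.zeros(1)

# forward
h = np.tanh(X @ W1 + b1)
yhat = sigmoid(h @ W2 + b2)
loss_before = bce(yhat, y)

# backward: every gradient via the chain rule
dz2 = (yhat - y) / N        # the sigmoid + BCE identity
dW2 = h.T @ dz2
db2 = dz2.sum(axis=0)
dh = dz2 @ W2.T
dz1 = dh * (1 - h**2)       # tanh'(z1) = 1 - tanh(z1)^2
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0)

# one gradient-descent step (lr is an arbitrary choice)
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2

h = np.tanh(X @ W1 + b1)
loss_after = bce(sigmoid(h @ W2 + b2), y)
```

Each gradient has the same shape as the parameter it updates, which makes the `-= lr * grad` step a plain elementwise operation.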
Here is the learning curve showing training and validation loss over epochs:
