Predicting second‑hand car prices with classic tabular ML.
Data: 402 006 rows · 12 columns (target = `price`).
Models: Linear Regression · Random Forest · Gradient Boosting · Voting Ensemble.
- Project motivation
- Data
- Quick start
- Notebook & code guide
- Results at a glance
- Model interpretation
- Directory layout
Buying a used car is a price‑sensitive decision.
The goal is to build transparent, reproducible baselines that predict price
given mileage, age, fuel type and a handful of categorical descriptors.
Grades in the coursework are not the focus; clean code and solid discussion are.
- Source: AutoTrader extract supplied by Manchester Metropolitan University. The licence prohibits redistribution, so the CSV is not committed to this repository.
- Rows: 402 006. Columns: 12 (all except `price` are used as predictors).
- Cleaning steps
  - Trim outliers in `mileage` and `price` via 1.5 × IQR.
  - Drop cars registered before 1975.
  - Mode‑impute gaps in `fuel_type`, `body_type`, `standard_colour`.
- Engineered features
  - `vehicle_age` = `2024 - year_of_registration`
  - `mileage_to_age_ratio` = `mileage / vehicle_age`

See `notebooks/01_autotrader_walkthrough.ipynb` for the exact code.
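The cleaning and feature steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's code: the helper names `trim_iqr` and `clean_and_engineer`, the toy rows, and the handling of zero‑year‑old cars are assumptions made for the example.

```python
import pandas as pd

REFERENCE_YEAR = 2024  # taken from the vehicle_age definition above


def trim_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Rows with a missing value in `column` fail the test and are dropped too.
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]


def clean_and_engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the listed steps in the listed order."""
    out = df.copy()
    for col in ("mileage", "price"):
        out = trim_iqr(out, col)
    # Assumption: rows with a missing registration year are dropped as well.
    out = out[out["year_of_registration"] >= 1975].copy()
    for col in ("fuel_type", "body_type", "standard_colour"):
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    out["vehicle_age"] = REFERENCE_YEAR - out["year_of_registration"]
    # Assumption: an age of 0 would divide by zero, so the ratio is left missing.
    age = out["vehicle_age"].where(out["vehicle_age"] > 0)
    out["mileage_to_age_ratio"] = out["mileage"] / age
    return out


# Made-up rows for illustration only; they are not from the AutoTrader extract.
toy = pd.DataFrame({
    "mileage": [10_000, 20_000, 30_000, 40_000, 50_000, 1_000_000],
    "price": [5_000, 6_000, 7_000, 8_000, 9_000, 7_500],
    "year_of_registration": [2015, 2016, 2017, 2018, 1970, 2019],
    "fuel_type": ["Petrol", "Petrol", None, "Diesel", "Petrol", "Petrol"],
    "body_type": ["Hatchback", None, "Hatchback", "SUV", "SUV", "SUV"],
    "standard_colour": ["Black", "Black", "Blue", None, "Red", "Red"],
})
cleaned = clean_and_engineer(toy)
print(cleaned[["mileage", "vehicle_age", "mileage_to_age_ratio", "fuel_type"]])
```

On the toy frame, the 1 000 000‑mile row falls outside the IQR fence and the 1970 car is dropped by the year filter, leaving four rows with the gaps filled by each column's mode.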
# clone repo
git clone https://github.com/hamzahassan9320/autotrader-price-regression.git
cd autotrader-price-regression
# place the CSV in the expected location
mkdir -p data
cp /path/to/Adverts.csv data/
# set up environment
conda create -n autotrader-price python=3.10
conda activate autotrader-price
pip install -r requirements.txt
# full pipeline
python -m src.train --csv data/Adverts.csv
# run the Streamlit app locally
streamlit run app.py

Tested with Python 3.10 and scikit‑learn 1.3.2.
| file | purpose |
|---|---|
| notebooks/01_autotrader_walkthrough.ipynb | data snapshot, EDA, demos |
| src/data.py | load + cleanse CSV |
| src/features.py | feature engineering & preprocessing |
| src/models.py | pipelines · param grids · grid‑search helper |
| src/train.py | one‑shot CLI training run; saves models & plots |
| src/visualise.py | regenerates figures in docs/images/ |
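To give a feel for what "pipelines · param grids · grid‑search helper" can look like in scikit‑learn, here is a small sketch on synthetic data. It is not the contents of `src/models.py`: the helper `build_pipeline`, the column lists, the grid values, and the choice of ensemble members are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["mileage", "vehicle_age"]  # assumed numeric predictors
CATEGORICAL = ["fuel_type"]           # assumed categorical predictor


def build_pipeline(regressor) -> Pipeline:
    """Hypothetical helper: scale numerics, one-hot categoricals, then fit."""
    pre = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([("pre", pre), ("model", regressor)])


# Synthetic adverts, generated only so the sketch runs end to end.
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "mileage": rng.uniform(1_000, 150_000, n),
    "vehicle_age": rng.integers(1, 20, n),
    "fuel_type": rng.choice(["Petrol", "Diesel"], n),
})
y = 20_000 - 800 * X["vehicle_age"] - 0.05 * X["mileage"] + rng.normal(0, 500, n)

# Grid search over an illustrative Random Forest grid, scored by MAE.
search = GridSearchCV(
    build_pipeline(RandomForestRegressor(random_state=0)),
    param_grid={"model__n_estimators": [10, 30], "model__max_depth": [3, None]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)

# Voting ensemble that averages the three base models' predictions.
voter = VotingRegressor([
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ("gb", GradientBoostingRegressor(n_estimators=20, random_state=0)),
])
ensemble = build_pipeline(voter).fit(X, y)
print(ensemble.predict(X.head(3)))
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are refit on each training fold during the search, which avoids leaking validation data into the transforms.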
| model | CV MAE ↓ | Test R² |
|---|---|---|
| Linear Regression | 1 642 ± 394 | 0.79 |
| Random Forest | 1 831 ± 51 | 0.90 |
| Gradient Boosting | 2 742 ± 95 | 0.87 |
| Voting Ensemble | 1 894 ± 44 | 0.89 |
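Random Forest gives the highest test R² (0.90) and a stable CV MAE (± 51), without visible over‑fit. Linear Regression has the lowest mean CV MAE (1 642), but its spread is far wider (± 394) and its test R² is the lowest (0.79).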
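For readers less familiar with the two metrics in the table, the standard definitions can be written out in a few lines of plain Python. The prices and fold scores below are made up, and summarising folds as mean ± population standard deviation is an assumption about how the "±" column might be produced, not a statement of what `src/train.py` does.

```python
from math import sqrt


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted prices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices, for illustration only.
y_true = [5_000.0, 7_000.0, 9_000.0, 11_000.0]
y_pred = [5_500.0, 6_500.0, 9_500.0, 10_000.0]

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MAE = {mae:.0f}, R² = {r2:.4f}")

# Hypothetical per-fold MAEs, summarised as mean ± population std.
fold_maes = [600.0, 650.0, 625.0]
fold_mean = sum(fold_maes) / len(fold_maes)
fold_std = sqrt(sum((m - fold_mean) ** 2 for m in fold_maes) / len(fold_maes))
print(f"CV MAE = {fold_mean:.0f} ± {fold_std:.0f}")
```

MAE is in the same units as the target, so a lower value is better, while R² is unitless and higher is better; the two can rank models differently, as the table shows.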
- SHAP beeswarm → global drivers (top features: `vehicle_age`, `mileage`).
- SHAP waterfall → why a single advert (row 39) is priced ± £9 k.
- Partial dependence → price drops near‑linearly with age; flattening after ~15 yrs hints at a market floor.
All figures live in docs/images/, regenerated by src/visualise.py.
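The idea behind a one‑feature partial dependence curve can be shown by hand: fix the feature at each grid value for every row, predict, and average. The sketch below uses a made‑up pricing function with a built‑in floor, not the trained models, and the names `partial_dependence_1d` and `toy_price_model` are hypothetical.

```python
import numpy as np


def partial_dependence_1d(predict, X, feature_idx, grid):
    """Average prediction with one feature forced to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for all rows
        averages.append(predict(X_mod).mean())
    return np.array(averages)


def toy_price_model(X):
    """Made-up model: linear depreciation with age, floored at 5 000."""
    age, mileage = X[:, 0], X[:, 1]
    return np.maximum(20_000 - 1_000 * age, 5_000) - 0.02 * mileage


# Columns: vehicle_age, mileage (illustrative rows only).
X = np.array([[3, 30_000], [8, 80_000], [12, 120_000]])
age_grid = np.array([0, 5, 10, 15, 20])

curve = partial_dependence_1d(toy_price_model, X, feature_idx=0, grid=age_grid)
for age, price in zip(age_grid, curve):
    print(f"age {age:>2} yrs -> mean predicted price {price:,.0f}")
```

With this toy model the curve falls by the same amount between each pair of grid points up to 15 years and is flat afterwards, which is the shape the bullet above describes as a market floor.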
.
├── data/ # <empty> – you add Adverts.csv locally
├── notebooks/ # single exploratory notebook
├── src/ # reusable code
├── configs/ # YAML config(s)
├── docs/images/ # plots for README
└── requirements.txt
