This project predicts house sale prices using machine learning on the Kaggle House Prices dataset. We implemented Linear Regression as a baseline model and then improved performance with a Random Forest Regressor.
The goal is to understand the factors affecting house prices and build a model that achieves low prediction error (measured with RMSE).
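
For reference, RMSE is conventionally defined as below. The notation is an assumption introduced here: $y_i$ is the actual sale price of house $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of houses evaluated.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

Because the errors are squared before averaging, large misses are penalised more heavily than small ones, and the result is expressed in the same units as SalePrice.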
- Source: Kaggle – House Prices: Advanced Regression Techniques
- Size: ~1,460 rows, 80 features
- Features:
  - Numerical: LotArea, OverallQual, YearBuilt, GrLivArea, etc.
  - Categorical: Neighborhood, HouseStyle, RoofStyle, etc.
- Target: SalePrice (continuous variable)
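
As a rough illustration of the numerical/categorical distinction, the sketch below separates columns by dtype. The helper name `split_feature_types` and the three sample rows are hypothetical and exist only for this example; in practice the frame would be loaded from the Kaggle CSV. Column names follow the dataset's own spelling.

```python
import pandas as pd


def split_feature_types(df: pd.DataFrame, target: str = "SalePrice"):
    """Return (numerical, categorical) column name lists, excluding the target.

    Hypothetical helper: it relies purely on pandas dtypes, so numeric codes
    that are really categories would still be reported as numerical.
    """
    features = df.drop(columns=[target])
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


# Illustrative rows only, not a substitute for the real dataset
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "OverallQual": [7, 6, 7],
    "YearBuilt": [2003, 1976, 2001],
    "GrLivArea": [1710, 1262, 1786],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "HouseStyle": ["2Story", "1Story", "2Story"],
    "RoofStyle": ["Gable", "Gable", "Gable"],
    "SalePrice": [208500, 181500, 223500],
})

numerical, categorical = split_feature_types(sample)
print("Numerical:", numerical)
print("Categorical:", categorical)
```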
- Visualised distributions (histograms, boxplots, scatterplots).
- Checked correlations between features and target.
- Handled missing values.
- One-hot encoded categorical variables.
- Split data into training & testing sets.
- Created new features (e.g., house age, total square footage).
- Removed/combined less useful variables.
- Baseline model: Linear Regression
- Advanced model: Random Forest Regressor
- Metric: Root Mean Squared Error (RMSE)
- Compared Linear Regression vs. Random Forest.
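
The preparation steps in the list above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `prepare`, the median / `"Missing"` imputation strategy, the engineered column names `HouseAge` and `TotalSF`, and their formulas are all assumptions, and the demo rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare(df: pd.DataFrame, target: str = "SalePrice"):
    """Impute, engineer features, and one-hot encode (assumed strategy)."""
    df = df.copy()
    y = df.pop(target)

    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.columns.difference(num_cols)

    # Assumed imputation: median for numbers, a "Missing" label for categories
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    df[cat_cols] = df[cat_cols].fillna("Missing")

    # Assumed definitions of the engineered features
    df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

    # One-hot encode the categorical columns
    X = pd.get_dummies(df, columns=list(cat_cols))
    return X, y


# Made-up demo rows with a couple of missing values
demo = pd.DataFrame({
    "YrSold": [2008, 2007, 2008, 2006],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "TotalBsmtSF": [856.0, np.nan, 920.0, 756.0],
    "1stFlrSF": [856, 1262, 920, 961],
    "2ndFlrSF": [854, 0, 866, 756],
    "Neighborhood": ["CollgCr", "Veenker", None, "Crawfor"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

X, y = prepare(demo)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X.columns.tolist())
```

One caveat about this sketch: it computes the imputation median on the full frame for brevity. Fitting imputation on the training split only would avoid leaking information from the test set.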
| Metric | Linear Regression | Random Forest |
|---|---|---|
| RMSE | 77699.05 | 28681.92 |
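Random Forest achieved a much lower RMSE than Linear Regression (about 28,700 versus 77,700, roughly 63% lower), meaning its predictions are, on average, considerably closer to the actual sale prices.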
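
A comparison of this kind could be set up along the following lines. This is a sketch under stated assumptions: `rmse` and `compare_models` are hypothetical helper names, the 80/20 split and `n_estimators=100` are illustrative choices, and the data is synthetic, so the printed scores will not match the figures reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def compare_models(X, y, random_state: int = 42) -> dict:
    """Fit both models on one train split and score them on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(
            n_estimators=100, random_state=random_state
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = rmse(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for the prepared feature matrix and SalePrice
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 50_000 + 20_000 * X[:, 0] + 5_000 * X[:, 1] ** 2 + rng.normal(0, 5_000, size=300)

scores = compare_models(X, y)
for name, score in scores.items():
    print(f"{name}: RMSE = {score:,.2f}")
```

Scoring both models on the same held-out split keeps the comparison like-for-like. Cross-validation would give a more stable estimate than a single split.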
- Python
- Pandas, NumPy
- Matplotlib, Seaborn
- Scikit-learn