This project aims to predict the salary of employees based on their years of experience using supervised machine learning techniques. Two regression models, Linear Regression and Decision Tree Regression, are implemented and compared.
- Dataset Name: `salary_data.csv`
- Columns:
  - `employee_id`: Unique identifier for each employee.
  - `experience_years`: Years of experience of the employee.
  - `salary`: Salary of the employee.
- Dataset is read and loaded using `pandas`.
- Initial exploration is done with `data.head()`, `data.info()`, and `data.describe()`.
- A scatter plot visualizes the relationship between `experience_years` and `salary` using `seaborn` and `matplotlib`.
- Checked for duplicates and removed them.
- Checked for missing values (none found).
- Dataset was split into predictors (`X`) and target (`y`).
- Data was further split into training and testing sets (see the sketch after this list):
  - Train/Test Ratio: 75/25
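A minimal sketch of these loading and preprocessing steps, assuming the column names listed above; the `random_state` value is an assumption added for reproducibility:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load the dataset (assumes salary_data.csv is in the working directory)
data = pd.read_csv("salary_data.csv")

# Initial exploration
print(data.head())
data.info()
print(data.describe())

# Visualize the relationship between experience and salary
sns.scatterplot(data=data, x="experience_years", y="salary")
plt.show()

# Drop duplicate rows and check for missing values
data = data.drop_duplicates()
print(data.isnull().sum())

# Predictors (X) and target (y), then a 75/25 train/test split
X = data[["experience_years"]]
y = data["salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```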
- Fitted a linear regression model using `sklearn.linear_model.LinearRegression` (see the sketch after this list).
- Plotted actual vs. predicted values.
- Evaluated using:
- Mean Squared Error (MSE)
- R-squared (R²) score
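Continuing from the sketch above, fitting and evaluating the linear model might look like this (the plotting style is illustrative):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit the model on the training split
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# The fitted line is y = intercept_ + coef_[0] * x
print(f"y = {lin_reg.intercept_:.3f} + {lin_reg.coef_[0]:.3f} * x")

# MSE and R² on both splits
for name, X_part, y_part in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = lin_reg.predict(X_part)
    print(name, "MSE:", mean_squared_error(y_part, pred), "R²:", r2_score(y_part, pred))

# Actual vs. predicted values on the test set
plt.scatter(X_test, y_test, label="Actual")
plt.plot(X_test, lin_reg.predict(X_test), color="red", label="Predicted")
plt.legend()
plt.show()
```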
- Fitted a decision tree regressor using `sklearn.tree.DecisionTreeRegressor` (see the sketch after this list).
- Plotted actual vs. predicted values.
- Evaluated using:
- Mean Squared Error (MSE)
- R-squared (R²) score
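The decision tree follows the same pattern; it reuses the split and metrics from the sketches above, and the `random_state` value is again an assumption:

```python
from sklearn.tree import DecisionTreeRegressor

# Fit the decision tree regressor (default depth; random_state for reproducibility)
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

# MSE and R² on both splits
for name, X_part, y_part in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = tree_reg.predict(X_part)
    print(name, "MSE:", mean_squared_error(y_part, pred), "R²:", r2_score(y_part, pred))

# Actual vs. predicted values on the test set
plt.scatter(X_test, y_test, label="Actual")
plt.scatter(X_test, tree_reg.predict(X_test), color="red", label="Predicted")
plt.legend()
plt.show()
```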
- Model Equation (Linear Regression): `y = 1641.366 + 103.197 * x`
- Evaluation Metrics (Linear Regression):
- Train MSE: 107699.85
- Test MSE: 128111.12
- Train R²: 0.77
- Test R²: 0.63
- Evaluation Metrics (Decision Tree Regression):
- Train MSE: 88.12
- Test MSE: 128311.56
- Train R²: 1.00
- Test R²: 0.61
- Python 3.8+
- Libraries: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`
- Install required libraries:
pip install pandas numpy matplotlib seaborn scikit-learn
- Ensure the `salary_data.csv` file is in the working directory.
- Run the script in your Python environment.
- There is a strong positive correlation between years of experience and salary.
- Linear regression provides a simpler model; the decision tree fits the training data more precisely but does not outperform it on the test set.
- Decision trees overfit the training data but show similar performance to linear regression on the test data.
- Use additional features to improve model performance.
- Experiment with more advanced regression models like Random Forest or Gradient Boosting.
- Perform hyperparameter tuning for Decision Tree to reduce overfitting.
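As a starting point for that tuning, a simple grid search over tree depth and leaf size could look like the following; the parameter grid values are illustrative, not taken from the project:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Search over tree depth and leaf size to rein in overfitting
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 2, 5, 10]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42), param_grid, cv=5, scoring="r2"
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test R²:", search.best_estimator_.score(X_test, y_test))
```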
For any queries or contributions, please reach out at: bimadev06@gmail.com.
