This repository contains a collection of practical data science projects covering the complete machine learning workflow, including:
- 🧹 Data Cleaning
- 🧪 Data Preprocessing
- 📊 Data Visualization
- 🏗️ Model Building & Compilation
- 📈 Model Training & Evaluation
These projects aim to help learners and practitioners understand each phase of working with data and machine learning models.
1. Data Cleaning

Objective:
Clean and standardize a raw dataset containing missing values, duplicates, incorrect data types, and inconsistent formatting.
Techniques Used:
- Handling missing data (mean, median, drop)
- Removing duplicates
- Converting data types
- String formatting and trimming
- Date and time conversion
Tools: pandas, numpy
📁 File: data_cleaning.py
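The cleaning steps above can be sketched with pandas. The sample data and column names (`name`, `age`, `signup_date`) are hypothetical, chosen only to exhibit each issue listed:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data showing the issues listed above:
# duplicates, a numeric column stored as strings, missing values,
# untrimmed/inconsistent strings, and dates stored as text
df = pd.DataFrame({
    "name": [" Alice ", "bob", "bob", None],
    "age": ["34", "29", "29", np.nan],
    "signup_date": ["2023-01-05", "2023-02-10", "2023-02-10", "2023-03-01"],
})

df = df.drop_duplicates()                         # remove duplicate rows
df["age"] = pd.to_numeric(df["age"])              # fix incorrect data type
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
df["name"] = df["name"].str.strip().str.title()   # trim and standardize strings
df["signup_date"] = pd.to_datetime(df["signup_date"])  # date conversion
```

After these steps the duplicate row is gone, `age` is numeric with the missing value imputed, and `signup_date` has a proper datetime dtype.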
2. Data Preprocessing

Objective:
Prepare cleaned data for machine learning algorithms by transforming features and labels.
Techniques Used:
- Feature scaling (StandardScaler, MinMaxScaler)
- Encoding categorical variables (OneHotEncoder, LabelEncoder)
- Train-test split
- Data balancing (optional: SMOTE)
Tools: pandas, scikit-learn, numpy
3. Data Visualization

Objective:
Explore the dataset visually to identify patterns, relationships, and anomalies.
Techniques Used:
- Histograms, box plots, scatter plots
- Correlation heatmaps
- Pair plots
- Class distribution graphs
Tools: matplotlib, seaborn, pandas
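The visual-exploration techniques above can be sketched with pandas and matplotlib alone (seaborn offers one-line equivalents such as `sns.heatmap` and `sns.pairplot`). The dataset here is synthetic and purely illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic dataset for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 8, 200),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(df["height"], bins=20)               # histogram
axes[0].set_title("Height distribution")

axes[1].scatter(df["height"], df["weight"], s=8)  # scatter plot
axes[1].set_title("Height vs weight")

corr = df.corr()                                  # correlation matrix
axes[2].imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)  # heatmap
axes[2].set_title("Correlation heatmap")

fig.tight_layout()
fig.savefig("eda.png")
```

Box plots (`df.plot.box()`) and class-distribution bars (`df["label"].value_counts().plot.bar()`) follow the same pattern.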
4. Model Building & Training

Objective:
Build a machine learning or deep learning model, compile it with an appropriate configuration, and train it on the prepared data.
Steps Covered:
- Defining a model (ML or DL)
- Choosing loss function, optimizer, metrics
- Model training with validation
- Accuracy and loss plots
Tools: scikit-learn, keras / tensorflow, matplotlib
5. Model Evaluation

Objective:
Evaluate model performance using appropriate metrics and visualize the results.
Evaluation Metrics:
- Accuracy, precision, recall, F1-score
- Confusion matrix
- ROC-AUC (for classification)
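All of these metrics are one-liners in scikit-learn. The labels and scores below are made up to show the API:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, roc_auc_score,
)

# Hypothetical ground truth, hard predictions, and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.2, 0.7, 0.9, 0.8, 0.4, 0.1, 0.6, 0.3]

acc = accuracy_score(y_true, y_pred)     # fraction of correct predictions
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision/recall
cm = confusion_matrix(y_true, y_pred)    # rows: true class, cols: predicted
auc = roc_auc_score(y_true, y_score)     # note: uses scores, not hard labels
```

ROC-AUC is computed from the continuous scores rather than the thresholded predictions, which is why it takes `y_score` instead of `y_pred`.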