This project analyzes a credit risk dataset to understand the relationships between customer attributes and credit outcomes, and to build classification models capable of predicting a chosen binary credit risk variable.
The analysis is structured in two main stages:
- Exploratory Data Analysis (EDA) to examine variable distributions, relationships, and data quality
- Supervised classification modeling to assess predictive performance using multiple machine learning algorithms

The project objectives are to:
- Perform exploratory data analysis to understand feature distributions and relationships
- Handle missing values and mixed data types
- Build and compare multiple classification models:
  - Logistic Regression
  - Random Forest
  - Support Vector Machine (SVM)
  - Linear SVM
- Evaluate model performance using appropriate classification metrics
- Identify the most effective model for credit risk prediction
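
The modeling objectives above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, and the column names (`income`, `loan_amount`, `home_ownership`, `target`) and all hyperparameters are assumptions chosen for the example. It imputes missing values, encodes mixed data types, and compares the four listed model families with stratified cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data; every column name here is hypothetical.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "income": rng.uniform(20_000, 100_000, n),
    "loan_amount": rng.uniform(1_000, 30_000, n),
    "home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], n).astype(object),
})
# Binary target driven by the loan-to-income ratio plus noise.
ratio = df["loan_amount"] / df["income"]
df["target"] = (ratio + rng.normal(0, 0.05, n) > ratio.median()).astype(int)

# Inject missing values so the imputers have work to do.
df.loc[rng.choice(n, 20, replace=False), "income"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "home_ownership"] = np.nan

numeric_cols = ["income", "loan_amount"]
categorical_cols = ["home_ownership"]
X, y = df[numeric_cols + categorical_cols], df["target"]

# Numeric columns: median imputation, then scaling.
# Categorical columns: most-frequent imputation, then one-hot encoding.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (RBF kernel)": SVC(),
    "Linear SVM": LinearSVC(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    pipe = Pipeline([("preprocess", preprocess), ("model", model)])
    # Mean ROC-AUC across the stratified folds.
    scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()

for name, score in scores.items():
    print(f"{name}: mean ROC-AUC = {score:.3f}")
```

Wrapping the preprocessing and the model in a single `Pipeline` means the imputers and scaler are fit only on each training fold, which avoids leaking information from the validation fold into the preprocessing step.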

The dataset is provided in the file credit_risk_dataset_classification.csv.
It contains customer-level financial and demographic features along with a binary target variable representing credit risk.
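
A first data-quality pass over such a dataset might look like the sketch below. The helper `summarize_quality` and the sample columns (`age`, `home_ownership`, `target`) are hypothetical; the real column names in the CSV are not stated here, so a small inline frame stands in for the loaded file.

```python
import numpy as np
import pandas as pd


def summarize_quality(df):
    """Return one row per column: dtype, missing count/percent, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(dropna=True),
    })


# Stand-in for the real data. With the actual file, this would presumably be
# pd.read_csv("credit_risk_dataset_classification.csv").
sample = pd.DataFrame({
    "age": [25, 41, np.nan, 33],
    "home_ownership": ["RENT", None, "OWN", "RENT"],
    "target": [0, 1, 0, 0],  # hypothetical name for the binary target
})

quality = summarize_quality(sample)
# Share of each class in the target, to check for class imbalance.
class_balance = sample["target"].value_counts(normalize=True)

print(quality)
print(class_balance)
```

Checking the class balance early matters for credit risk data, since an imbalanced target affects both the choice of metrics and the need for stratified splitting.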

This project is implemented in Python using the following libraries:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
  - Preprocessing and pipelines: ColumnTransformer, Pipeline, OneHotEncoder, StandardScaler, SimpleImputer
  - Models: LogisticRegression, RandomForestClassifier, SVC, LinearSVC, DecisionTreeClassifier
  - Model selection: train_test_split, StratifiedKFold, GridSearchCV
  - Evaluation metrics: ROC-AUC, Precision-Recall AUC, F1-score
- scipy.stats (loguniform)
- matplotlib.ticker
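
To show how the model selection tools and metrics listed above fit together, here is a hedged sketch using synthetic data from `make_classification`. The parameter grid, split sizes, and the choice of logistic regression are assumptions for illustration only. Precision-Recall AUC is estimated here with `average_precision_score`, which is one common summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary classification data (a stand-in for the real dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Hold out a stratified test set so class proportions match in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

# Evaluate the refit best estimator on the held-out test set.
proba = search.predict_proba(X_test)[:, 1]
pred = search.predict(X_test)
metrics = {
    "roc_auc": roc_auc_score(y_test, proba),
    "pr_auc": average_precision_score(y_test, proba),
    "f1": f1_score(y_test, pred),
}

print("Best parameters:", search.best_params_)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

ROC-AUC and Precision-Recall AUC are computed from predicted probabilities, while F1 depends on the hard class predictions at the default decision threshold, so the three metrics can rank models differently on imbalanced data.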