Data preprocessing, preparation, and feature reduction are among the most critical steps before applying any Machine Learning (ML) model; they are often said to account for 70–80% of a model's success.
- Preprocessing: essential to ensure data quality and consistency
- Preparation: critical for representativeness and feature engineering
- Feature Reduction: important for efficiency and avoiding overfitting

This work builds an AI solution for learning data preprocessing, preparation, and Singular Value Decomposition (SVD) for feature reduction, using the UCI Communities & Crime dataset.
The dataset has 128 columns in total: 122 predictive features, 5 non-predictive features, and 1 goal/target variable.
- Load the dataset
- Identify:
  - Numeric and non-numeric columns
  - Predictive and non-predictive attributes
- Exclude non-predictive attributes such as:
  - state, county, community, communityname
- Split predictive columns by data type:
  - Numeric
  - Categorical
(these columns will be used in later processing)
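
The loading and column-splitting steps above could look roughly like the following sketch. It assumes the data is held in a pandas DataFrame (for the UCI file, something like `pd.read_csv(path, header=None)` with column names taken from the accompanying names file would be one way to get there, but that part is not shown). The `split_columns` helper and the tiny `demo` frame are hypothetical stand-ins, and treating `ViolentCrimesPerPop` as the target is an assumption based on the dataset's documentation.

```python
import pandas as pd

# Assumed names: the identifiers listed above, plus the dataset's goal column.
NON_PREDICTIVE = ["state", "county", "community", "communityname"]
TARGET = "ViolentCrimesPerPop"


def split_columns(df, non_predictive=NON_PREDICTIVE, target=TARGET):
    """Return (numeric_cols, categorical_cols) among the predictive columns."""
    # Drop identifiers and the target so only predictive attributes remain.
    to_drop = [c for c in non_predictive if c in df.columns] + [target]
    predictive = df.drop(columns=to_drop, errors="ignore")
    numeric_cols = predictive.select_dtypes(include="number").columns.tolist()
    categorical_cols = predictive.select_dtypes(exclude="number").columns.tolist()
    return numeric_cols, categorical_cols


# Made-up rows standing in for the real file (illustrative values only).
demo = pd.DataFrame({
    "state": [1, 2, 3],
    "communityname": ["A", "B", "C"],
    "population": [0.1, 0.2, 0.3],
    # A column containing "?" placeholders is read as text, not as numbers.
    "LemasPctOfficDrugUn": ["0.1", "?", "0.3"],
    "ViolentCrimesPerPop": [0.2, 0.4, 0.6],
})

numeric_cols, categorical_cols = split_columns(demo)
print("numeric:", numeric_cols)
print("non-numeric:", categorical_cols)
```

Columns that end up in the non-numeric list only because of missing-value placeholders are candidates for conversion once missing values are handled.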
- Filter and retain only numeric columns
- Encode categorical columns
- Handle missing values
- Identify key predictive factors using correlation analysis
  - Compute correlation between features and the target variable
  - Analyze both positively and negatively correlated columns
    - Positive correlation → Features that increase as the target increases
    - Negative correlation → Features that decrease as the target increases
The top 5 positively correlated features are chosen, since they show the strongest positive association with the target.
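
The preprocessing and correlation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes missing values are already represented as NaN, fills numeric gaps with the column median, one-hot encodes categorical columns, and ranks features by Pearson correlation with the target. The helper names, the imputation and encoding choices, and the `demo` data (including the `region` column) are all hypothetical.

```python
import numpy as np
import pandas as pd

TARGET = "ViolentCrimesPerPop"  # assumed target column name


def preprocess(df, target=TARGET):
    """Impute numeric gaps with the median and one-hot encode the rest."""
    features = df.drop(columns=[target])
    num_cols = features.select_dtypes(include="number").columns
    cat_cols = [c for c in features.columns if c not in num_cols]
    # Median imputation is one simple choice among several.
    features[num_cols] = features[num_cols].fillna(features[num_cols].median())
    features = pd.get_dummies(features, columns=cat_cols, dtype=float)
    return features.join(df[target])


def rank_by_correlation(df, target=TARGET):
    """Pearson correlation of every feature with the target, high to low."""
    corr = df.drop(columns=[target]).corrwith(df[target])
    return corr.sort_values(ascending=False)


# Made-up data: one rising feature, one falling, one with gaps, one categorical.
demo = pd.DataFrame({
    "feat_up": [1, 2, 3, 4, 5, 6],
    "feat_down": [6, 5, 4, 3, 2, 1],
    "feat_gap": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "region": ["n", "s", "n", "s", "n", "s"],
    TARGET: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

cleaned = preprocess(demo)
ranked = rank_by_correlation(cleaned)
top5 = ranked.head(5)   # most positively correlated features
bottom5 = ranked.tail(5)  # most negatively correlated features
print(ranked)
```

Sorting the full correlation series once makes both ends of the ranking available, so positively and negatively correlated columns can be inspected from the same result.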

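The data is then split into training and testing sets and a random forest is applied, giving an R² score of 0.9999600609068787. The key predictive features it reports, by importance, are:

| Feature | Importance |
| --- | --- |
| ViolentCrimesPerPop | 0.999918 |
| LemasPctOfficDrugUn | 0.000003 |
| racepctblack | 0.000003 |
| population | 0.000003 |
| PctTeen2Par | 0.000003 |
| PctBSorMore | 0.000003 |
| PctYoungKids2Par | 0.000003 |
| PctKids2Par | 0.000003 |
| NumInShelters | 0.000003 |
| MedRentPctHousInc | 0.000002 |
| MalePctDivorce | 0.000002 |
| PctNotHSGrad | 0.000002 |
| PctWOFullPlumb | 0.000002 |
| racePctWhite | 0.000002 |
| TotalPctDiv | 0.000002 |

Note that ViolentCrimesPerPop is the dataset's target variable. Because it appears among the input features, the near-perfect R² reflects target leakage rather than predictive skill, and the remaining importances should be read with caution.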
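
A minimal sketch of the train/test split and random forest step is shown below, using scikit-learn. It keeps the target column out of the feature matrix, since a model that sees `ViolentCrimesPerPop` as an input can reproduce it almost exactly. The `fit_forest` helper, the synthetic `demo` data, and the settings (`test_size=0.2`, `n_estimators=100`, `random_state=42`) are assumptions for illustration, not the values used to produce the results above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

TARGET = "ViolentCrimesPerPop"  # assumed target column name


def fit_forest(df, target=TARGET, seed=42):
    """Fit a random forest and return (test R², sorted feature importances)."""
    X = df.drop(columns=[target])  # the target must not be among the inputs
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    importances = pd.Series(
        model.feature_importances_, index=X.columns
    ).sort_values(ascending=False)
    return score, importances


# Synthetic stand-in data: only feat_0 actually drives the target.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.uniform(size=(300, 3)), columns=["feat_0", "feat_1", "feat_2"]
)
demo[TARGET] = 3 * demo["feat_0"] + 0.1 * rng.normal(size=300)

score, importances = fit_forest(demo)
print(f"R² on held-out data: {score:.3f}")
print(importances)
```

Evaluating R² on the held-out split rather than on the training rows gives a more honest picture of how well the features explain the target.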
- Prepare data for Singular Value Decomposition (SVD)
- Perform SVD to decompose the dataset into components
- Analyze the obtained components and interpret their significance for target prediction
- SVD helps identify which features contribute most to each component
- Larger absolute feature weights within a component indicate a stronger contribution from that feature
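
The SVD steps above can be sketched with NumPy as follows. This is an illustrative outline under stated assumptions: features are standardized first (zero mean, unit variance), the rows of `Vt` are read as per-component feature weights, and the share of variance per component is taken from the squared singular values. The `svd_components` helper and the synthetic `demo` columns are hypothetical.

```python
import numpy as np
import pandas as pd


def svd_components(features):
    """Standardize the features, run SVD, and label the component weights."""
    X = features.to_numpy(dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # assumes no constant columns
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # share of total variance per component
    names = [f"component_{i + 1}" for i in range(len(s))]
    loadings = pd.DataFrame(Vt, index=names, columns=features.columns)
    return X, U, s, Vt, loadings, explained


# Synthetic stand-in data: feat_a and feat_b are nearly identical.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
demo = pd.DataFrame({
    "feat_a": base,
    "feat_b": base + 0.05 * rng.normal(size=100),
    "feat_c": rng.normal(size=100),
})

X, U, s, Vt, loadings, explained = svd_components(demo)

# Features ranked by absolute weight in the first component.
top_features = loadings.loc["component_1"].abs().sort_values(ascending=False)
print(explained)
print(top_features)
```

Ranking by absolute weight matters because the sign of a singular vector is arbitrary; a large negative weight signals as strong a contribution as a large positive one.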