Project Status: ✅ Completed
Author: Sana Ur Rehman Arain
Tools: R, Tidyverse, ggplot2, Caret, RMarkdown
This project applies statistical inference and multivariate regression analysis to identify the key drivers of student academic performance.
Unlike standard machine learning projects that emphasize prediction accuracy, this study focuses on explanatory inference: isolating which factors are significantly associated with final grades while holding other variables constant.
📚 Dataset: UCI Student Performance Dataset (Math course)
📈 Sample Size: 395 students
- Past academic failures have a large and highly significant negative impact
- Each additional failure reduces the final grade by approximately 1.93 points
- p-value < 0.001
- ANOVA shows Mother’s Job significantly affects student performance (p = 0.005)
- Tukey HSD post-hoc test reveals:
  - Students whose mothers work in Health professions score on average ~3 points higher than those whose mothers stay at home
  - Adjusted p-value = 0.018
- Welch two-sample t-test finds a significant difference in mean final grade (G3) by internet access:
  - With Internet: Mean G3 = 10.62
  - Without Internet: Mean G3 = 9.41
  - p-value = 0.049
- Going out with friends (goout, scale 1–5) has a negative impact
- Each unit increase reduces grades by approximately 0.42 points
- p-value = 0.029
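As a quick worked example of how the two regression coefficients above can be read, the sketch below applies them to a hypothetical pair of students who differ only in past failures and going-out frequency. It assumes both estimates come from the same additive linear model; the two student profiles are made up for illustration and are not rows from the dataset.

```r
# Coefficient estimates quoted in the findings above (G3 points per unit).
coefs <- c(failures = -1.93, goout = -0.42)

# Hypothetical profiles (illustrative only, not taken from the data).
student_a <- c(failures = 0, goout = 2)
student_b <- c(failures = 2, goout = 4)

# In an additive linear model the expected gap is the sum of
# coefficient * difference, with every other predictor held equal.
expected_gap <- sum(coefs * (student_b - student_a))
expected_gap  # 2 * -1.93 + 2 * -0.42 = -4.7
```

Under these assumptions, student B would be expected to score about 4.7 points lower than student A on G3.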
- T-Test: Do students with internet access perform better? → ✅ Yes
- ANOVA: Does parent job type affect academic outcomes? → ✅ Yes (notably the Health sector)
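The two hypothesis tests above could be reproduced along the following lines. This is a minimal sketch rather than the project's actual code: the column names `G3`, `internet`, and `Mjob`, the file path, and the semicolon separator are assumptions based on the usual layout of the UCI file, and `run_group_tests` is a hypothetical helper name.

```r
# Sketch only: column names (G3, internet, Mjob) are assumed from the
# UCI file layout, not confirmed against the project scripts.
run_group_tests <- function(students) {
  students$internet <- factor(students$internet)
  students$Mjob <- factor(students$Mjob)

  # t.test() uses var.equal = FALSE by default, which is the Welch test.
  welch <- t.test(G3 ~ internet, data = students)

  # One-way ANOVA of final grade on mother's job, followed by Tukey HSD
  # for all pairwise comparisons between job categories.
  fit <- aov(G3 ~ Mjob, data = students)

  list(welch = welch, anova = summary(fit), tukey = TukeyHSD(fit))
}

# Assumed path and separator; nothing runs if the file is not present.
path <- "data/raw/student-mat.csv"
if (file.exists(path)) {
  students <- read.csv(path, sep = ";")
  print(run_group_tests(students))
}
```

The Tukey output lists one row per pair of job categories, so the Health versus at-home comparison can be read off directly with its adjusted p-value.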
- Target Variable: G3 (Final Grade)
- Predictors: studytime, failures, absences, Medu (Mother’s Education), Fedu (Father’s Education), goout
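A model with this target and these predictors might be fitted as sketched below. It assumes all six predictors enter as numeric terms with no interactions or transformations, which the README does not state explicitly; `fit_grade_model`, the file path, and the separator are likewise assumptions.

```r
# Sketch: ordinary least squares with the six listed predictors,
# each treated as a single numeric term (an assumption).
fit_grade_model <- function(students) {
  lm(G3 ~ studytime + failures + absences + Medu + Fedu + goout,
     data = students)
}

# Assumed path and separator; nothing runs if the file is not present.
path <- "data/raw/student-mat.csv"
if (file.exists(path)) {
  students <- read.csv(path, sep = ";")
  print(summary(fit_grade_model(students)))
}
```

`summary()` reports each coefficient with its standard error and p-value, which is where estimates such as the failures and goout effects would be read from.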
Model Performance:
- R²: 0.162
- Adjusted R²: 0.149
- Interpretation: The model explains ~16% of variance in final grades — reasonable for social science data.
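The two fit statistics can be cross-checked with the standard adjusted R² formula. The sketch below assumes all 395 students were used and that the six predictors contribute one coefficient each; `adjusted_r2` is a helper defined here for illustration.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# n = number of observations, p = number of predictors (intercept excluded).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Reported R^2 of 0.162 with n = 395 students and p = 6 predictors.
check <- adjusted_r2(0.162, n = 395, p = 6)
round(check, 3)  # 0.149
```

Under those assumptions the result agrees with the reported adjusted R² of 0.149.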
Regression assumptions were validated using residual diagnostics:
- No major violations of normality
- Homoscedasticity reasonably satisfied
- Influential points within acceptable limits
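Checks of this kind can be run with base R alone. The sketch below is one possible approach, not necessarily the one used in the report: it applies a Shapiro-Wilk test to the residuals and counts observations whose Cook's distance exceeds 4/n, a common rule of thumb chosen here as an assumption. The helper names are hypothetical, and the built-in `mtcars` data stands in for the student model so the example runs on its own.

```r
# Numeric summaries of residual normality and influence.
check_assumptions <- function(model) {
  res <- residuals(model)
  cooks <- cooks.distance(model)
  list(
    normality = shapiro.test(res),            # H0: residuals are normal
    max_cooks_distance = max(cooks),
    n_flagged = sum(cooks > 4 / length(res))  # rule-of-thumb cutoff (assumed)
  )
}

# Standard 2x2 diagnostic panel: residuals vs fitted, Q-Q,
# scale-location (for homoscedasticity), residuals vs leverage.
plot_diagnostics <- function(model) {
  old <- par(mfrow = c(2, 2))
  on.exit(par(old))
  plot(model)
}

# Stand-in model on a built-in dataset, for demonstration only.
demo_model <- lm(mpg ~ wt + hp, data = mtcars)
str(check_assumptions(demo_model), max.level = 1)
```

Homoscedasticity is easiest to judge from the scale-location panel drawn by `plot_diagnostics()`; a roughly flat trend suggests constant residual variance.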
📊 Diagnostics and supporting visualizations are included below.
student-performance-analysis-R/
│
├── data/
│ ├── raw/
│ │ └── student-mat.csv # Original UCI dataset
│ └── processed/
│ └── student_clean.csv # Cleaned dataset
│
├── scripts/
│ ├── 01_data_cleaning.ipynb # Data preprocessing
│ ├── 02_eda.ipynb # Exploratory analysis & plots
│ └── 03_modeling.ipynb # T-tests, ANOVA, Regression
│
├── results/
│ ├── distribution_G3.png
│ ├── studytime_vs_grade.png
│ ├── mjob_vs_grade.png
│ ├── correlation matric of numeric variables.png
│ └── model_diagnostics.png
│
├── report.Rmd # Final RMarkdown report
└── README.md # Project documentation
📄 View Full Analysis Report: download report.html and open it in your browser for the complete interactive report with all visualizations and code.
- Clone this repository
- Run the notebooks in the scripts/ folder in order:
  - 01_data_cleaning.ipynb
  - 02_eda.ipynb
  - 03_modeling.ipynb
- Knit report.Rmd to generate the full HTML report
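The knitting step can also be done from an R session instead of the RStudio Knit button. This is a small sketch that assumes the `rmarkdown` package is installed and that the working directory is the repository root; `render_report` is a hypothetical wrapper that skips quietly when either requirement is missing.

```r
# Render the RMarkdown report to HTML if the file and package are available.
render_report <- function(path = "report.Rmd") {
  if (!file.exists(path) || !requireNamespace("rmarkdown", quietly = TRUE)) {
    message("Skipping: need ", path, " and the rmarkdown package")
    return(invisible(FALSE))
  }
  rmarkdown::render(path, output_format = "html_document")
  invisible(TRUE)
}

render_report()
```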
This analysis challenges the simplistic belief that “more studying automatically leads to better grades.”
Instead, results show that:
- Foundational gaps (past failures)
- Socioeconomic background
- Access to learning resources
are far more influential than raw study time alone.
- Early Intervention: Students with even a single past failure should be prioritized for academic support.
- Equity in Resources: Schools should ensure internet access and learning support for students from disadvantaged backgrounds.




