An R-based statistical inference project investigating the drivers of student academic performance. It moves beyond simple prediction to isolate statistically significant factors using multivariate regression, ANOVA, and t-tests.

Sanaurrehmanarain/student-performance-analysis-R

student-performance-analysis-R

Click the banner to view the full analysis report

📊 Statistical Analysis of Student Performance

Project Status: ✅ Completed
Author: Sana Ur Rehman Arain
Tools: R, Tidyverse, ggplot2, Caret, RMarkdown


📌 Overview

This project applies statistical inference and multivariate regression analysis to identify the key drivers of student academic performance.

Unlike standard machine learning projects that emphasize prediction accuracy, this study focuses on statistical inference: identifying which factors are significantly associated with final grades while holding other variables constant.

📚 Dataset: UCI Student Performance Dataset (Math course)
📈 Sample Size: 395 students


🧠 Key Statistical Findings

1️⃣ Failures Are the Strongest Predictor

  • Past academic failures have a large and highly significant negative effect in the regression model
  • Each additional failure is associated with a final grade (G3, scale 0–20) roughly 1.93 points lower, holding the other predictors constant
  • p-value < 0.001

2️⃣ Socioeconomic Advantage — Mother's Job

  • ANOVA shows Mother’s Job significantly affects student performance
    (p = 0.005)
  • Tukey HSD post-hoc test reveals:
    • Students whose mothers work in Health professions score on average
      ~3 points higher than those whose mothers stay at home
    • Adjusted p-value = 0.018
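
The ANOVA and Tukey HSD steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the project's actual script: the data frame is synthetic, so its output will not match the reported p-values. Only the column names `Mjob` and `G3` and the job categories follow the UCI dataset.

```r
# Synthetic stand-in for the student data (assumption: the real analysis
# would load the project's cleaned dataset instead).
set.seed(42)
jobs <- c("at_home", "health", "other", "services", "teacher")
students <- data.frame(
  Mjob = factor(sample(jobs, 395, replace = TRUE), levels = jobs)
)

# Simulated grades clamped to the 0-20 scale, with an artificial bump
# for the "health" group so the test has something to detect.
raw_grade <- rnorm(395, mean = 10, sd = 4) +
  ifelse(students$Mjob == "health", 3, 0)
students$G3 <- pmin(20, pmax(0, round(raw_grade)))

# One-way ANOVA: does mean G3 differ across mother's job categories?
fit_aov <- aov(G3 ~ Mjob, data = students)
print(summary(fit_aov))

# Tukey HSD: all pairwise differences with family-wise adjusted p-values.
tukey <- TukeyHSD(fit_aov, which = "Mjob")

# Row names have the form "later_level-earlier_level".
print(tukey$Mjob["health-at_home", ])
```

The `diff` column is the estimated difference in mean G3 between the two groups, and `p adj` is the adjusted p-value of the kind quoted in the findings.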

3️⃣ Internet Access Matters

  • Welch Two-Sample t-test confirms a significant difference:
    • With Internet: Mean G3 = 10.62
    • Without Internet: Mean G3 = 9.41
  • p-value = 0.049
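
A Welch two-sample t-test of this kind can be run with base R's `t.test`, which does not assume equal variances by default. The sketch below uses synthetic data with arbitrary group sizes and means chosen only for illustration. The column names `internet` and `G3` follow the UCI dataset; everything else is an assumption.

```r
# Synthetic stand-in data (assumption: group sizes and spread are arbitrary).
set.seed(1)
students <- data.frame(
  internet = factor(rep(c("no", "yes"), times = c(70, 325)))
)
group_mean <- ifelse(students$internet == "yes", 10.6, 9.4)
students$G3 <- pmin(20, pmax(0, round(rnorm(395, mean = group_mean, sd = 4.5))))

# var.equal = FALSE is the default, which gives the Welch variant.
welch <- t.test(G3 ~ internet, data = students)
print(welch)

# Group means and the p-value can be pulled out individually.
print(welch$estimate)
print(welch$p.value)
```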

4️⃣ Social Life Trade-Off

  • Going out with friends (goout, scale 1–5) is negatively associated with grades
  • Each one-unit increase corresponds to a final grade roughly 0.42 points lower, holding the other predictors constant
  • p-value = 0.029

📈 Methodology

🔬 Hypothesis Testing

  • T-Test:
    Do students with internet access perform better? → ✅ Yes
  • ANOVA:
    Does the mother's job type affect academic outcomes? → ✅ Yes
    (Notably the Health sector)

📐 Multivariate Linear Regression

  • Target Variable: G3 (Final Grade)
  • Predictors:
    • studytime
    • failures
    • absences
    • Medu (Mother’s Education)
    • Fedu (Father’s Education)
    • goout

Model Performance:

  • R²: 0.162
  • Adjusted R²: 0.149
  • Interpretation: The model explains ~16% of the variance in final grades, which is modest but reasonable for social science data.
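
As a quick consistency check, the adjusted R² can be recomputed from the reported R². The calculation assumes all 395 students were used in the fit and that the model has the six predictors listed above.

```r
r2 <- 0.162   # reported R-squared
n  <- 395     # sample size (assumes no rows were dropped)
p  <- 6       # number of predictors

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.149
```

The result agrees with the reported value of 0.149.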

🧪 Model Diagnostics

Regression assumptions were validated using residual diagnostics:

  • No major violations of normality
  • Homoscedasticity reasonably satisfied
  • Influential points within acceptable limits
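
The README does not say which specific checks were used, so the following is only a sketch of common residual diagnostics in base R. It uses a small synthetic model and writes the plots to a temporary PDF; the file name is a hypothetical placeholder.

```r
# Small synthetic model, used only to demonstrate the diagnostic calls.
set.seed(3)
x <- rnorm(200)
y <- 2 + 0.5 * x + rnorm(200)
model <- lm(y ~ x)

# Standard four-panel diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
pdf(file.path(tempdir(), "diagnostics_demo.pdf"))
par(mfrow = c(2, 2))
plot(model)
invisible(dev.off())

# Numeric companions to the plots.
shapiro  <- shapiro.test(residuals(model))   # normality of residuals
max_cook <- max(cooks.distance(model))       # most influential observation

print(shapiro$p.value)
print(max_cook)
```

A large Shapiro-Wilk p-value and small Cook's distances are consistent with the kinds of conclusions listed above, although the plots are usually the more informative check.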

📊 Diagnostics and supporting visualizations are included below.


🖼️ Visual Results

📉 Grade Distribution

[Figure: Grade Distribution]

📊 Study Time vs Final Grade

[Figure: Study Time vs Grade]

🧑‍⚕️ Mother's Job vs Grade

[Figure: Mother Job vs Grade]

🔗 Correlation Matrix (Numeric Variables)

[Figure: Correlation Matrix]

🧪 Regression Diagnostics

[Figure: Model Diagnostics]


📂 Project Structure

student-performance-analysis-R/
│
├── data/
│   ├── raw/
│   │   └── student-mat.csv          # Original UCI dataset
│   └── processed/
│       └── student_clean.csv        # Cleaned dataset
│
├── scripts/
│   ├── 01_data_cleaning.ipynb       # Data preprocessing
│   ├── 02_eda.ipynb                 # Exploratory analysis & plots
│   └── 03_modeling.ipynb            # T-tests, ANOVA, Regression
│
├── results/
│   ├── distribution_G3.png
│   ├── studytime_vs_grade.png
│   ├── mjob_vs_grade.png
│   ├── correlation matric of numeric variables.png
│   └── model_diagnostics.png
│
├── report.Rmd                       # Final RMarkdown report
└── README.md                        # Project documentation

View the Report

📄 View Full Analysis Report - Download report.html and open in your browser for the complete interactive report with all visualizations and code.


🚀 How to Run the Project

  1. Clone this repository

  2. Run notebooks in the scripts/ folder in order:

    • 01_data_cleaning.ipynb

    • 02_eda.ipynb

    • 03_modeling.ipynb

  3. Knit report.Rmd to generate the full HTML report
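
Step 3 can also be done from an R console instead of the RStudio Knit button. The helper below is a hypothetical convenience wrapper, not part of the repository. It assumes the working directory is the repository root and that the `rmarkdown` package is installed.

```r
# Hypothetical helper: knit the report only when the file and package exist.
knit_report <- function(path = "report.Rmd") {
  if (!file.exists(path)) {
    message("Not found: ", path)
    return(invisible(NULL))
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("Install the rmarkdown package first.")
    return(invisible(NULL))
  }
  # Renders the R Markdown file to HTML next to the source file.
  rmarkdown::render(path, output_format = "html_document")
}

knit_report()
```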

🏁 Conclusion

This analysis challenges the simplistic belief that “more studying automatically leads to better grades.”

Instead, the results show that three factors are more strongly associated with final grades than raw study time alone:

  • Foundational gaps (past failures)
  • Socioeconomic background
  • Access to learning resources

🎯 Recommendations

  1. Early Intervention: Students with even a single past failure should be prioritized for academic support.

  2. Equity in Resources: Schools should ensure internet access and learning support for students from disadvantaged backgrounds.

📬 For questions or collaboration, feel free to reach out.
