Data Directory

This directory contains the input dataset for the Student Depression Data Pipeline and Prediction project. The dataset, student_depression_dataset.csv, captures student mental health data used to analyze depression prevalence and predict depression outcomes through a machine learning model.

Contents

  • student_depression_dataset.csv

    • Description: A CSV file containing raw data on student demographics, academic factors, lifestyle habits, and mental health indicators.
    • Size: Sample includes 27.9K rows.
    • Source: Synthetic or anonymized data (assumed, as no source is specified).
    • Location: ./student_depression_dataset.csv
  • README.md (this file)

    • Description: Documentation of the dataset, its structure, and its role in the project.

Dataset Overview

The dataset comprises 18 columns describing student attributes. Judging by the city names (e.g., Bangalore, Mumbai, Srinagar), the students are primarily from India. The file is ingested into the Snowflake table STUDENT_DEPRESSION_DATASET.PUBLIC.BRONZE_STUDENT_DATA as the Bronze layer of the Medallion Architecture. The schema and a description of each field follow:

| Column Name | Data Type | Description |
| --- | --- | --- |
| ID | NUMBER(38,0) | Unique identifier for each student record. |
| GENDER | VARCHAR(16777216) | Gender of the student (e.g., Male, Female). |
| AGE | NUMBER(38,0) | Age of the student (e.g., 18-39 in sample). |
| CITY | VARCHAR(16777216) | City of residence (e.g., Bangalore, Chennai, Kalyan). |
| PROFESSION | VARCHAR(16777216) | Occupation (mostly "Student"; one "Civil Engineer" and "Architect" noted). |
| ACADEMIC_PRESSURE | NUMBER(38,0) | Self-reported academic pressure level (1-5 scale). |
| WORK_PRESSURE | NUMBER(38,0) | Self-reported work pressure level (mostly 0 for students). |
| CGPA | FLOAT | Cumulative Grade Point Average (e.g., 5.03-9.97 in sample). |
| STUDY_SATISFACTION | NUMBER(38,0) | Satisfaction with studies (1-5 scale). |
| JOB_SATISFACTION | NUMBER(38,0) | Satisfaction with job (mostly 0, as most are students). |
| SLEEP_DURATION | VARCHAR(16777216) | Daily sleep duration (e.g., "Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours"). |
| DIETARY_HABITS | VARCHAR(16777216) | Dietary quality (e.g., "Healthy", "Moderate", "Unhealthy"). |
| DEGREE | VARCHAR(16777216) | Academic degree pursued (e.g., "Class 12", "BSc", "M.Tech", "PhD"). |
| SUICIDAL_THOUGHTS | VARCHAR(16777216) | Presence of suicidal thoughts (Yes/No). |
| WORK_STUDY_HOURS | NUMBER(38,0) | Hours spent on work or study per day (0-12 in sample). |
| FINANCIAL_STRESS | NUMBER(38,0) | Self-reported financial stress level (1-5 scale). |
| FAMILY_HISTORY | VARCHAR(16777216) | Family history of mental illness (Yes/No). |
| DEPRESSION | NUMBER(38,0) | Depression indicator (0 = No, 1 = Yes). |
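As a rough illustration of how the schema maps onto Python types, the sketch below reads CSV rows and casts each field. It assumes the CSV header uses the same upper-case names as the Snowflake columns, which may not match the actual file. The `SCHEMA` mapping and `read_typed_rows` helper are hypothetical names, not part of the pipeline, and the sample row is made up.

```python
import csv
import io


def _to_int(value):
    # Accept "3" as well as "3.0", since raw exports sometimes store integers as floats.
    return int(float(value))


# Column -> Python caster, mirroring the Snowflake schema in the table above.
# Assumption: the CSV header uses these exact upper-case names.
SCHEMA = {
    "ID": _to_int, "GENDER": str, "AGE": _to_int, "CITY": str, "PROFESSION": str,
    "ACADEMIC_PRESSURE": _to_int, "WORK_PRESSURE": _to_int, "CGPA": float,
    "STUDY_SATISFACTION": _to_int, "JOB_SATISFACTION": _to_int,
    "SLEEP_DURATION": str, "DIETARY_HABITS": str, "DEGREE": str,
    "SUICIDAL_THOUGHTS": str, "WORK_STUDY_HOURS": _to_int,
    "FINANCIAL_STRESS": _to_int, "FAMILY_HISTORY": str, "DEPRESSION": _to_int,
}


def read_typed_rows(handle):
    """Yield one dict per CSV row, with values cast according to SCHEMA."""
    reader = csv.DictReader(handle)
    missing = set(SCHEMA) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        yield {name: cast(row[name]) for name, cast in SCHEMA.items()}


# Made-up row for illustration only; not taken from the dataset.
sample = io.StringIO(
    ",".join(SCHEMA) + "\n"
    "1,Male,21,Bangalore,Student,3.0,0,7.5,4,0,5-6 hours,Healthy,BSc,No,6,2,No,0\n"
)
rows = list(read_typed_rows(sample))
print(rows[0]["AGE"], rows[0]["CGPA"], rows[0]["SLEEP_DURATION"])
```

For the real file, the same function could be called with `open("student_depression_dataset.csv", newline="")`, provided the header names match.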

Sample Data Insights

  • Demographics: Ages range from 18 to 39, with a mix of male (60%) and female (40%) students in the sample.
  • Academic Pressure: Ranges from 1 to 5; in the sample, higher pressure is often linked to lower CGPA.
  • Depression Prevalence: Approximately 61% of the sample is flagged as depressed (147 of 242 records have DEPRESSION = 1).
  • Sleep and Diet: Sleep of "Less than 5 hours" and "Unhealthy" dietary habits are common among records flagged as depressed.
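The prevalence figure is a simple proportion of records with DEPRESSION = 1. A minimal sketch of that arithmetic, using the counts quoted above (the `prevalence` helper is a hypothetical name, not a function from the pipeline):

```python
def prevalence(labels):
    """Return the share of records whose DEPRESSION flag equals 1."""
    labels = list(labels)
    if not labels:
        raise ValueError("no records")
    return sum(1 for value in labels if value == 1) / len(labels)


# Rebuild the quoted sample: 147 positive records out of 242.
labels = [1] * 147 + [0] * (242 - 147)
rate = prevalence(labels)
print(f"{rate:.1%}")  # prints 60.7%
```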

Results Supported by the Dataset

This dataset drives the following outputs in the pipeline:

  1. Bronze Layer: Raw data ingested into BRONZE_STUDENT_DATA (see schema above).
  2. Silver Layer: Cleaned and standardized data in SILVER_STUDENT_DATA, removing inconsistencies (e.g., handling rare non-student professions).
  3. Gold Layer: Aggregated insights in GOLD_STUDENT_INSIGHTS, such as depression rates by gender and CGPA vs. academic pressure trends.
  4. Visualizations: Plots like depression_rate.png, cgpa_pressure.png, and age_distribution.png in example/.
  5. Prediction Model: A Random Forest Classifier trained on features (e.g., ACADEMIC_PRESSURE, CGPA, SLEEP_DURATION) to predict DEPRESSION, saved as model/depression_model.joblib.

See the root README.md and example/README.md for detailed results and visualizations.
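To make the Silver and Gold steps concrete, here is a minimal in-memory sketch. It is not the pipeline's actual SQL or code: dropping non-student records is only one possible way of handling rare professions, and `to_silver` and `depression_rate_by` are hypothetical helper names. The rows are made up for illustration.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Trim string fields and keep only records whose PROFESSION is 'Student'."""
    silver = []
    for row in bronze_rows:
        clean = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        # Assumption: rare non-student professions are dropped rather than recoded.
        if clean["PROFESSION"] == "Student":
            silver.append(clean)
    return silver


def depression_rate_by(silver_rows, key):
    """Return {group: share of rows with DEPRESSION == 1} for the given column."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in silver_rows:
        totals[row[key]] += 1
        positives[row[key]] += row["DEPRESSION"]
    return {group: positives[group] / totals[group] for group in totals}


# Made-up Bronze rows for illustration only.
bronze = [
    {"GENDER": "Male ", "PROFESSION": "Student", "DEPRESSION": 1},
    {"GENDER": "Male", "PROFESSION": "Student", "DEPRESSION": 0},
    {"GENDER": "Female", "PROFESSION": " Student", "DEPRESSION": 1},
    {"GENDER": "Female", "PROFESSION": "Architect", "DEPRESSION": 0},
]
silver = to_silver(bronze)
gold = depression_rate_by(silver, "GENDER")
print(gold)
```

The same grouping function works for other Gold-style cuts, for example `depression_rate_by(silver, "SLEEP_DURATION")` once that column is present.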

Potential Value-Adding Insights

Based on the dataset’s structure and sample, here are suggestions to enhance its utility:

  1. Feature Enrichment:
    • Add Stress Coping Mechanisms (e.g., exercise, meditation) to explore protective factors against depression.
    • Include Social Support (e.g., friends, family) as a variable, as it often influences mental health outcomes.
  2. Granularity:
    • Break down SLEEP_DURATION into numeric hours (e.g., 4, 6, 8) for finer statistical analysis.
    • Categorize DEGREE into levels (e.g., High School, Undergraduate, Postgraduate) for trend analysis.
  3. External Correlation:
    • Link city data to socioeconomic indices (e.g., cost of living in Bangalore vs. Srinagar) to assess environmental impact on stress.
    • Incorporate academic calendar events (e.g., exam periods) to contextualize ACADEMIC_PRESSURE.
  4. Model Improvement:
    • Use additional features like FAMILY_HISTORY and SUICIDAL_THOUGHTS to improve depression prediction accuracy.
    • Explore time-series analysis if longitudinal data (e.g., repeated measures per student) could be collected.

These enhancements could deepen insights into student mental health drivers and improve the predictive power of the model, benefiting stakeholders like educators, counselors, and policymakers.
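The granularity suggestions above could be sketched as two lookup tables. The numeric hours are assumed representative values (midpoints for the closed ranges, arbitrary picks for the open-ended ones), and the degree mapping covers only the four example degrees listed in the schema; both tables and the helper names are hypothetical.

```python
# Assumed representative hours per SLEEP_DURATION label.
SLEEP_HOURS = {
    "Less than 5 hours": 4.0,   # open-ended bin, value is an assumption
    "5-6 hours": 5.5,           # midpoint
    "7-8 hours": 7.5,           # midpoint
    "More than 8 hours": 9.0,   # open-ended bin, value is an assumption
}

# Assumed level for each example degree named in the schema table.
DEGREE_LEVEL = {
    "Class 12": "High School",
    "BSc": "Undergraduate",
    "M.Tech": "Postgraduate",
    "PhD": "Postgraduate",
}


def sleep_hours(label):
    """Map a SLEEP_DURATION label to numeric hours, or None if unrecognized."""
    return SLEEP_HOURS.get(label.strip())


def degree_level(degree):
    """Map a DEGREE value to a coarse level, or 'Unknown' if unmapped."""
    return DEGREE_LEVEL.get(degree.strip(), "Unknown")


print(sleep_hours("5-6 hours"), degree_level("M.Tech"))
```

Returning None or "Unknown" for unmapped values keeps unexpected labels visible instead of silently assigning them to a bin.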

How to Use

  1. Place student_depression_dataset.csv in this directory (data/).
  2. Run the pipeline starting with ingestion:
    • Run python code/ingest.py (see ../code/ingest.py) to load the data into Snowflake.
    • Follow steps in the root README.md for processing, visualization, and modeling.

Notes

  • Ethical Use: This dataset involves sensitive mental health information. Ensure compliance with privacy standards (e.g., anonymization) in real-world applications.

For further details, refer to the pipeline scripts in code/ and outputs in example/.