This directory contains the input dataset for the Student Depression Data Pipeline and Prediction project. The dataset, `student_depression_dataset.csv`, captures student mental health data used to analyze depression prevalence and predict depression outcomes through a machine learning model.
- `student_depression_dataset.csv`
  - Description: A CSV file containing raw data on student demographics, academic factors, lifestyle habits, and mental health indicators.
  - Size: Sample includes 27.9K rows.
  - Source: Synthetic or anonymized data (assumed, as no source is specified).
  - Location: `./student_depression_dataset.csv`
- `README.md` (this file)
  - Description: Documentation of the dataset, its structure, and its role in the project.
The dataset comprises 18 columns describing student attributes. Judging by the city names (e.g., Bangalore, Mumbai, Srinagar), the students are primarily from India. The data is ingested into the Snowflake table `STUDENT_DEPRESSION_DATASET.PUBLIC.BRONZE_STUDENT_DATA` as the Bronze layer of the Medallion Architecture. The schema and a description of each field follow:
| Column Name | Data Type | Description |
|---|---|---|
| `ID` | NUMBER(38,0) | Unique identifier for each student record. |
| `GENDER` | VARCHAR(16777216) | Gender of the student (e.g., Male, Female). |
| `AGE` | NUMBER(38,0) | Age of the student (e.g., 18-39 in sample). |
| `CITY` | VARCHAR(16777216) | City of residence (e.g., Bangalore, Chennai, Kalyan). |
| `PROFESSION` | VARCHAR(16777216) | Occupation (mostly "Student"; one "Civil Engineer" and "Architect" noted). |
| `ACADEMIC_PRESSURE` | NUMBER(38,0) | Self-reported academic pressure level (1-5 scale). |
| `WORK_PRESSURE` | NUMBER(38,0) | Self-reported work pressure level (mostly 0 for students). |
| `CGPA` | FLOAT | Cumulative Grade Point Average (e.g., 5.03-9.97 in sample). |
| `STUDY_SATISFACTION` | NUMBER(38,0) | Satisfaction with studies (1-5 scale). |
| `JOB_SATISFACTION` | NUMBER(38,0) | Satisfaction with job (mostly 0, as most are students). |
| `SLEEP_DURATION` | VARCHAR(16777216) | Daily sleep duration (e.g., "Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours"). |
| `DIETARY_HABITS` | VARCHAR(16777216) | Dietary quality (e.g., "Healthy", "Moderate", "Unhealthy"). |
| `DEGREE` | VARCHAR(16777216) | Academic degree pursued (e.g., "Class 12", "BSc", "M.Tech", "PhD"). |
| `SUICIDAL_THOUGHTS` | VARCHAR(16777216) | Presence of suicidal thoughts (Yes/No). |
| `WORK_STUDY_HOURS` | NUMBER(38,0) | Hours spent on work or study per day (0-12 in sample). |
| `FINANCIAL_STRESS` | NUMBER(38,0) | Self-reported financial stress level (1-5 scale). |
| `FAMILY_HISTORY` | VARCHAR(16777216) | Family history of mental illness (Yes/No). |
| `DEPRESSION` | NUMBER(38,0) | Depression indicator (0 = No, 1 = Yes). |
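Before ingestion, it can help to confirm that the CSV header covers the 18 columns in the schema above. The following is a minimal sketch using only the standard library. It assumes the CSV header names match the Snowflake column names up to case and spacing, which this README does not confirm; `normalize` and `missing_columns` are hypothetical helper names, not part of the pipeline.

```python
import csv
import io

# Column names as listed in the schema table. Assumption: the CSV header uses
# the same names, possibly in a different case or with spaces for underscores.
EXPECTED_COLUMNS = [
    "ID", "GENDER", "AGE", "CITY", "PROFESSION", "ACADEMIC_PRESSURE",
    "WORK_PRESSURE", "CGPA", "STUDY_SATISFACTION", "JOB_SATISFACTION",
    "SLEEP_DURATION", "DIETARY_HABITS", "DEGREE", "SUICIDAL_THOUGHTS",
    "WORK_STUDY_HOURS", "FINANCIAL_STRESS", "FAMILY_HISTORY", "DEPRESSION",
]


def normalize(name: str) -> str:
    """Upper-case a header and replace whitespace runs with underscores."""
    return "_".join(name.strip().upper().split())


def missing_columns(csv_file) -> list[str]:
    """Return the expected columns that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    found = {normalize(h) for h in header}
    return [c for c in EXPECTED_COLUMNS if c not in found]


# Usage with an in-memory stand-in for the real file (illustrative data only).
sample = io.StringIO("id,Gender,Age,City\n1,Male,21,Chennai\n")
print(missing_columns(sample))
```

To check the real file, pass an open handle instead, for example `open("student_depression_dataset.csv", newline="")`. An empty result means every schema column was found.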
- Demographics: Ages range from 18 to 39, with a mix of male (60%) and female (40%) students in the sample.
- Academic Pressure: Varies from 1 to 5, with some correlation to CGPA (e.g., higher pressure often linked to lower CGPA).
- Depression Prevalence: Approximately 60% of the records in a 242-row sample report depression (147 of 242 have `DEPRESSION` = 1).
- Sleep and Diet: Sleeping "Less than 5 hours" and having "Unhealthy" dietary habits are common among those with depression.
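Figures like these can be recomputed from the raw rows. The sketch below shows one way to derive a depression rate per category and reproduces the prevalence arithmetic quoted above. `depression_rate_by` is a hypothetical helper, and the example rows are illustrative, not taken from the dataset.

```python
from collections import defaultdict


def depression_rate_by(rows, key):
    """Return {group value: share of rows with DEPRESSION == 1} for one column."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        # int() also accepts the "0"/"1" strings produced by csv.DictReader.
        positives[group] += int(row["DEPRESSION"])
    return {group: positives[group] / totals[group] for group in totals}


# Illustrative rows only, not taken from the dataset.
rows = [
    {"SLEEP_DURATION": "Less than 5 hours", "DEPRESSION": 1},
    {"SLEEP_DURATION": "Less than 5 hours", "DEPRESSION": 1},
    {"SLEEP_DURATION": "7-8 hours", "DEPRESSION": 0},
    {"SLEEP_DURATION": "7-8 hours", "DEPRESSION": 1},
]
rates = depression_rate_by(rows, "SLEEP_DURATION")

# The prevalence figure quoted above: 147 positive records out of 242.
prevalence = 147 / 242
print(f"{prevalence:.1%}")  # 60.7%
```

The same function works for `GENDER` or `DIETARY_HABITS` by changing the `key` argument.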
This dataset drives the following outputs in the pipeline:
- Bronze Layer: Raw data ingested into `BRONZE_STUDENT_DATA` (see schema above).
- Silver Layer: Cleaned and standardized data in `SILVER_STUDENT_DATA`, removing inconsistencies (e.g., handling rare non-student professions).
- Gold Layer: Aggregated insights in `GOLD_STUDENT_INSIGHTS`, such as depression rates by gender and CGPA vs. academic pressure trends.
- Visualizations: Plots like `depression_rate.png`, `cgpa_pressure.png`, and `age_distribution.png` in `example/`.
- Prediction Model: A Random Forest Classifier trained on features (e.g., `ACADEMIC_PRESSURE`, `CGPA`, `SLEEP_DURATION`) to predict `DEPRESSION`, saved as `model/depression_model.joblib`.
See the root README.md and example/README.md for detailed results and visualizations.
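As a rough illustration of the Bronze-to-Silver step, the sketch below cleans rows in plain Python. The actual pipeline runs in Snowflake, and this README does not say how non-student professions are handled, so dropping them here is an assumption made for the example. `to_silver` is a hypothetical name.

```python
def to_silver(rows):
    """Yield cleaned copies of Bronze rows.

    Assumptions (not confirmed by the pipeline code): text values are trimmed,
    CGPA and DEPRESSION are cast to numbers, and rows whose PROFESSION is not
    "Student" are dropped.
    """
    for row in rows:
        # Trim surrounding whitespace on every text value.
        clean = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if clean["PROFESSION"].lower() != "student":
            continue  # assumed handling of rare non-student professions
        clean["CGPA"] = float(clean["CGPA"])
        clean["DEPRESSION"] = int(clean["DEPRESSION"])
        yield clean


# Illustrative Bronze rows only, not taken from the dataset.
bronze = [
    {"PROFESSION": " Student ", "CGPA": "8.97", "DEPRESSION": "1"},
    {"PROFESSION": "Architect", "CGPA": "7.10", "DEPRESSION": "0"},
]
silver = list(to_silver(bronze))
print(silver)
```

Keeping non-student rows with a flag column instead of dropping them would be an equally valid choice; the scripts in `code/` are the reference for what the pipeline actually does.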
Based on the dataset’s structure and sample, here are suggestions to enhance its utility:
- Feature Enrichment:
  - Add Stress Coping Mechanisms (e.g., exercise, meditation) to explore protective factors against depression.
  - Include Social Support (e.g., friends, family) as a variable, as it often influences mental health outcomes.
- Granularity:
  - Break down `SLEEP_DURATION` into numeric hours (e.g., 4, 6, 8) for finer statistical analysis.
  - Categorize `DEGREE` into levels (e.g., High School, Undergraduate, Postgraduate) for trend analysis.
- External Correlation:
  - Link city data to socioeconomic indices (e.g., cost of living in Bangalore vs. Srinagar) to assess environmental impact on stress.
  - Incorporate academic calendar events (e.g., exam periods) to contextualize `ACADEMIC_PRESSURE`.
- Model Improvement:
  - Use additional features like `FAMILY_HISTORY` and `SUICIDAL_THOUGHTS` to improve depression prediction accuracy.
  - Explore time-series analysis if longitudinal data (e.g., repeated measures per student) could be collected.
These enhancements could deepen insights into student mental health drivers and improve the predictive power of the model, benefiting stakeholders like educators, counselors, and policymakers.
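The granularity suggestions above could be sketched as simple lookup tables. Both mappings are hypothetical: the hour values are representative picks for each category, and the degree table covers only the degrees named in the schema, so a real mapping would need the full list of values found in the data.

```python
# Hypothetical mapping: one representative number of hours per category.
SLEEP_HOURS = {
    "Less than 5 hours": 4.0,
    "5-6 hours": 5.5,
    "7-8 hours": 7.5,
    "More than 8 hours": 9.0,
}

# Hypothetical, partial mapping covering only the degrees named in the schema.
DEGREE_LEVEL = {
    "Class 12": "High School",
    "BSc": "Undergraduate",
    "M.Tech": "Postgraduate",
    "PhD": "Postgraduate",
}


def sleep_hours(label: str):
    """Return representative hours for a SLEEP_DURATION label, or None if unknown."""
    return SLEEP_HOURS.get(label.strip())


def degree_level(degree: str) -> str:
    """Return the level for a DEGREE value, or "Other" if it is not mapped."""
    return DEGREE_LEVEL.get(degree.strip(), "Other")


print(sleep_hours("5-6 hours"), degree_level("M.Tech"))
```

Returning `None` for unmapped sleep labels keeps unexpected categories visible instead of silently assigning them a number.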
- Place `student_depression_dataset.csv` in this directory (`data/`).
- Run the pipeline starting with ingestion:
  - `python` [code/ingest.py](../code/ingest.py) to load data into Snowflake.
  - Follow steps in the root `README.md` for processing, visualization, and modeling.
- Ethical Use: This dataset involves sensitive mental health information. Ensure compliance with privacy standards (e.g., anonymization) in real-world applications.
For further details, refer to the pipeline scripts in code/ and outputs in example/.