Data Directory

This directory contains the input dataset for the Student Depression Data Pipeline and Prediction project. The dataset, student_depression_dataset.csv, captures student mental health data used to analyze depression prevalence and predict depression outcomes through a machine learning model.

student_depression_dataset.csv
- Description: A CSV file containing raw data on student demographics, academic factors, lifestyle habits, and mental health indicators.
- Size: Approximately 27.9K rows.
- Source: Not specified; assumed to be synthetic or anonymized data.
- Location: ./student_depression_dataset.csv
README.md (this file)
- Description: Documentation of the dataset, its structure, and its role in the project.

Dataset Overview

The dataset comprises 18 columns describing student attributes. Judging by the city names (e.g., Bangalore, Mumbai, Srinagar), most records come from India. The file is ingested into the Snowflake table STUDENT_DEPRESSION_DATASET.PUBLIC.BRONZE_STUDENT_DATA, which serves as the Bronze layer of the Medallion Architecture. The schema and a description of each field follow:

| Column Name | Data Type | Description |
| --- | --- | --- |
| `ID` | NUMBER(38,0) | Unique identifier for each student record. |
| `GENDER` | VARCHAR(16777216) | Gender of the student (e.g., Male, Female). |
| `AGE` | NUMBER(38,0) | Age of the student (18-39 in the sample). |
| `CITY` | VARCHAR(16777216) | City of residence (e.g., Bangalore, Chennai, Kalyan). |
| `PROFESSION` | VARCHAR(16777216) | Occupation (mostly "Student"; isolated "Civil Engineer" and "Architect" entries also appear). |
| `ACADEMIC_PRESSURE` | NUMBER(38,0) | Self-reported academic pressure level (1-5 scale). |
| `WORK_PRESSURE` | NUMBER(38,0) | Self-reported work pressure level (mostly 0 for students). |
| `CGPA` | FLOAT | Cumulative Grade Point Average (5.03-9.97 in the sample). |
| `STUDY_SATISFACTION` | NUMBER(38,0) | Satisfaction with studies (1-5 scale). |
| `JOB_SATISFACTION` | NUMBER(38,0) | Satisfaction with job (mostly 0, as most respondents are students). |
| `SLEEP_DURATION` | VARCHAR(16777216) | Daily sleep duration (e.g., "Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours"). |
| `DIETARY_HABITS` | VARCHAR(16777216) | Dietary quality (e.g., "Healthy", "Moderate", "Unhealthy"). |
| `DEGREE` | VARCHAR(16777216) | Academic degree pursued (e.g., "Class 12", "BSc", "M.Tech", "PhD"). |
| `SUICIDAL_THOUGHTS` | VARCHAR(16777216) | Presence of suicidal thoughts (Yes/No). |
| `WORK_STUDY_HOURS` | NUMBER(38,0) | Hours spent on work or study per day (0-12 in the sample). |
| `FINANCIAL_STRESS` | NUMBER(38,0) | Self-reported financial stress level (1-5 scale). |
| `FAMILY_HISTORY` | VARCHAR(16777216) | Family history of mental illness (Yes/No). |
| `DEPRESSION` | NUMBER(38,0) | Depression indicator (0 = No, 1 = Yes). |
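
As a rough illustration of the schema, the sketch below casts one CSV row (as produced by `csv.DictReader`) into Python types that mirror the Snowflake column types. It assumes the CSV header uses the same upper-case names as the Bronze table, which may not match the actual file. `parse_row`, `INT_COLUMNS`, and `FLOAT_COLUMNS` are hypothetical names introduced here and are not part of the pipeline scripts.

```python
import csv
import io

# Columns typed NUMBER(38,0) in the schema (treated as integers here).
INT_COLUMNS = [
    "ID", "AGE", "ACADEMIC_PRESSURE", "WORK_PRESSURE", "STUDY_SATISFACTION",
    "JOB_SATISFACTION", "WORK_STUDY_HOURS", "FINANCIAL_STRESS", "DEPRESSION",
]
# Columns typed FLOAT in the schema.
FLOAT_COLUMNS = ["CGPA"]


def parse_row(row):
    """Return a copy of `row` with numeric columns cast to int or float.

    Text columns (VARCHAR in the schema) are stripped of surrounding
    whitespace. Raises ValueError if a numeric field cannot be parsed.
    """
    parsed = {}
    for name, value in row.items():
        value = value.strip()
        if name in INT_COLUMNS:
            # Go through float() so values such as "5.0" are accepted;
            # note that this truncates any fractional part.
            parsed[name] = int(float(value))
        elif name in FLOAT_COLUMNS:
            parsed[name] = float(value)
        else:
            parsed[name] = value
    return parsed


# Five-column excerpt with made-up values; the real file has 18 columns.
sample = io.StringIO("ID,AGE,CGPA,GENDER,DEPRESSION\n2,33,8.97,Male,1\n")
rows = [parse_row(r) for r in csv.DictReader(sample)]
print(rows[0])
```

Casting early like this makes type problems surface before ingestion rather than inside Snowflake, though the actual ingest script may handle typing differently.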

Sample Data Insights

Demographics: Ages range from 18 to 39; about 60% of the sampled students are male and 40% are female.
Academic Pressure: Ranges from 1 to 5 and shows some negative association with CGPA (higher pressure is often accompanied by lower CGPA).
Depression Prevalence: Roughly 61% of the sampled records are flagged as depressed (147 of 242 have DEPRESSION = 1).
Sleep and Diet: "Less than 5 hours" of sleep and "Unhealthy" dietary habits are common among students flagged as depressed.
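
The prevalence figure and the per-category observations above can be reproduced with a small grouping helper. This is a minimal sketch using only the standard library; `depression_rate_by` is a hypothetical helper, and the rows are made up purely to show the call shape.

```python
from collections import defaultdict


def depression_rate_by(rows, column):
    """Return {category: share of rows with DEPRESSION == 1} for `column`."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        key = row[column]
        totals[key] += 1
        positives[key] += int(row["DEPRESSION"])
    return {key: positives[key] / totals[key] for key in totals}


# Overall prevalence from the counts quoted in this section: 147 of 242.
overall = 147 / 242
print(f"{overall:.1%}")  # 60.7%

# Made-up rows, not taken from the dataset.
rows = [
    {"SLEEP_DURATION": "Less than 5 hours", "DEPRESSION": 1},
    {"SLEEP_DURATION": "Less than 5 hours", "DEPRESSION": 1},
    {"SLEEP_DURATION": "7-8 hours", "DEPRESSION": 0},
    {"SLEEP_DURATION": "7-8 hours", "DEPRESSION": 1},
]
print(depression_rate_by(rows, "SLEEP_DURATION"))
```

The same helper can be pointed at DIETARY_HABITS or GENDER by changing the `column` argument.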

Results Supported by the Dataset

This dataset drives the following outputs in the pipeline:

Bronze Layer: Raw data ingested into BRONZE_STUDENT_DATA (see schema above).
Silver Layer: Cleaned and standardized data in SILVER_STUDENT_DATA, removing inconsistencies (e.g., handling rare non-student professions).
Gold Layer: Aggregated insights in GOLD_STUDENT_INSIGHTS, such as depression rates by gender and CGPA vs. academic pressure trends.
Visualizations: Plots like depression_rate.png, cgpa_pressure.png, and age_distribution.png in example/.
Prediction Model: A Random Forest Classifier trained on features (e.g., ACADEMIC_PRESSURE, CGPA, SLEEP_DURATION) to predict DEPRESSION, saved as model/depression_model.joblib.

See the root README.md and example/README.md for detailed results and visualizations.
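
To make the Silver and Gold steps more concrete, here is a minimal in-memory sketch. It assumes that "handling" non-student professions means dropping those rows and that the CGPA vs. academic pressure trend is a per-level mean; the actual pipeline scripts may do either differently. `to_silver` and `mean_cgpa_by_pressure` are hypothetical names, and the rows are made up.

```python
from collections import defaultdict
from statistics import mean


def to_silver(bronze_rows):
    """Keep rows whose PROFESSION is "Student", ignoring case and whitespace."""
    silver = []
    for row in bronze_rows:
        profession = row["PROFESSION"].strip()
        if profession.lower() == "student":
            # Normalize the label so downstream grouping sees one spelling.
            silver.append({**row, "PROFESSION": "Student"})
    return silver


def mean_cgpa_by_pressure(silver_rows):
    """Gold-style aggregate: mean CGPA for each ACADEMIC_PRESSURE level."""
    groups = defaultdict(list)
    for row in silver_rows:
        groups[row["ACADEMIC_PRESSURE"]].append(row["CGPA"])
    return {level: mean(values) for level, values in sorted(groups.items())}


# Made-up Bronze rows, not taken from the dataset.
bronze = [
    {"PROFESSION": "Student", "ACADEMIC_PRESSURE": 5, "CGPA": 6.0},
    {"PROFESSION": " student ", "ACADEMIC_PRESSURE": 5, "CGPA": 7.0},
    {"PROFESSION": "Architect", "ACADEMIC_PRESSURE": 0, "CGPA": 0.0},
    {"PROFESSION": "Student", "ACADEMIC_PRESSURE": 2, "CGPA": 9.0},
]
print(mean_cgpa_by_pressure(to_silver(bronze)))
```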

Potential Value-Adding Insights

Based on the dataset’s structure and sample, here are suggestions to enhance its utility:

Feature Enrichment:
- Add Stress Coping Mechanisms (e.g., exercise, meditation) to explore protective factors against depression.
- Include Social Support (e.g., friends, family) as a variable, as it often influences mental health outcomes.
Granularity:
- Break down SLEEP_DURATION into numeric hours (e.g., 4, 6, 8) for finer statistical analysis.
- Categorize DEGREE into levels (e.g., High School, Undergraduate, Postgraduate) for trend analysis.
External Correlation:
- Link city data to socioeconomic indices (e.g., cost of living in Bangalore vs. Srinagar) to assess environmental impact on stress.
- Incorporate academic calendar events (e.g., exam periods) to contextualize ACADEMIC_PRESSURE.
Model Improvement:
- Use additional features like FAMILY_HISTORY and SUICIDAL_THOUGHTS to improve depression prediction accuracy.
- Explore time-series analysis if longitudinal data (e.g., repeated measures per student) could be collected.

These enhancements could deepen insights into student mental health drivers and improve the predictive power of the model, benefiting stakeholders like educators, counselors, and policymakers.
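
The two Granularity suggestions could be sketched as simple lookup tables. The hour values chosen for each SLEEP_DURATION bucket and the degree groupings are assumptions made for illustration (the open-ended buckets have no true midpoint), and the degree mapping covers only the examples named in the schema. All names below are hypothetical.

```python
# Assumed representative hours per SLEEP_DURATION bucket; 4.0 and 9.0 are
# arbitrary stand-ins for the open-ended "less than" and "more than" buckets.
SLEEP_HOURS = {
    "Less than 5 hours": 4.0,
    "5-6 hours": 5.5,
    "7-8 hours": 7.5,
    "More than 8 hours": 9.0,
}

# Partial, assumed mapping covering only the degrees listed in the schema.
DEGREE_LEVEL = {
    "Class 12": "High School",
    "BSc": "Undergraduate",
    "M.Tech": "Postgraduate",
    "PhD": "Postgraduate",
}


def sleep_hours(label):
    """Return numeric hours for a SLEEP_DURATION label, or None if unknown."""
    # Strip whitespace and any stray quote characters around the label.
    return SLEEP_HOURS.get(label.strip().strip("'\""))


def degree_level(degree):
    """Return the coarse level for a DEGREE value, or "Other" if unmapped."""
    return DEGREE_LEVEL.get(degree.strip(), "Other")


print(sleep_hours("5-6 hours"), degree_level("M.Tech"))
```

Returning None or "Other" for unmapped values keeps unexpected categories visible instead of silently assigning them a number or level.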

How to Use

Place student_depression_dataset.csv in this directory (data/).
Run the pipeline starting with ingestion:
- Run `python code/ingest.py` (see [code/ingest.py](../code/ingest.py)) to load the data into Snowflake.
- Follow steps in the root README.md for processing, visualization, and modeling.

Notes

Ethical Use: This dataset involves sensitive mental health information. Ensure compliance with privacy standards (e.g., anonymization) in real-world applications.

For further details, refer to the pipeline scripts in code/ and outputs in example/.
