This directory contains the core Python scripts that implement the data pipeline and machine learning model for the student depression dataset analysis project. These scripts handle data ingestion, processing, visualization, and predictive modeling, interacting with Snowflake for data storage and retrieval.
- `ingest.py`
  - Description: Ingests raw data from `data/student_depression_dataset.csv` into the Snowflake `BRONZE_STUDENT_DATA` table and tracks lineage metadata in `DATA_LINEAGE`.
  - Usage: `python code/ingest.py`
- `process.py`
  - Description: Transforms bronze data into silver (`SILVER_STUDENT_DATA`) and gold (`GOLD_STUDENT_INSIGHTS`) layers in Snowflake with cleaning and aggregation steps.
  - Usage: `python code/process.py`
- `visualize.py`
  - Description: Generates visualizations (bar plots, scatter plots, histograms) from the gold and silver layers, saving them to `example/`.
  - Usage: `python code/visualize.py`
- `model.py`
  - Description: Trains a Random Forest Classifier to predict depression using features from the `SILVER_STUDENT_DATA` table (e.g., age, academic pressure, gender). Saves the trained model to `model/depression_model.joblib`.
  - Usage: `python code/model.py`
  - Output: A trained model file (`model/depression_model.joblib`) and performance metrics logged to `pipeline.log`.
- `config.ini`
  - Description: Configuration file with Snowflake credentials and settings (e.g., user, password, account, database).
  - Note: Ensure this file is populated with valid credentials before running the pipeline.
- `__init__.py`
  - Description: Empty file to make `code/` a Python package, enabling modular imports if needed.
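This README does not show how the scripts read `config.ini`, so the following is only a minimal sketch of one way to load and validate it with the standard-library `configparser`. The `[snowflake]` section name, the default path, and the `load_snowflake_config` helper are assumptions made for illustration; the four required keys are the ones listed in the `config.ini` entry above.

```python
import configparser

# Keys named in this README; the real file may contain more settings.
REQUIRED_KEYS = ("user", "password", "account", "database")


def load_snowflake_config(path="config.ini", section="snowflake"):
    """Return the Snowflake settings as a dict, or raise if they are incomplete.

    Hypothetical helper: the section name and default path are assumptions.
    """
    # interpolation=None keeps '%' characters in passwords from being
    # treated as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read {path}")
    if not parser.has_section(section):
        raise KeyError(f"Missing [{section}] section in {path}")

    settings = dict(parser[section])
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError(
            f"Missing or empty settings in {path}: {', '.join(missing)}"
        )
    return settings
```

Failing early with a clear message is meant to surface an unpopulated `config.ini` before any pipeline step tries to connect to Snowflake.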
The scripts in this directory form a comprehensive data pipeline and analysis system for student depression data:
- Ingestion: Loads raw CSV data into Snowflake’s bronze layer (`ingest.py`).
- Processing: Cleans and aggregates data into silver and gold layers (`process.py`).
- Visualization: Produces visual insights saved in `example/` (`visualize.py`).
- Modeling: Trains a machine learning model to predict depression based on cleaned data (`model.py`).
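To make the bronze, silver, and gold layering above concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for Snowflake. Only the three table names come from this README. The column names, the toy rows, the cleaning rules, and the `build_layers` helper are assumptions for illustration and are not taken from `process.py`.

```python
import sqlite3

# Toy rows invented for illustration; the real CSV schema is not shown here.
# Columns: id, gender, age, academic_pressure, depression (all raw text).
RAW_ROWS = [
    ("1", "Male", "21", "4", "1"),
    ("2", "female ", "19", "2", "0"),  # stray whitespace and casing
    ("3", "Female", None, "5", "1"),   # missing age, dropped in silver
    ("4", "Male", "23", "3", "0"),
]


def build_layers(conn, rows):
    """Create bronze, silver, and gold tables from raw rows (hypothetical helper)."""
    cur = conn.cursor()

    # Bronze: raw data stored as-is, every column kept as text.
    cur.execute(
        "CREATE TABLE BRONZE_STUDENT_DATA "
        "(id TEXT, gender TEXT, age TEXT, academic_pressure TEXT, depression TEXT)"
    )
    cur.executemany("INSERT INTO BRONZE_STUDENT_DATA VALUES (?, ?, ?, ?, ?)", rows)

    # Silver: typed and cleaned (assumed rules: normalize gender, drop rows
    # with a missing age or gender).
    cur.execute(
        """
        CREATE TABLE SILVER_STUDENT_DATA AS
        SELECT CAST(id AS INTEGER)                AS id,
               UPPER(TRIM(gender))                AS gender,
               CAST(age AS INTEGER)               AS age,
               CAST(academic_pressure AS INTEGER) AS academic_pressure,
               CAST(depression AS INTEGER)        AS depression
        FROM BRONZE_STUDENT_DATA
        WHERE age IS NOT NULL AND gender IS NOT NULL
        """
    )

    # Gold: one aggregated row per gender (an assumed example insight).
    cur.execute(
        """
        CREATE TABLE GOLD_STUDENT_INSIGHTS AS
        SELECT gender,
               COUNT(*)        AS students,
               AVG(depression) AS depression_rate
        FROM SILVER_STUDENT_DATA
        GROUP BY gender
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_layers(conn, RAW_ROWS)
insights = conn.execute(
    "SELECT gender, students, depression_rate "
    "FROM GOLD_STUDENT_INSIGHTS ORDER BY gender"
).fetchall()
print(insights)
```

In this sketch the visualization and modeling stages would read from the silver and gold tables rather than from the raw bronze data, which mirrors the flow described in the list above.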
- Setup:
  - Install dependencies: `pip install snowflake-connector-python pandas sqlalchemy snowflake-sqlalchemy matplotlib seaborn scikit-learn joblib`
  - Configure `config.ini` with your Snowflake credentials.
- Run the Pipeline:

      python code/ingest.py     # Ingest raw data
      python code/process.py    # Process data into silver and gold layers
      python code/visualize.py  # Generate visualizations
      python code/model.py      # Train and save the ML model
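Because each step depends on the output of the previous one, it can be convenient to run the four commands from a single script that stops at the first failure. The sketch below is one possible way to do that with the standard-library `subprocess` module. The `run_pipeline` helper and the `PIPELINE_STEPS` name are hypothetical and not part of this project; the script paths are the ones listed above, and the sketch assumes it is run from the project root.

```python
import subprocess
import sys

# Script paths as listed in this README, in execution order.
PIPELINE_STEPS = [
    "code/ingest.py",     # Ingest raw data
    "code/process.py",    # Process data into silver and gold layers
    "code/visualize.py",  # Generate visualizations
    "code/model.py",      # Train and save the ML model
]


def run_pipeline(steps=PIPELINE_STEPS, python=sys.executable):
    """Run each script in order and return the list of completed steps.

    Hypothetical helper. Raises subprocess.CalledProcessError as soon as a
    script exits with a non-zero status, so later steps never run on top of
    a failed earlier step.
    """
    completed = []
    for script in steps:
        print(f"Running {script} ...")
        # check=True turns a non-zero exit code into an exception.
        subprocess.run([python, script], check=True)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the project root runs all four steps. Catching `subprocess.CalledProcessError` around the call gives access to the failing command and its exit code through the exception's `cmd` and `returncode` attributes.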