Code Directory

This directory contains the core Python scripts that implement the data pipeline and machine learning model for the student depression dataset analysis project. The scripts handle data ingestion, processing, visualization, and predictive modeling, and use Snowflake for data storage and retrieval.

Contents

  • ingest.py

    • Description: Ingests raw data from data/student_depression_dataset.csv into the Snowflake BRONZE_STUDENT_DATA table and tracks lineage metadata in DATA_LINEAGE.
    • Usage: python code/ingest.py
  • process.py

    • Description: Transforms bronze data into silver (SILVER_STUDENT_DATA) and gold (GOLD_STUDENT_INSIGHTS) layers in Snowflake with cleaning and aggregation steps.
    • Usage: python code/process.py
  • visualize.py

    • Description: Generates visualizations (bar plots, scatter plots, histograms) from the gold and silver layers, saving them to example/.
    • Usage: python code/visualize.py
  • model.py

    • Description: Trains a Random Forest Classifier to predict depression using features from the SILVER_STUDENT_DATA table (e.g., age, academic pressure, gender). Saves the trained model to model/depression_model.joblib.
    • Usage: python code/model.py
    • Output: A trained model file (model/depression_model.joblib) and performance metrics logged to pipeline.log.
  • config.ini

    • Description: Configuration file with Snowflake credentials and settings (e.g., user, password, account, database).
    • Note: Ensure this file is populated with valid credentials before running the pipeline.
  • __init__.py

    • Description: Empty file to make code/ a Python package, enabling modular imports if needed.
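
A minimal sketch of how a script might read config.ini, shown only to illustrate the file layout. It assumes a hypothetical [snowflake] section holding the keys named above (user, password, account, database); the section name and the helper load_snowflake_settings are illustrative and may not match what the scripts actually do.

```python
import configparser

# Keys the README lists for config.ini.
REQUIRED_KEYS = ("user", "password", "account", "database")


def load_snowflake_settings(path="code/config.ini", section="snowflake"):
    """Return the connection settings from config.ini as a plain dict.

    The [snowflake] section name is an assumption, not taken from the project.
    """
    # interpolation=None keeps a literal '%' in a password from being
    # treated as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(f"Config file not found: {path}")
    if section not in parser:
        raise KeyError(f"Missing [{section}] section in {path}")

    settings = dict(parser[section])
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError(f"Missing config values: {', '.join(missing)}")
    return settings
```

The returned dict could then be passed as keyword arguments to snowflake.connector.connect, which accepts user, password, account, and database. Failing early on a missing key gives a clearer error than a rejected connection attempt.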

Project Overview

The scripts in this directory form a four-stage pipeline for the student depression data, with each stage reading the output of the one before it:

  1. Ingestion: Loads raw CSV data into Snowflake’s bronze layer (ingest.py).
  2. Processing: Cleans and aggregates data into silver and gold layers (process.py).
  3. Visualization: Produces visual insights saved in example/ (visualize.py).
  4. Modeling: Trains a machine learning model to predict depression based on cleaned data (model.py).
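
The bronze, silver, and gold layering described above can be sketched end to end. The example below uses an in-memory SQLite database as a stand-in for Snowflake so that it runs without credentials. The table names come from this README, but the column names (age, gender, academic_pressure, depression), the lineage columns, and the specific cleaning and aggregation rules are assumptions chosen for illustration, not the project's actual logic.

```python
import sqlite3
from datetime import datetime, timezone


def build_layers(conn, raw_rows, source="data/student_depression_dataset.csv"):
    """Load raw rows into bronze, then derive silver and gold tables.

    raw_rows is a list of (age, gender, academic_pressure, depression)
    tuples of strings, as they might arrive from a CSV file.
    """
    cur = conn.cursor()

    # Bronze: store the raw values untouched, as text.
    cur.execute(
        "CREATE TABLE BRONZE_STUDENT_DATA "
        "(age TEXT, gender TEXT, academic_pressure TEXT, depression TEXT)"
    )
    cur.executemany("INSERT INTO BRONZE_STUDENT_DATA VALUES (?, ?, ?, ?)", raw_rows)

    # Lineage: record where the bronze rows came from (hypothetical columns).
    cur.execute(
        "CREATE TABLE DATA_LINEAGE "
        "(source TEXT, target_table TEXT, row_count INTEGER, loaded_at TEXT)"
    )
    cur.execute(
        "INSERT INTO DATA_LINEAGE VALUES (?, ?, ?, ?)",
        (source, "BRONZE_STUDENT_DATA", len(raw_rows),
         datetime.now(timezone.utc).isoformat()),
    )

    # Silver: cast to proper types, trim whitespace, drop incomplete rows.
    cur.execute("""
        CREATE TABLE SILVER_STUDENT_DATA AS
        SELECT CAST(age AS INTEGER)            AS age,
               TRIM(gender)                    AS gender,
               CAST(academic_pressure AS REAL) AS academic_pressure,
               CAST(depression AS INTEGER)     AS depression
        FROM BRONZE_STUDENT_DATA
        WHERE TRIM(age) <> ''
          AND TRIM(gender) <> ''
          AND TRIM(academic_pressure) <> ''
          AND depression IN ('0', '1')
    """)

    # Gold: aggregate the cleaned rows into per-group insights.
    cur.execute("""
        CREATE TABLE GOLD_STUDENT_INSIGHTS AS
        SELECT gender,
               COUNT(*)               AS student_count,
               AVG(depression)        AS depression_rate,
               AVG(academic_pressure) AS avg_academic_pressure
        FROM SILVER_STUDENT_DATA
        GROUP BY gender
    """)
    conn.commit()
```

Calling build_layers(sqlite3.connect(":memory:"), rows) creates all four tables in one pass. Keeping bronze as untyped text means a malformed value never blocks ingestion; it is filtered out at the silver step instead.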

Instructions

  1. Setup:

    • Install dependencies:
      pip install snowflake-connector-python pandas sqlalchemy snowflake-sqlalchemy matplotlib seaborn scikit-learn joblib
    • Configure config.ini with your Snowflake credentials.
  2. Run the Pipeline (from the repository root, in this order):

    python code/ingest.py    # Ingest raw data
    python code/process.py   # Process data into silver and gold layers
    python code/visualize.py # Generate visualizations
    python code/model.py     # Train and save the ML model
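
Because each step depends on the output of the previous one, it can be convenient to run all four commands from a single entry point that stops at the first failure. The following is a hedged sketch of such a runner; run_pipeline and PIPELINE_STEPS are hypothetical names and no such script is part of this directory.

```python
import subprocess
import sys

# Script order taken from the instructions above.
PIPELINE_STEPS = [
    "code/ingest.py",
    "code/process.py",
    "code/visualize.py",
    "code/model.py",
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each script in order with the current interpreter.

    Raises RuntimeError on the first non-zero exit code so that later
    steps never run against missing or partial upstream data.
    The runner parameter exists only to make the function easy to test.
    """
    completed = []
    for script in steps:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            raise RuntimeError(f"{script} exited with code {result.returncode}")
        completed.append(script)
    return completed
```

Calling run_pipeline() from the repository root would execute the four scripts in sequence. Using sys.executable keeps every step in the same Python environment, which matters when the dependencies are installed in a virtual environment.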