
Uni_crawler is a Python + Dagster pipeline that scrapes and cleans academic program data from universities, ensuring structured, high-quality outputs for easier research and analysis.


uni_crawler banner

uni_crawler

📖 Overview

uni_crawler automates the collection, validation, and export of university program data, including Program Educational Objectives (PEOs), from institutional websites. The goal is to make it easier for prospective students to compare offerings across schools and for analysts to work with structured, validated data.

Built on Dagster, the project follows a clean ETL design with clear separation of concerns:

  • Extract: scrape program and PEO information from university pages
  • Transform: normalize and clean text & schema
  • Validate: enforce schema, types, completeness, and deduplicate
  • Load (export): write tidy CSVs for analysis or future database loading
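As a minimal sketch of the Transform and Validate stages (the column names and cleaning rules here are illustrative, not the repo's actual code), the pandas logic might look like:

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Normalize text: trim and collapse internal whitespace."""
    df = raw.copy()
    df["program"] = df["program"].str.strip().str.replace(r"\s+", " ", regex=True)
    return df

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Enforce completeness on required columns, then deduplicate."""
    required = ["university", "program"]
    df = df.dropna(subset=required)
    return df.drop_duplicates(subset=required).reset_index(drop=True)

raw = pd.DataFrame({
    "university": ["MSEUF", "MSEUF", "CEFI"],
    "program": ["  BS Computer   Science", "BS Computer Science", None],
})
# Messy duplicate collapses and the incomplete row is dropped
tidy = validate(transform(raw))
```

In the pipeline itself, each stage is a separate Dagster asset, so failures and data quality issues surface per stage in the Dagster UI.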

For a more detailed write-up of the project background and design choices, see the full documentation.

✨ Key Features

  • Two institutional sources supported initially: MSEUF and CEFI
  • Robust data quality checks: missing values, type validation, uniqueness, schema completeness
  • Modular Dagster assets: scraping, validation, and exporting are orchestrated independently
  • Reproducible environment: Python venv and requirements.txt
  • Monthly schedule (configurable): aligns with low-volatility academic updates

🛠️ Tech Stack

  • Language: Python
  • Web Scraping: BeautifulSoup4
  • Data Manipulation: Pandas
  • ETL Orchestration: Dagster
  • Version Control: Git/GitHub

👋 Prerequisites

To set up this project locally, ensure you have:

| Requirement | Version |
| ----------- | ------- |
| Python      | 3.11+   |
| Git         | 2.30+   |

🏁 Getting Started

The following is a quick step-by-step guide to running the project from the Windows Command Prompt.

REM 1) Clone the repo (pick one)
git clone git@github.com:KubangPawis/uni-crawler.git
git clone https://github.com/KubangPawis/uni-crawler.git

REM 2) Go into the project folder
cd /d C:\path\to\uni_crawler

REM 3) Create and activate a Python 3.11 virtual environment
py -3.11 -m venv venv
venv\Scripts\activate.bat

REM 4) Install dependencies
python -m pip install -U pip
python -m pip install -r requirements.txt

REM 5) Launch Dagster's dev UI
dagster dev

🪙 Data Sources

Initial institutional sources:

  • Manuel S. Enverga University Foundation (MSEUF)
  • Calayan Education Foundation Inc. (CEFI)

These sites were chosen for their structured presentation of academic program information and breadth of programs.

⚙️ Pipeline

Below is an illustration of the project pipeline:

uni_crawler - Pipeline

🔍 Pipeline Definition (Dagster)

The image below represents the pipeline job and definition managed through Dagster:

uni_crawler - Dagster Pipeline Definition

🎯 Outputs

Expected CSVs after a successful run:

  • univ_programs_data.csv – 5 columns (e.g., 135 records in a reference run)

  • univ_programs_peo_data.csv – 3 columns (e.g., 469 records in a reference run)

Files are written to the project’s data/output directory (check your repo paths/config).
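Assuming the two CSVs share a program key (the actual exported schemas may differ, so check the headers first), a typical downstream analysis joins each program to its PEOs:

```python
import pandas as pd

# Hypothetical column names and values for illustration only.
programs = pd.DataFrame({
    "program_id": [1, 2],
    "program": ["BS Computer Science", "BS Nursing"],
})
peos = pd.DataFrame({
    "program_id": [1, 1, 2],
    "peo": ["Apply computing theory", "Act ethically", "Deliver safe care"],
})

# One row per (program, PEO) pair, as in univ_programs_peo_data.csv
merged = programs.merge(peos, on="program_id", how="left")
```

A left join keeps programs that have no PEOs yet, which makes missing PEO coverage easy to spot.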

✅ Ethical Web Scraping

uni_crawler is designed to follow ethical scraping practices:

  • Respect robots.txt and Terms of Service
  • Throttle requests (delays/backoff)
  • Collect only non-personal, publicly available information
  • Keep request volume reasonable
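The robots.txt check and backoff throttling above can be sketched with the standard library alone (the user-agent string and backoff parameters are assumptions, not the crawler's actual settings):

```python
import urllib.robotparser

def fetch_allowed(robots_txt: str, url: str, agent: str = "uni_crawler") -> bool:
    """Parse a robots.txt body and check whether `agent` may fetch `url`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

def backoff_delays(base: float = 1.0, retries: int = 3) -> list[float]:
    """Exponential backoff between retries: 1s, 2s, 4s with the defaults."""
    return [base * (2 ** i) for i in range(retries)]

robots = "User-agent: *\nDisallow: /admin/\n"
```

In practice the crawler would call `fetch_allowed` before every request and sleep for the next `backoff_delays` value whenever a request fails.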

Scraped page patterns (examples):

  • MSEUF: /programs/, /programs/{campus_name}/{program_name}
  • CEFI: /programs/, /{department_name}
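As an illustration of scraping one of these listing pages with BeautifulSoup4 (the HTML below is invented; the real markup on either site will differ):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking a program listing page.
html = """
<ul class="programs">
  <li><a href="/programs/main/bs-computer-science">BS Computer Science</a></li>
  <li><a href="/programs/main/bs-nursing">BS Nursing</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Extract (name, relative URL) pairs for each program link
programs = [(a.get_text(strip=True), a["href"])
            for a in soup.select("ul.programs a")]
```

Each relative URL would then be joined against the site's base URL to follow the per-program pattern shown above.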

Please review individual site terms before adding new sources.

🛣️ Roadmap

  • Add more institutions/sources
  • Implement database load (PostgreSQL/MySQL)
  • Add CLI flags/config file for source selection and output paths
  • Publish a small API or dashboard for browsing programs/PEOs
  • CI checks for linting, type checks, and data quality gates

💁 Contributing

Contributions are welcome!

  1. Fork the repo
  2. Create a feature branch: git checkout -b feat/my-change
  3. Commit with clear messages
  4. Open a Pull Request describing your change and its rationale

👍 Acknowledgments

  • Built with Dagster, BeautifulSoup4, and Pandas
  • Thanks to the institutions whose publicly available pages make student research easier
